2015-07-15 07:56:11

by Yuyang Du

[permalink] [raw]
Subject: [PATCH v10 0/7] sched: Rewrite runnable load and utilization average tracking

Hi Peter and Ingo,

Changes are made to the 10th version:

1) Include Vincent's fix to update blocked load when CONFIG_FAIR_GROUP_SCHED=n
2) Get runnable_load_avg back
3) Clean up the references to load averages a little bit

Thanks a lot to Dietmar for his review, tests, and suggestions.
Thanks a lot for the thoughts and suggestions from Morten and Peter.

The discussion we made is really inspiring...

Thanks,
Yuyang

v9 changes:

1) util_avg fix
2) update_tg_load_avg() in update_blocked_averages(), because
cfs_rq's load_avg is updated
3) Add some debug print

v8 changes:

1) Rebase to the latest tip tree
2) scale_load_down the weight when doing the averages
3) change util_sum to u32

Thanks a lot for Ben's comments, which lead to this version.
Thanks to Vincent for review.

v7 changes:

The 7th version mostly is to accommodate the utilization load average recently
merged into kernel. The general idea is as well to update the cfs_rq as a whole
as opposed to only updating an entity at a time and update the cfs_rq with the
only updated entity.

1) Rename utilization_load_avg to util_avg to be concise and meaningful

2) To track the cfs_rq util_avg, simply use "cfs_rq->curr != NULL" as the
predicate. This should be equivalent to but simpler than aggregating each
individual child sched_entity's util_avg when "cfs_rq->curr == se". Because
if cfs_rq->curr != NULL, the cfs_rq->curr has to be some se.

3) Remove se's util_avg from its cfs_rq's when migrating it, this was already
proposed by Morten and patches sent

3) The group entity's load average is initiated when the entity is created

4) Small nits: the entity's util_avg is removed from switched_from_fair()
and task_move_group_fair().

Thanks a lot for Vincent and Morten's help for the 7th version.

Thanks,
Yuyang

v6 changes:

Many thanks to PeterZ for his review, to Dietmar, and to Fengguang for 0Day and LKP.

Rebased on v3.18-rc2.

- Unify decay_load 32 and 64 bits by mul_u64_u32_shr
- Add force option in update_tg_load_avg
- Read real-time cfs's load_avg for calc_tg_weight
- Have tg_load_avg_contrib ifdef CONFIG_FAIR_GROUP_SCHED
- Bug fix

v5 changes:

Thank Peter intensively for reviewing this patchset in detail and all his comments.
And Mike for general and cgroup pipe-test. Morten, Ben, and Vincent in the discussion.

- Remove dead task and task group load_avg
- Do not update trivial delta to task_group load_avg (threshold 1/64 old_contrib)
- mul_u64_u32_shr() is used in decay_load, so on 64bit, load_sum can afford
about 4353082796 (=2^64/47742/88761) entities with the highest weight (=88761)
always runnable, greater than previous theoretical maximum 132845
- Various code efficiency and style changes

We carried out some performance tests (thanks to Fengguang and his LKP). The results
are shown as follows. The patchset (including threepatches) is on top of mainline
v3.16-rc5. We may report more perf numbers later.

Overall, this rewrite has better performance, and reduced net overhead in load
average tracking, flat efficiency in multi-layer cgroup pipe-test.

v4 changes:

Thanks to Morten, Ben, and Fengguang for v4 revision.

- Insert memory barrier before writing cfs_rq->load_last_update_copy.
- Fix typos.

v3 changes:

Many thanks to Ben for v3 revision.

Regarding the overflow issue, we now have for both entity and cfs_rq:

struct sched_avg {
.....
u64 load_sum;
unsigned long load_avg;
.....
};

Given the weight for both entity and cfs_rq is:

struct load_weight {
unsigned long weight;
.....
};

So, load_sum's max is 47742 * load.weight (which is unsigned long), then on 32bit,
it is absolutly safe. On 64bit, with unsigned long being 64bit, but we can afford
about 4353082796 (=2^64/47742/88761) entities with the highest weight (=88761)
always runnable, even considering we may multiply 1<<15 in decay_load64, we can
still support 132845 (=4353082796/2^15) always runnable, which should be acceptible.

load_avg = load_sum / 47742 = load.weight (which is unsigned long), so it should be
perfectly safe for both entity (even with arbitrary user group share) and cfs_rq on
both 32bit and 64bit. Originally, we saved this division, but have to get it back
because of the overflow issue on 32bit (actually load average itself is safe from
overflow, but the rest of the code referencing it always uses long, such as cpu_load,
etc., which prevents it from saving).

- Fix overflow issue both for entity and cfs_rq on both 32bit and 64bit.
- Track all entities (both task and group entity) due to group entity's clock issue.
This actually improves code simplicity.
- Make a copy of cfs_rq sched_avg's last_update_time, to read an intact 64bit
variable on 32bit machine when in data race (hope I did it right).
- Minor fixes and code improvement.

v2 changes:

Thanks to PeterZ and Ben for their help in fixing the issues and improving
the quality, and Fengguang and his 0Day in finding compile errors in different
configurations for version 2.

- Batch update the tg->load_avg, making sure it is up-to-date before update_cfs_shares
- Remove migrating task from the old CPU/cfs_rq, and do so with atomic operations

Yuyang Du (7):
sched: Remove rq's runnable avg
sched: Rewrite runnable load and utilization average tracking
sched: Implement update_blocked_averages() for
CONFIG_FAIR_GROUP_SCHED=n
sched: Init cfs_rq's sched_entity load average
sched: Remove task and group entity load when they are dead
sched: Provide runnable_load_avg back to cfs_rq
sched: Clean up load average references

include/linux/sched.h | 41 ++-
kernel/sched/core.c | 5 +-
kernel/sched/debug.c | 48 ++--
kernel/sched/fair.c | 736 +++++++++++++++++++-------------------------------
kernel/sched/sched.h | 34 +--
5 files changed, 334 insertions(+), 530 deletions(-)

--
2.1.4


2015-07-15 07:56:14

by Yuyang Du

[permalink] [raw]
Subject: [PATCH v10 1/7] sched: Remove rq's runnable avg

The current rq->avg is not used at all since its merge into kernel,
and the code is in the scheduler's hot path, so remove it.

Signed-off-by: Yuyang Du <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Tested-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/debug.c | 7 +------
kernel/sched/fair.c | 25 ++++---------------------
kernel/sched/sched.h | 2 --
3 files changed, 5 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f94724e..ca39cb7 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -68,13 +68,8 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
#define PN(F) \
SEQ_printf(m, " .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)F))

- if (!se) {
- struct sched_avg *avg = &cpu_rq(cpu)->avg;
- P(avg->runnable_avg_sum);
- P(avg->avg_period);
+ if (!se)
return;
- }
-

PN(se->exec_start);
PN(se->vruntime);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40a7fcb..7922532 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2728,19 +2728,12 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
}
}

-static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
-{
- __update_entity_runnable_avg(rq_clock_task(rq), cpu_of(rq), &rq->avg,
- runnable, runnable);
- __update_tg_runnable_avg(&rq->avg, &rq->cfs);
-}
#else /* CONFIG_FAIR_GROUP_SCHED */
static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
int force_update) {}
static inline void __update_tg_runnable_avg(struct sched_avg *sa,
struct cfs_rq *cfs_rq) {}
static inline void __update_group_entity_contrib(struct sched_entity *se) {}
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
#endif /* CONFIG_FAIR_GROUP_SCHED */

static inline void __update_task_entity_contrib(struct sched_entity *se)
@@ -2944,7 +2937,6 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
*/
void idle_enter_fair(struct rq *this_rq)
{
- update_rq_runnable_avg(this_rq, 1);
}

/*
@@ -2954,7 +2946,6 @@ void idle_enter_fair(struct rq *this_rq)
*/
void idle_exit_fair(struct rq *this_rq)
{
- update_rq_runnable_avg(this_rq, 0);
}

static int idle_balance(struct rq *this_rq);
@@ -2963,7 +2954,6 @@ static int idle_balance(struct rq *this_rq);

static inline void update_entity_load_avg(struct sched_entity *se,
int update_cfs_rq) {}
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
struct sched_entity *se,
int wakeup) {}
@@ -4262,10 +4252,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_entity_load_avg(se, 1);
}

- if (!se) {
- update_rq_runnable_avg(rq, rq->nr_running);
+ if (!se)
add_nr_running(rq, 1);
- }
+
hrtick_update(rq);
}

@@ -4323,10 +4312,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_entity_load_avg(se, 1);
}

- if (!se) {
+ if (!se)
sub_nr_running(rq, 1);
- update_rq_runnable_avg(rq, 1);
- }
+
hrtick_update(rq);
}

@@ -6034,9 +6022,6 @@ static void __update_blocked_averages_cpu(struct task_group *tg, int cpu)
*/
if (!se->avg.runnable_avg_sum && !cfs_rq->nr_running)
list_del_leaf_cfs_rq(cfs_rq);
- } else {
- struct rq *rq = rq_of(cfs_rq);
- update_rq_runnable_avg(rq, rq->nr_running);
}
}

@@ -8020,8 +8005,6 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

if (numabalancing_enabled)
task_tick_numa(rq, curr);
-
- update_rq_runnable_avg(rq, 1);
}

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f10a445..d465a5c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -595,8 +595,6 @@ struct rq {
#ifdef CONFIG_FAIR_GROUP_SCHED
/* list of leaf cfs_rq on this cpu: */
struct list_head leaf_cfs_rq_list;
-
- struct sched_avg avg;
#endif /* CONFIG_FAIR_GROUP_SCHED */

/*
--
2.1.4

2015-07-15 07:56:19

by Yuyang Du

[permalink] [raw]
Subject: [PATCH v10 2/7] sched: Rewrite runnable load and utilization average tracking

The idea of runnable load average (let runnable time contribute to weight)
was proposed by Paul Turner and Ben Segall, and it is still followed by
this rewrite. This rewrite aims to solve the following issues:

1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is
updated at the granularity of an entity at a time, which results in the
cfs_rq's load average is stale or partially updated: at any time, only
one entity is up to date, all other entities are effectively lagging
behind. This is undesirable.

To illustrate, if we have n runnable entities in the cfs_rq, as time
elapses, they certainly become outdated:

t0: cfs_rq { e1_old, e2_old, ..., en_old }

and when we update:

t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old }

t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old }

...

We solve this by combining all runnable entities' load averages together
in cfs_rq's avg, and update the cfs_rq's avg as a whole. This is based
on the fact that if we regard the update as a function, then:

w * update(e) = update(w * e) and

update(e1) + update(e2) = update(e1 + e2), then

w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2)

therefore, by this rewrite, we have an entirely updated cfs_rq at the
time we update it:

t1: update cfs_rq { e1_new, e2_new, ..., en_new }

t2: update cfs_rq { e1_new, e2_new, ..., en_new }

...

2. cfs_rq's load average is different between top rq->cfs_rq and other
task_group's per CPU cfs_rqs in whether or not blocked_load_average
contributes to the load.

The basic idea behind runnable load average (the same for utilization)
is that the blocked state is taken into account as opposed to only
accounting for the currently runnable state. Therefore, the average
should include both the runnable/running and blocked load averages.
This rewrite does that.

In addition, we also combine runnable/running and blocked averages
of all entities into the cfs_rq's average, and update it together at
once. This is based on the fact that:

update(runnable) + update(blocked) = update(runnable + blocked)

This significantly reduces the codes as we don't need to separately
maintain/update runnable/running load and blocked load.

3. How task_group entities' share is calculated is complex and imprecise.

We reduce the complexity in this rewrite to allow a very simple rule:
the task_group's load_avg is aggregated from its per CPU cfs_rqs's
load_avgs. Then group entity's weight is simply proportional to its
own cfs_rq's load_avg / task_group's load_avg. To illustrate,

if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then,

task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then

cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share

To sum up, this rewrite in principle is equivalent to the current one, but
fixes the issues described above. Turns out, it significantly reduces the
code complexity and hence increases clarity and efficiency. In addition,
the new averages are more smooth/continuous (no spurious spikes and valleys)
and updated more consistently and quickly to reflect the load dynamics. As a
result, we have less load tracking overhead, better performance, and
especially better power efficiency due to more balanced load.

Signed-off-by: Yuyang Du <[email protected]>
---
include/linux/sched.h | 41 ++--
kernel/sched/core.c | 3 -
kernel/sched/debug.c | 41 ++--
kernel/sched/fair.c | 630 +++++++++++++++++---------------------------------
kernel/sched/sched.h | 28 +--
5 files changed, 250 insertions(+), 493 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index af0eeba..5cc071b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1183,29 +1183,24 @@ struct load_weight {
u32 inv_weight;
};

+/*
+ * The load_avg/util_avg accumulates an infinite geometric series.
+ * 1) load_avg factors the amount of time that a sched_entity is
+ * runnable on a rq into its weight. For cfs_rq, it is the aggregated
+ * such weights of all runnable and blocked sched_entities.
+ * 2) util_avg factors frequency scaling into the amount of time
+ * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
+ * For cfs_rq, it is the aggregated such times of all runnable and
+ * blocked sched_entities.
+ * The 64 bit load_sum can:
+ * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with
+ * the highest weight (=88761) always runnable, we should not overflow
+ * 2) for entity, support any load.weight always runnable
+ */
struct sched_avg {
- u64 last_runnable_update;
- s64 decay_count;
- /*
- * utilization_avg_contrib describes the amount of time that a
- * sched_entity is running on a CPU. It is based on running_avg_sum
- * and is scaled in the range [0..SCHED_LOAD_SCALE].
- * load_avg_contrib described the amount of time that a sched_entity
- * is runnable on a rq. It is based on both runnable_avg_sum and the
- * weight of the task.
- */
- unsigned long load_avg_contrib, utilization_avg_contrib;
- /*
- * These sums represent an infinite geometric series and so are bound
- * above by 1024/(1-y). Thus we only need a u32 to store them for all
- * choices of y < 1-2^(-32)*1024.
- * running_avg_sum reflects the time that the sched_entity is
- * effectively running on the CPU.
- * runnable_avg_sum represents the amount of time a sched_entity is on
- * a runqueue which includes the running time that is monitored by
- * running_avg_sum.
- */
- u32 runnable_avg_sum, avg_period, running_avg_sum;
+ u64 last_update_time, load_sum;
+ u32 util_sum, period_contrib;
+ unsigned long load_avg, util_avg;
};

#ifdef CONFIG_SCHEDSTATS
@@ -1271,7 +1266,7 @@ struct sched_entity {
#endif

#ifdef CONFIG_SMP
- /* Per-entity load-tracking */
+ /* Per entity load average tracking */
struct sched_avg avg;
#endif
};
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d5078c0..4dfab27 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1828,9 +1828,6 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.prev_sum_exec_runtime = 0;
p->se.nr_migrations = 0;
p->se.vruntime = 0;
-#ifdef CONFIG_SMP
- p->se.avg.decay_count = 0;
-#endif
INIT_LIST_HEAD(&p->se.group_node);

#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index ca39cb7..56d83f3 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -88,12 +88,8 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
#endif
P(se->load.weight);
#ifdef CONFIG_SMP
- P(se->avg.runnable_avg_sum);
- P(se->avg.running_avg_sum);
- P(se->avg.avg_period);
- P(se->avg.load_avg_contrib);
- P(se->avg.utilization_avg_contrib);
- P(se->avg.decay_count);
+ P(se->avg.load_avg);
+ P(se->avg.util_avg);
#endif
#undef PN
#undef P
@@ -207,21 +203,19 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SEQ_printf(m, " .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
SEQ_printf(m, " .%-30s: %ld\n", "load", cfs_rq->load.weight);
#ifdef CONFIG_SMP
- SEQ_printf(m, " .%-30s: %ld\n", "runnable_load_avg",
- cfs_rq->runnable_load_avg);
- SEQ_printf(m, " .%-30s: %ld\n", "blocked_load_avg",
- cfs_rq->blocked_load_avg);
- SEQ_printf(m, " .%-30s: %ld\n", "utilization_load_avg",
- cfs_rq->utilization_load_avg);
+ SEQ_printf(m, " .%-30s: %lu\n", "load_avg",
+ cfs_rq->avg.load_avg);
+ SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
+ cfs_rq->avg.util_avg);
+ SEQ_printf(m, " .%-30s: %ld\n", "removed_load_avg",
+ atomic_long_read(&cfs_rq->removed_load_avg));
+ SEQ_printf(m, " .%-30s: %ld\n", "removed_util_avg",
+ atomic_long_read(&cfs_rq->removed_util_avg));
#ifdef CONFIG_FAIR_GROUP_SCHED
- SEQ_printf(m, " .%-30s: %ld\n", "tg_load_contrib",
- cfs_rq->tg_load_contrib);
- SEQ_printf(m, " .%-30s: %d\n", "tg_runnable_contrib",
- cfs_rq->tg_runnable_contrib);
+ SEQ_printf(m, " .%-30s: %lu\n", "tg_load_avg_contrib",
+ cfs_rq->tg_load_avg_contrib);
SEQ_printf(m, " .%-30s: %ld\n", "tg_load_avg",
atomic_long_read(&cfs_rq->tg->load_avg));
- SEQ_printf(m, " .%-30s: %d\n", "tg->runnable_avg",
- atomic_read(&cfs_rq->tg->runnable_avg));
#endif
#endif
#ifdef CONFIG_CFS_BANDWIDTH
@@ -632,12 +626,11 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)

P(se.load.weight);
#ifdef CONFIG_SMP
- P(se.avg.runnable_avg_sum);
- P(se.avg.running_avg_sum);
- P(se.avg.avg_period);
- P(se.avg.load_avg_contrib);
- P(se.avg.utilization_avg_contrib);
- P(se.avg.decay_count);
+ P(se.avg.load_sum);
+ P(se.avg.util_sum);
+ P(se.avg.load_avg);
+ P(se.avg.util_avg);
+ P(se.avg.last_update_time);
#endif
P(policy);
P(prio);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7922532..452c932 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -283,9 +283,6 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
return grp->my_q;
}

-static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
- int force_update);
-
static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
{
if (!cfs_rq->on_list) {
@@ -305,8 +302,6 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
}

cfs_rq->on_list = 1;
- /* We should have no load, but we need to update last_decay. */
- update_cfs_rq_blocked_load(cfs_rq, 0);
}
}

@@ -669,19 +664,31 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
static int select_idle_sibling(struct task_struct *p, int cpu);
static unsigned long task_h_load(struct task_struct *p);

-static inline void __update_task_entity_contrib(struct sched_entity *se);
-static inline void __update_task_entity_utilization(struct sched_entity *se);
+/*
+ * We choose a half-life close to 1 scheduling period.
+ * Note: The tables below are dependent on this value.
+ */
+#define LOAD_AVG_PERIOD 32
+#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
+#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_MAX_AVG */

/* Give new task start runnable values to heavy its load in infant time */
void init_task_runnable_average(struct task_struct *p)
{
- u32 slice;
+ struct sched_avg *sa = &p->se.avg;

- slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
- p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice;
- p->se.avg.avg_period = slice;
- __update_task_entity_contrib(&p->se);
- __update_task_entity_utilization(&p->se);
+ sa->last_update_time = 0;
+ /*
+ * sched_avg's period_contrib should be strictly less then 1024, so
+ * we give it 1023 to make sure it is almost a period (1024us), and
+ * will definitely be update (after enqueue).
+ */
+ sa->period_contrib = 1023;
+ sa->load_avg = scale_load_down(p->se.load.weight);
+ sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
+ sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
+ sa->util_sum = LOAD_AVG_MAX;
+ /* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
}
#else
void init_task_runnable_average(struct task_struct *p)
@@ -1702,8 +1709,8 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
delta = runtime - p->last_sum_exec_runtime;
*period = now - p->last_task_numa_placement;
} else {
- delta = p->se.avg.runnable_avg_sum;
- *period = p->se.avg.avg_period;
+ delta = p->se.avg.load_sum / p->se.load.weight;
+ *period = LOAD_AVG_MAX;
}

p->last_sum_exec_runtime = runtime;
@@ -2351,13 +2358,13 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
long tg_weight;

/*
- * Use this CPU's actual weight instead of the last load_contribution
- * to gain a more accurate current total weight. See
- * update_cfs_rq_load_contribution().
+ * Use this CPU's real-time load instead of the last load contribution
+ * as the updating of the contribution is delayed, and we will use the
+ * the real-time load to calc the share. See update_tg_load_avg().
*/
tg_weight = atomic_long_read(&tg->load_avg);
- tg_weight -= cfs_rq->tg_load_contrib;
- tg_weight += cfs_rq->load.weight;
+ tg_weight -= cfs_rq->tg_load_avg_contrib;
+ tg_weight += cfs_rq->avg.load_avg;

return tg_weight;
}
@@ -2367,7 +2374,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
long tg_weight, load, shares;

tg_weight = calc_tg_weight(tg, cfs_rq);
- load = cfs_rq->load.weight;
+ load = cfs_rq->avg.load_avg;

shares = (tg->shares * load);
if (tg_weight)
@@ -2429,14 +2436,6 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
#endif /* CONFIG_FAIR_GROUP_SCHED */

#ifdef CONFIG_SMP
-/*
- * We choose a half-life close to 1 scheduling period.
- * Note: The tables below are dependent on this value.
- */
-#define LOAD_AVG_PERIOD 32
-#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
-#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_MAX_AVG */
-
/* Precomputed fixed inverse multiplies for multiplication by y^n */
static const u32 runnable_avg_yN_inv[] = {
0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6,
@@ -2485,9 +2484,8 @@ static __always_inline u64 decay_load(u64 val, u64 n)
local_n %= LOAD_AVG_PERIOD;
}

- val *= runnable_avg_yN_inv[local_n];
- /* We don't use SRR here since we always want to round down. */
- return val >> 32;
+ val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
+ return val;
}

/*
@@ -2546,23 +2544,23 @@ static u32 __compute_runnable_contrib(u64 n)
* load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
* = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
*/
-static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
- struct sched_avg *sa,
- int runnable,
- int running)
+static __always_inline int __update_load_avg(u64 now, int cpu,
+ struct sched_avg *sa,
+ unsigned long weight,
+ int running)
{
u64 delta, periods;
- u32 runnable_contrib;
+ u32 contrib;
int delta_w, decayed = 0;
unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);

- delta = now - sa->last_runnable_update;
+ delta = now - sa->last_update_time;
/*
* This should only happen when time goes backwards, which it
* unfortunately does during sched clock init when we swap over to TSC.
*/
if ((s64)delta < 0) {
- sa->last_runnable_update = now;
+ sa->last_update_time = now;
return 0;
}

@@ -2573,26 +2571,26 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
delta >>= 10;
if (!delta)
return 0;
- sa->last_runnable_update = now;
+ sa->last_update_time = now;

/* delta_w is the amount already accumulated against our next period */
- delta_w = sa->avg_period % 1024;
+ delta_w = sa->period_contrib;
if (delta + delta_w >= 1024) {
- /* period roll-over */
decayed = 1;

+ /* how much left for next period will start over, we don't know yet */
+ sa->period_contrib = 0;
+
/*
* Now that we know we're crossing a period boundary, figure
* out how much from delta we need to complete the current
* period and accrue it.
*/
delta_w = 1024 - delta_w;
- if (runnable)
- sa->runnable_avg_sum += delta_w;
+ if (weight)
+ sa->load_sum += weight * delta_w;
if (running)
- sa->running_avg_sum += delta_w * scale_freq
- >> SCHED_CAPACITY_SHIFT;
- sa->avg_period += delta_w;
+ sa->util_sum += delta_w * scale_freq >> SCHED_CAPACITY_SHIFT;

delta -= delta_w;

@@ -2600,334 +2598,156 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
periods = delta / 1024;
delta %= 1024;

- sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
- periods + 1);
- sa->running_avg_sum = decay_load(sa->running_avg_sum,
- periods + 1);
- sa->avg_period = decay_load(sa->avg_period,
- periods + 1);
+ sa->load_sum = decay_load(sa->load_sum, periods + 1);
+ sa->util_sum = decay_load((u64)(sa->util_sum), periods + 1);

/* Efficiently calculate \sum (1..n_period) 1024*y^i */
- runnable_contrib = __compute_runnable_contrib(periods);
- if (runnable)
- sa->runnable_avg_sum += runnable_contrib;
+ contrib = __compute_runnable_contrib(periods);
+ if (weight)
+ sa->load_sum += weight * contrib;
if (running)
- sa->running_avg_sum += runnable_contrib * scale_freq
- >> SCHED_CAPACITY_SHIFT;
- sa->avg_period += runnable_contrib;
+ sa->util_sum += contrib * scale_freq >> SCHED_CAPACITY_SHIFT;
}

/* Remainder of delta accrued against u_0` */
- if (runnable)
- sa->runnable_avg_sum += delta;
+ if (weight)
+ sa->load_sum += weight * delta;
if (running)
- sa->running_avg_sum += delta * scale_freq
- >> SCHED_CAPACITY_SHIFT;
- sa->avg_period += delta;
-
- return decayed;
-}
-
-/* Synchronize an entity's decay with its parenting cfs_rq.*/
-static inline u64 __synchronize_entity_decay(struct sched_entity *se)
-{
- struct cfs_rq *cfs_rq = cfs_rq_of(se);
- u64 decays = atomic64_read(&cfs_rq->decay_counter);
+ sa->util_sum += delta * scale_freq >> SCHED_CAPACITY_SHIFT;

- decays -= se->avg.decay_count;
- se->avg.decay_count = 0;
- if (!decays)
- return 0;
+ sa->period_contrib += delta;

- se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
- se->avg.utilization_avg_contrib =
- decay_load(se->avg.utilization_avg_contrib, decays);
+ if (decayed) {
+ sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
+ sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;
+ }

- return decays;
+ return decayed;
}

#ifdef CONFIG_FAIR_GROUP_SCHED
-static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
- int force_update)
-{
- struct task_group *tg = cfs_rq->tg;
- long tg_contrib;
-
- tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
- tg_contrib -= cfs_rq->tg_load_contrib;
-
- if (!tg_contrib)
- return;
-
- if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
- atomic_long_add(tg_contrib, &tg->load_avg);
- cfs_rq->tg_load_contrib += tg_contrib;
- }
-}
-
/*
- * Aggregate cfs_rq runnable averages into an equivalent task_group
- * representation for computing load contributions.
+ * Updating tg's load_avg is necessary before update_cfs_share (which is done)
+ * and effective_load (which is not done because it is too costly).
*/
-static inline void __update_tg_runnable_avg(struct sched_avg *sa,
- struct cfs_rq *cfs_rq)
+static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
{
- struct task_group *tg = cfs_rq->tg;
- long contrib;
-
- /* The fraction of a cpu used by this cfs_rq */
- contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
- sa->avg_period + 1);
- contrib -= cfs_rq->tg_runnable_contrib;
+ long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

- if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
- atomic_add(contrib, &tg->runnable_avg);
- cfs_rq->tg_runnable_contrib += contrib;
- }
-}
-
-static inline void __update_group_entity_contrib(struct sched_entity *se)
-{
- struct cfs_rq *cfs_rq = group_cfs_rq(se);
- struct task_group *tg = cfs_rq->tg;
- int runnable_avg;
-
- u64 contrib;
-
- contrib = cfs_rq->tg_load_contrib * tg->shares;
- se->avg.load_avg_contrib = div_u64(contrib,
- atomic_long_read(&tg->load_avg) + 1);
-
- /*
- * For group entities we need to compute a correction term in the case
- * that they are consuming <1 cpu so that we would contribute the same
- * load as a task of equal weight.
- *
- * Explicitly co-ordinating this measurement would be expensive, but
- * fortunately the sum of each cpus contribution forms a usable
- * lower-bound on the true value.
- *
- * Consider the aggregate of 2 contributions. Either they are disjoint
- * (and the sum represents true value) or they are disjoint and we are
- * understating by the aggregate of their overlap.
- *
- * Extending this to N cpus, for a given overlap, the maximum amount we
- * understand is then n_i(n_i+1)/2 * w_i where n_i is the number of
- * cpus that overlap for this interval and w_i is the interval width.
- *
- * On a small machine; the first term is well-bounded which bounds the
- * total error since w_i is a subset of the period. Whereas on a
- * larger machine, while this first term can be larger, if w_i is the
- * of consequential size guaranteed to see n_i*w_i quickly converge to
- * our upper bound of 1-cpu.
- */
- runnable_avg = atomic_read(&tg->runnable_avg);
- if (runnable_avg < NICE_0_LOAD) {
- se->avg.load_avg_contrib *= runnable_avg;
- se->avg.load_avg_contrib >>= NICE_0_SHIFT;
+ if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
+ atomic_long_add(delta, &cfs_rq->tg->load_avg);
+ cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
}
}

#else /* CONFIG_FAIR_GROUP_SCHED */
-static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
- int force_update) {}
-static inline void __update_tg_runnable_avg(struct sched_avg *sa,
- struct cfs_rq *cfs_rq) {}
-static inline void __update_group_entity_contrib(struct sched_entity *se) {}
+static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) {}
#endif /* CONFIG_FAIR_GROUP_SCHED */

-static inline void __update_task_entity_contrib(struct sched_entity *se)
-{
- u32 contrib;
-
- /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
- contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
- contrib /= (se->avg.avg_period + 1);
- se->avg.load_avg_contrib = scale_load(contrib);
-}
+static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);

-/* Compute the current contribution to load_avg by se, return any delta */
-static long __update_entity_load_avg_contrib(struct sched_entity *se)
+/* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
+static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
{
- long old_contrib = se->avg.load_avg_contrib;
+ int decayed;
+ struct sched_avg *sa = &cfs_rq->avg;

- if (entity_is_task(se)) {
- __update_task_entity_contrib(se);
- } else {
- __update_tg_runnable_avg(&se->avg, group_cfs_rq(se));
- __update_group_entity_contrib(se);
+ if (atomic_long_read(&cfs_rq->removed_load_avg)) {
+ long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
+ sa->load_avg = max_t(long, sa->load_avg - r, 0);
+ sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
}

- return se->avg.load_avg_contrib - old_contrib;
-}
-
-
-static inline void __update_task_entity_utilization(struct sched_entity *se)
-{
- u32 contrib;
-
- /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
- contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
- contrib /= (se->avg.avg_period + 1);
- se->avg.utilization_avg_contrib = scale_load(contrib);
-}
+ if (atomic_long_read(&cfs_rq->removed_util_avg)) {
+ long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
+ sa->util_avg = max_t(long, sa->util_avg - r, 0);
+ sa->util_sum = max_t(s32, sa->util_sum -
+ ((r * LOAD_AVG_MAX) >> SCHED_LOAD_SHIFT), 0);
+ }

-static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
-{
- long old_contrib = se->avg.utilization_avg_contrib;
+ decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
+ scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL);

- if (entity_is_task(se))
- __update_task_entity_utilization(se);
- else
- se->avg.utilization_avg_contrib =
- group_cfs_rq(se)->utilization_load_avg;
-
- return se->avg.utilization_avg_contrib - old_contrib;
-}
+#ifndef CONFIG_64BIT
+ smp_wmb();
+ cfs_rq->load_last_update_time_copy = sa->last_update_time;
+#endif

-static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
- long load_contrib)
-{
- if (likely(load_contrib < cfs_rq->blocked_load_avg))
- cfs_rq->blocked_load_avg -= load_contrib;
- else
- cfs_rq->blocked_load_avg = 0;
+ return decayed;
}

-static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
-
-/* Update a sched_entity's runnable average */
-static inline void update_entity_load_avg(struct sched_entity *se,
- int update_cfs_rq)
+/* Update task and its cfs_rq load average */
+static inline void update_load_avg(struct sched_entity *se, int update_tg)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
- long contrib_delta, utilization_delta;
int cpu = cpu_of(rq_of(cfs_rq));
- u64 now;
+ u64 now = cfs_rq_clock_task(cfs_rq);

/*
- * For a group entity we need to use their owned cfs_rq_clock_task() in
- * case they are the parent of a throttled hierarchy.
+ * Track task load average for carrying it to new CPU after migrated, and
+ * track group sched_entity load average for task_h_load calc in migration
*/
- if (entity_is_task(se))
- now = cfs_rq_clock_task(cfs_rq);
- else
- now = cfs_rq_clock_task(group_cfs_rq(se));
+ __update_load_avg(now, cpu, &se->avg,
+ se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se);

- if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq,
- cfs_rq->curr == se))
- return;
-
- contrib_delta = __update_entity_load_avg_contrib(se);
- utilization_delta = __update_entity_utilization_avg_contrib(se);
-
- if (!update_cfs_rq)
- return;
-
- if (se->on_rq) {
- cfs_rq->runnable_load_avg += contrib_delta;
- cfs_rq->utilization_load_avg += utilization_delta;
- } else {
- subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
- }
+ if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
+ update_tg_load_avg(cfs_rq, 0);
}

-/*
- * Decay the load contributed by all blocked children and account this so that
- * their contribution may appropriately discounted when they wake up.
- */
-static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
+/* Add the load generated by se into cfs_rq's load average */
+static inline void
+enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- u64 now = cfs_rq_clock_task(cfs_rq) >> 20;
- u64 decays;
-
- decays = now - cfs_rq->last_decay;
- if (!decays && !force_update)
- return;
+ struct sched_avg *sa = &se->avg;
+ u64 now = cfs_rq_clock_task(cfs_rq);
+ int migrated = 0, decayed;

- if (atomic_long_read(&cfs_rq->removed_load)) {
- unsigned long removed_load;
- removed_load = atomic_long_xchg(&cfs_rq->removed_load, 0);
- subtract_blocked_load_contrib(cfs_rq, removed_load);
+ if (sa->last_update_time == 0) {
+ sa->last_update_time = now;
+ migrated = 1;
}
-
- if (decays) {
- cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
- decays);
- atomic64_add(decays, &cfs_rq->decay_counter);
- cfs_rq->last_decay = now;
+ else {
+ __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
+ se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se);
}

- __update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
-}
+ decayed = update_cfs_rq_load_avg(now, cfs_rq);

-/* Add the load generated by se into cfs_rq's child load-average */
-static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
- struct sched_entity *se,
- int wakeup)
-{
- /*
- * We track migrations using entity decay_count <= 0, on a wake-up
- * migration we use a negative decay count to track the remote decays
- * accumulated while sleeping.
- *
- * Newly forked tasks are enqueued with se->avg.decay_count == 0, they
- * are seen by enqueue_entity_load_avg() as a migration with an already
- * constructed load_avg_contrib.
- */
- if (unlikely(se->avg.decay_count <= 0)) {
- se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
- if (se->avg.decay_count) {
- /*
- * In a wake-up migration we have to approximate the
- * time sleeping. This is because we can't synchronize
- * clock_task between the two cpus, and it is not
- * guaranteed to be read-safe. Instead, we can
- * approximate this using our carried decays, which are
- * explicitly atomically readable.
- */
- se->avg.last_runnable_update -= (-se->avg.decay_count)
- << 20;
- update_entity_load_avg(se, 0);
- /* Indicate that we're now synchronized and on-rq */
- se->avg.decay_count = 0;
- }
- wakeup = 0;
- } else {
- __synchronize_entity_decay(se);
+ if (migrated) {
+ cfs_rq->avg.load_avg += sa->load_avg;
+ cfs_rq->avg.load_sum += sa->load_sum;
+ cfs_rq->avg.util_avg += sa->util_avg;
+ cfs_rq->avg.util_sum += sa->util_sum;
}

- /* migrated tasks did not contribute to our blocked load */
- if (wakeup) {
- subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
- update_entity_load_avg(se, 0);
- }
-
- cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
- cfs_rq->utilization_load_avg += se->avg.utilization_avg_contrib;
- /* we force update consideration on load-balancer moves */
- update_cfs_rq_blocked_load(cfs_rq, !wakeup);
+ if (decayed || migrated)
+ update_tg_load_avg(cfs_rq, 0);
}

/*
- * Remove se's load from this cfs_rq child load-average, if the entity is
- * transitioning to a blocked state we track its projected decay using
- * blocked_load_avg.
+ * Task first catches up with cfs_rq, and then subtract
+ * itself from the cfs_rq (task must be off the queue now).
*/
-static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
- struct sched_entity *se,
- int sleep)
+void remove_entity_load_avg(struct sched_entity *se)
{
- update_entity_load_avg(se, 1);
- /* we force update consideration on load-balancer moves */
- update_cfs_rq_blocked_load(cfs_rq, !sleep);
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ u64 last_update_time;
+
+#ifndef CONFIG_64BIT
+ u64 last_update_time_copy;

- cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
- cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
- if (sleep) {
- cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
- se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
- } /* migrations, e.g. sleep=0 leave decay_count == 0 */
+ do {
+ last_update_time_copy = cfs_rq->load_last_update_time_copy;
+ smp_rmb();
+ last_update_time = cfs_rq->avg.last_update_time;
+ } while (last_update_time != last_update_time_copy);
+#else
+ last_update_time = cfs_rq->avg.last_update_time;
+#endif
+
+ __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0);
+ atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
+ atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
}

/*
@@ -2952,16 +2772,10 @@ static int idle_balance(struct rq *this_rq);

#else /* CONFIG_SMP */

-static inline void update_entity_load_avg(struct sched_entity *se,
- int update_cfs_rq) {}
-static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
- struct sched_entity *se,
- int wakeup) {}
-static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
- struct sched_entity *se,
- int sleep) {}
-static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
- int force_update) {}
+static inline void update_load_avg(struct sched_entity *se, int update_tg) {}
+static inline void
+enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+static inline void remove_entity_load_avg(struct sched_entity *se) {}

static inline int idle_balance(struct rq *rq)
{
@@ -3093,7 +2907,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
- enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
+ enqueue_entity_load_avg(cfs_rq, se);
account_entity_enqueue(cfs_rq, se);
update_cfs_shares(cfs_rq);

@@ -3168,7 +2982,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
- dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
+ update_load_avg(se, 1);

update_stats_dequeue(cfs_rq, se);
if (flags & DEQUEUE_SLEEP) {
@@ -3258,7 +3072,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
update_stats_wait_end(cfs_rq, se);
__dequeue_entity(cfs_rq, se);
- update_entity_load_avg(se, 1);
+ update_load_avg(se, 1);
}

update_stats_curr_start(cfs_rq, se);
@@ -3358,7 +3172,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
/* Put 'current' back into the tree. */
__enqueue_entity(cfs_rq, prev);
/* in !on_rq case, update occurred at dequeue */
- update_entity_load_avg(prev, 1);
+ update_load_avg(prev, 0);
}
cfs_rq->curr = NULL;
}
@@ -3374,8 +3188,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
/*
* Ensure that runnable average is periodically updated.
*/
- update_entity_load_avg(curr, 1);
- update_cfs_rq_blocked_load(cfs_rq, 1);
+ update_load_avg(curr, 1);
update_cfs_shares(cfs_rq);

#ifdef CONFIG_SCHED_HRTICK
@@ -4248,8 +4061,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_throttled(cfs_rq))
break;

+ update_load_avg(se, 1);
update_cfs_shares(cfs_rq);
- update_entity_load_avg(se, 1);
}

if (!se)
@@ -4308,8 +4121,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_throttled(cfs_rq))
break;

+ update_load_avg(se, 1);
update_cfs_shares(cfs_rq);
- update_entity_load_avg(se, 1);
}

if (!se)
@@ -4448,7 +4261,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
static void update_idle_cpu_load(struct rq *this_rq)
{
unsigned long curr_jiffies = READ_ONCE(jiffies);
- unsigned long load = this_rq->cfs.runnable_load_avg;
+ unsigned long load = this_rq->cfs.avg.load_avg;
unsigned long pending_updates;

/*
@@ -4494,7 +4307,7 @@ void update_cpu_load_nohz(void)
*/
void update_cpu_load_active(struct rq *this_rq)
{
- unsigned long load = this_rq->cfs.runnable_load_avg;
+ unsigned long load = this_rq->cfs.avg.load_avg;
/*
* See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
*/
@@ -4505,7 +4318,7 @@ void update_cpu_load_active(struct rq *this_rq)
/* Used instead of source_load when we know the type == 0 */
static unsigned long weighted_cpuload(const int cpu)
{
- return cpu_rq(cpu)->cfs.runnable_load_avg;
+ return cpu_rq(cpu)->cfs.avg.load_avg;
}

/*
@@ -4555,7 +4368,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
unsigned long nr_running = READ_ONCE(rq->cfs.h_nr_running);
- unsigned long load_avg = rq->cfs.runnable_load_avg;
+ unsigned long load_avg = rq->cfs.avg.load_avg;

if (nr_running)
return load_avg / nr_running;
@@ -4674,7 +4487,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
/*
* w = rw_i + @wl
*/
- w = se->my_q->load.weight + wl;
+ w = se->my_q->avg.load_avg + wl;

/*
* wl = S * s'_i; see (2)
@@ -4695,7 +4508,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
/*
* wl = dw_i = S * (s'_i - s_i); see (3)
*/
- wl -= se->load.weight;
+ wl -= se->avg.load_avg;

/*
* Recursively apply this logic to all parent groups to compute
@@ -4769,14 +4582,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
*/
if (sync) {
tg = task_group(current);
- weight = current->se.load.weight;
+ weight = current->se.avg.load_avg;

this_load += effective_load(tg, this_cpu, -weight, -weight);
load += effective_load(tg, prev_cpu, 0, -weight);
}

tg = task_group(p);
- weight = p->se.load.weight;
+ weight = p->se.avg.load_avg;

/*
* In low-load situations, where prev_cpu is idle and this_cpu is idle
@@ -4969,12 +4782,12 @@ done:
* tasks. The unit of the return value must be the one of capacity so we can
* compare the usage with the capacity of the CPU that is available for CFS
* task (ie cpu_capacity).
- * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
+ * cfs.avg.util_avg is the sum of running time of runnable tasks on a
* CPU. It represents the amount of utilization of a CPU in the range
* [0..SCHED_LOAD_SCALE]. The usage of a CPU can't be higher than the full
* capacity of the CPU because it's about the running time on this CPU.
- * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
- * because of unfortunate rounding in avg_period and running_load_avg or just
+ * Nevertheless, cfs.avg.util_avg can be higher than SCHED_LOAD_SCALE
+ * because of unfortunate rounding in util_avg or just
* after migrating tasks until the average stabilizes with the new running
* time. So we need to check that the usage stays into the range
* [0..cpu_capacity_orig] and cap if necessary.
@@ -4983,7 +4796,7 @@ done:
*/
static int get_cpu_usage(int cpu)
{
- unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+ unsigned long usage = cpu_rq(cpu)->cfs.avg.util_avg;
unsigned long capacity = capacity_orig_of(cpu);

if (usage >= SCHED_LOAD_SCALE)
@@ -5089,26 +4902,22 @@ unlock:
* previous cpu. However, the caller only guarantees p->pi_lock is held; no
* other assumptions, including the state of rq->lock, should be made.
*/
-static void
-migrate_task_rq_fair(struct task_struct *p, int next_cpu)
+static void migrate_task_rq_fair(struct task_struct *p, int next_cpu)
{
- struct sched_entity *se = &p->se;
- struct cfs_rq *cfs_rq = cfs_rq_of(se);
-
/*
- * Load tracking: accumulate removed load so that it can be processed
- * when we next update owning cfs_rq under rq->lock. Tasks contribute
- * to blocked load iff they have a positive decay-count. It can never
- * be negative here since on-rq tasks have decay-count == 0.
+ * We are supposed to update the task to "current" time, then its up to date
+ * and ready to go to new CPU/cfs_rq. But we have difficulty in getting
+ * what current time is, so simply throw away the out-of-date time. This
+ * will result in the wakee task is less decayed, but giving the wakee more
+ * load sounds not bad.
*/
- if (se->avg.decay_count) {
- se->avg.decay_count = -__synchronize_entity_decay(se);
- atomic_long_add(se->avg.load_avg_contrib,
- &cfs_rq->removed_load);
- }
+ remove_entity_load_avg(&p->se);
+
+ /* Tell new CPU we are migrated */
+ p->se.avg.last_update_time = 0;

/* We have migrated, no longer consider this task hot */
- se->exec_start = 0;
+ p->se.exec_start = 0;
}
#endif /* CONFIG_SMP */

@@ -5995,36 +5804,6 @@ static void attach_tasks(struct lb_env *env)
}

#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
- * update tg->load_weight by folding this cpu's load_avg
- */
-static void __update_blocked_averages_cpu(struct task_group *tg, int cpu)
-{
- struct sched_entity *se = tg->se[cpu];
- struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
-
- /* throttled entities do not contribute to load */
- if (throttled_hierarchy(cfs_rq))
- return;
-
- update_cfs_rq_blocked_load(cfs_rq, 1);
-
- if (se) {
- update_entity_load_avg(se, 1);
- /*
- * We pivot on our runnable average having decayed to zero for
- * list removal. This generally implies that all our children
- * have also been removed (modulo rounding error or bandwidth
- * control); however, such cases are rare and we can fix these
- * at enqueue.
- *
- * TODO: fix up out-of-order children on enqueue.
- */
- if (!se->avg.runnable_avg_sum && !cfs_rq->nr_running)
- list_del_leaf_cfs_rq(cfs_rq);
- }
-}
-
static void update_blocked_averages(int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -6033,17 +5812,18 @@ static void update_blocked_averages(int cpu)

raw_spin_lock_irqsave(&rq->lock, flags);
update_rq_clock(rq);
+
/*
* Iterates the task_group tree in a bottom up fashion, see
* list_add_leaf_cfs_rq() for details.
*/
for_each_leaf_cfs_rq(rq, cfs_rq) {
- /*
- * Note: We may want to consider periodically releasing
- * rq->lock about these updates so that creating many task
- * groups does not result in continually extending hold time.
- */
- __update_blocked_averages_cpu(cfs_rq->tg, rq->cpu);
+ /* throttled entities do not contribute to load */
+ if (throttled_hierarchy(cfs_rq))
+ continue;
+
+ if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq))
+ update_tg_load_avg(cfs_rq, 0);
}

raw_spin_unlock_irqrestore(&rq->lock, flags);
@@ -6073,14 +5853,13 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
}

if (!se) {
- cfs_rq->h_load = cfs_rq->runnable_load_avg;
+ cfs_rq->h_load = cfs_rq->avg.load_avg;
cfs_rq->last_h_load_update = now;
}

while ((se = cfs_rq->h_load_next) != NULL) {
load = cfs_rq->h_load;
- load = div64_ul(load * se->avg.load_avg_contrib,
- cfs_rq->runnable_load_avg + 1);
+ load = div64_ul(load * se->avg.load_avg, cfs_rq->avg.load_avg + 1);
cfs_rq = group_cfs_rq(se);
cfs_rq->h_load = load;
cfs_rq->last_h_load_update = now;
@@ -6092,8 +5871,8 @@ static unsigned long task_h_load(struct task_struct *p)
struct cfs_rq *cfs_rq = task_cfs_rq(p);

update_cfs_rq_h_load(cfs_rq);
- return div64_ul(p->se.avg.load_avg_contrib * cfs_rq->h_load,
- cfs_rq->runnable_load_avg + 1);
+ return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
+ cfs_rq->avg.load_avg + 1);
}
#else
static inline void update_blocked_averages(int cpu)
@@ -6102,7 +5881,7 @@ static inline void update_blocked_averages(int cpu)

static unsigned long task_h_load(struct task_struct *p)
{
- return p->se.avg.load_avg_contrib;
+ return p->se.avg.load_avg;
}
#endif

@@ -8103,15 +7882,18 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
}

#ifdef CONFIG_SMP
- /*
- * Remove our load from contribution when we leave sched_fair
- * and ensure we don't carry in an old decay_count if we
- * switch back.
- */
- if (se->avg.decay_count) {
- __synchronize_entity_decay(se);
- subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
- }
+ /* Catch up with the cfs_rq and remove our load when we leave */
+ __update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq), &se->avg,
+ se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se);
+
+ cfs_rq->avg.load_avg =
+ max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
+ cfs_rq->avg.load_sum =
+ max_t(s64, cfs_rq->avg.load_sum - se->avg.load_sum, 0);
+ cfs_rq->avg.util_avg =
+ max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
+ cfs_rq->avg.util_sum =
+ max_t(s32, cfs_rq->avg.util_sum - se->avg.util_sum, 0);
#endif
}

@@ -8168,8 +7950,8 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif
#ifdef CONFIG_SMP
- atomic64_set(&cfs_rq->decay_counter, 1);
- atomic_long_set(&cfs_rq->removed_load, 0);
+ atomic_long_set(&cfs_rq->removed_load_avg, 0);
+ atomic_long_set(&cfs_rq->removed_util_avg, 0);
#endif
}

@@ -8214,14 +7996,14 @@ static void task_move_group_fair(struct task_struct *p, int queued)
if (!queued) {
cfs_rq = cfs_rq_of(se);
se->vruntime += cfs_rq->min_vruntime;
+
#ifdef CONFIG_SMP
- /*
- * migrate_task_rq_fair() will have removed our previous
- * contribution, but we must synchronize for ongoing future
- * decay.
- */
- se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
- cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+ /* Virtually synchronize task with its new cfs_rq */
+ p->se.avg.last_update_time = cfs_rq->avg.last_update_time;
+ cfs_rq->avg.load_avg += p->se.avg.load_avg;
+ cfs_rq->avg.load_sum += p->se.avg.load_sum;
+ cfs_rq->avg.util_avg += p->se.avg.util_avg;
+ cfs_rq->avg.util_sum += p->se.avg.util_sum;
#endif
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d465a5c..3dfec8d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -245,7 +245,6 @@ struct task_group {

#ifdef CONFIG_SMP
atomic_long_t load_avg;
- atomic_t runnable_avg;
#endif
#endif

@@ -366,27 +365,18 @@ struct cfs_rq {

#ifdef CONFIG_SMP
/*
- * CFS Load tracking
- * Under CFS, load is tracked on a per-entity basis and aggregated up.
- * This allows for the description of both thread and group usage (in
- * the FAIR_GROUP_SCHED case).
- * runnable_load_avg is the sum of the load_avg_contrib of the
- * sched_entities on the rq.
- * blocked_load_avg is similar to runnable_load_avg except that its
- * the blocked sched_entities on the rq.
- * utilization_load_avg is the sum of the average running time of the
- * sched_entities on the rq.
+ * CFS load tracking
*/
- unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
- atomic64_t decay_counter;
- u64 last_decay;
- atomic_long_t removed_load;
-
+ struct sched_avg avg;
#ifdef CONFIG_FAIR_GROUP_SCHED
- /* Required to track per-cpu representation of a task_group */
- u32 tg_runnable_contrib;
- unsigned long tg_load_contrib;
+ unsigned long tg_load_avg_contrib;
+#endif
+ atomic_long_t removed_load_avg, removed_util_avg;
+#ifndef CONFIG_64BIT
+ u64 load_last_update_time_copy;
+#endif

+#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* h_load = weight * f(tg)
*
--
2.1.4

2015-07-15 07:57:28

by Yuyang Du

[permalink] [raw]
Subject: [PATCH v10 3/7] sched: Implement update_blocked_averages() for CONFIG_FAIR_GROUP_SCHED=n

The load and the util of idle cpus must be updated periodically in
order to decay the blocked part.

If CONFIG_FAIR_GROUP_SCHED is not set, the load and util of idle cpus
are not decayed and stay at the values set before becoming idle.

Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-by: Yuyang Du <[email protected]>
---
kernel/sched/fair.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 452c932..3e9bccf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5877,6 +5877,16 @@ static unsigned long task_h_load(struct task_struct *p)
#else
static inline void update_blocked_averages(int cpu)
{
+ struct rq *rq = cpu_rq(cpu);
+ struct cfs_rq *cfs_rq = &rq->cfs;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ update_rq_clock(rq);
+
+ update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
+
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
}

static unsigned long task_h_load(struct task_struct *p)
--
2.1.4

2015-07-15 07:56:29

by Yuyang Du

[permalink] [raw]
Subject: [PATCH v10 4/7] sched: Init cfs_rq's sched_entity load average

The runnable load and utilization averages of cfs_rq's sched_entity
were not initiated. Like done to a task, give new cfs_rq' sched_entity
start values to heavy its load in infant time.

Signed-off-by: Yuyang Du <[email protected]>
---
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 11 ++++++-----
kernel/sched/sched.h | 2 +-
3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4dfab27..2d4c597 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2112,7 +2112,7 @@ void wake_up_new_task(struct task_struct *p)
#endif

/* Initialize new task's runnable average */
- init_task_runnable_average(p);
+ init_entity_runnable_average(&p->se);
rq = __task_rq_lock(p);
activate_task(rq, p, 0);
p->on_rq = TASK_ON_RQ_QUEUED;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e9bccf..edc404c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -672,10 +672,10 @@ static unsigned long task_h_load(struct task_struct *p);
#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_MAX_AVG */

-/* Give new task start runnable values to heavy its load in infant time */
-void init_task_runnable_average(struct task_struct *p)
+/* Give new sched_entity start runnable values to heavy its load in infant time */
+void init_entity_runnable_average(struct sched_entity *se)
{
- struct sched_avg *sa = &p->se.avg;
+ struct sched_avg *sa = &se->avg;

sa->last_update_time = 0;
/*
@@ -684,14 +684,14 @@ void init_task_runnable_average(struct task_struct *p)
* will definitely be update (after enqueue).
*/
sa->period_contrib = 1023;
- sa->load_avg = scale_load_down(p->se.load.weight);
+ sa->load_avg = scale_load_down(se->load.weight);
sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
sa->util_sum = LOAD_AVG_MAX;
/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
}
#else
-void init_task_runnable_average(struct task_struct *p)
+void init_entity_runnable_average(struct sched_entity *se)
{
}
#endif
@@ -8065,6 +8065,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)

init_cfs_rq(cfs_rq);
init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
+ init_entity_runnable_average(se);
}

return 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3dfec8d..f2b17ea 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1293,7 +1293,7 @@ extern void init_dl_task_timer(struct sched_dl_entity *dl_se);

unsigned long to_ratio(u64 period, u64 runtime);

-extern void init_task_runnable_average(struct task_struct *p);
+extern void init_entity_runnable_average(struct sched_entity *se);

static inline void add_nr_running(struct rq *rq, unsigned count)
{
--
2.1.4

2015-07-15 07:56:27

by Yuyang Du

[permalink] [raw]
Subject: [PATCH v10 5/7] sched: Remove task and group entity load when they are dead

When task exits or group is destroyed, the entity's load should be
removed from its parent cfs_rq's load. Otherwise, it will take time
for the parent cfs_rq to decay the dead entity's load to 0, which
is not desired.

Signed-off-by: Yuyang Du <[email protected]>
---
kernel/sched/fair.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index edc404c..19f1199 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4919,6 +4919,11 @@ static void migrate_task_rq_fair(struct task_struct *p, int next_cpu)
/* We have migrated, no longer consider this task hot */
p->se.exec_start = 0;
}
+
+static void task_dead_fair(struct task_struct *p)
+{
+ remove_entity_load_avg(&p->se);
+}
#endif /* CONFIG_SMP */

static unsigned long
@@ -8027,8 +8032,11 @@ void free_fair_sched_group(struct task_group *tg)
for_each_possible_cpu(i) {
if (tg->cfs_rq)
kfree(tg->cfs_rq[i]);
- if (tg->se)
+ if (tg->se) {
+ if (tg->se[i])
+ remove_entity_load_avg(tg->se[i]);
kfree(tg->se[i]);
+ }
}

kfree(tg->cfs_rq);
@@ -8215,6 +8223,7 @@ const struct sched_class fair_sched_class = {
.rq_offline = rq_offline_fair,

.task_waking = task_waking_fair,
+ .task_dead = task_dead_fair,
#endif

.set_curr_task = set_curr_task_fair,
--
2.1.4

2015-07-15 07:56:48

by Yuyang Du

[permalink] [raw]
Subject: [PATCH v10 6/7] sched: Provide runnable_load_avg back to cfs_rq

The cfs_rq's load_avg is composed of runnable_load_avg and blocked_load_avg.
Before this series, sometimes the runnable_load_avg is used, and sometimes
the load_avg is used. Completely replacing all uses of runnable_load_avg
with load_avg may be too big a leap, i.e., the blocked_load_avg is concerned
to result in overrated load. Therefore, we get runnable_load_avg back.

The new cfs_rq's runnable_load_avg is improved to be updated with all of the
runnable sched_eneities at the same time, so the one sched_entity updated and
the others stale problem is solved.

Signed-off-by: Yuyang Du <[email protected]>
---
kernel/sched/debug.c | 2 ++
kernel/sched/fair.c | 57 +++++++++++++++++++++++++++++++++++++++++-----------
kernel/sched/sched.h | 2 ++
3 files changed, 49 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 56d83f3..bfb7bb2 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -205,6 +205,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
#ifdef CONFIG_SMP
SEQ_printf(m, " .%-30s: %lu\n", "load_avg",
cfs_rq->avg.load_avg);
+ SEQ_printf(m, " .%-30s: %lu\n", "runnable_load_avg",
+ cfs_rq->runnable_load_avg);
SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
cfs_rq->avg.util_avg);
SEQ_printf(m, " .%-30s: %ld\n", "removed_load_avg",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 19f1199..a1bffe1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2545,9 +2545,9 @@ static u32 __compute_runnable_contrib(u64 n)
* = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
*/
static __always_inline int __update_load_avg(u64 now, int cpu,
- struct sched_avg *sa,
- unsigned long weight,
- int running)
+ struct sched_avg *sa,
+ unsigned long weight,
+ int running, struct cfs_rq *cfs_rq)
{
u64 delta, periods;
u32 contrib;
@@ -2587,8 +2587,11 @@ static __always_inline int __update_load_avg(u64 now, int cpu,
* period and accrue it.
*/
delta_w = 1024 - delta_w;
- if (weight)
+ if (weight) {
sa->load_sum += weight * delta_w;
+ if (cfs_rq)
+ cfs_rq->runnable_load_sum += weight * delta_w;
+ }
if (running)
sa->util_sum += delta_w * scale_freq >> SCHED_CAPACITY_SHIFT;

@@ -2599,19 +2602,28 @@ static __always_inline int __update_load_avg(u64 now, int cpu,
delta %= 1024;

sa->load_sum = decay_load(sa->load_sum, periods + 1);
+ if (cfs_rq)
+ cfs_rq->runnable_load_sum =
+ decay_load(cfs_rq->runnable_load_sum, periods + 1);
sa->util_sum = decay_load((u64)(sa->util_sum), periods + 1);

/* Efficiently calculate \sum (1..n_period) 1024*y^i */
contrib = __compute_runnable_contrib(periods);
- if (weight)
+ if (weight) {
sa->load_sum += weight * contrib;
+ if (cfs_rq)
+ cfs_rq->runnable_load_sum += weight * contrib;
+ }
if (running)
sa->util_sum += contrib * scale_freq >> SCHED_CAPACITY_SHIFT;
}

/* Remainder of delta accrued against u_0` */
- if (weight)
+ if (weight) {
sa->load_sum += weight * delta;
+ if (cfs_rq)
+ cfs_rq->runnable_load_sum += weight * delta;
+ }
if (running)
sa->util_sum += delta * scale_freq >> SCHED_CAPACITY_SHIFT;

@@ -2619,6 +2631,9 @@ static __always_inline int __update_load_avg(u64 now, int cpu,

if (decayed) {
sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
+ if (cfs_rq)
+ cfs_rq->runnable_load_avg =
+ div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;
}

@@ -2666,7 +2681,7 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
}

decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
- scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL);
+ scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);

#ifndef CONFIG_64BIT
smp_wmb();
@@ -2688,7 +2703,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
* track group sched_entity load average for task_h_load calc in migration
*/
__update_load_avg(now, cpu, &se->avg,
- se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se);
+ se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se, NULL);

if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
update_tg_load_avg(cfs_rq, 0);
@@ -2708,11 +2723,15 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
}
else {
__update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
- se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se);
+ se->on_rq * scale_load_down(se->load.weight),
+ cfs_rq->curr == se, NULL);
}

decayed = update_cfs_rq_load_avg(now, cfs_rq);

+ cfs_rq->runnable_load_avg += sa->load_avg;
+ cfs_rq->runnable_load_sum += sa->load_sum;
+
if (migrated) {
cfs_rq->avg.load_avg += sa->load_avg;
cfs_rq->avg.load_sum += sa->load_sum;
@@ -2724,6 +2743,18 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
update_tg_load_avg(cfs_rq, 0);
}

+/* Remove the runnable load generated by se from cfs_rq's runnable load average */
+static inline void
+dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ update_load_avg(se, 1);
+
+ cfs_rq->runnable_load_avg =
+ max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
+ cfs_rq->runnable_load_sum =
+ max_t(s64, cfs_rq->runnable_load_sum - se->avg.load_sum, 0);
+}
+
/*
* Task first catches up with cfs_rq, and then subtract
* itself from the cfs_rq (task must be off the queue now).
@@ -2745,7 +2776,7 @@ void remove_entity_load_avg(struct sched_entity *se)
last_update_time = cfs_rq->avg.last_update_time;
#endif

- __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0);
+ __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
}
@@ -2775,6 +2806,8 @@ static int idle_balance(struct rq *this_rq);
static inline void update_load_avg(struct sched_entity *se, int update_tg) {}
static inline void
enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+static inline void
+dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
static inline void remove_entity_load_avg(struct sched_entity *se) {}

static inline int idle_balance(struct rq *rq)
@@ -2982,7 +3015,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
- update_load_avg(se, 1);
+ dequeue_entity_load_avg(cfs_rq, se);

update_stats_dequeue(cfs_rq, se);
if (flags & DEQUEUE_SLEEP) {
@@ -7899,7 +7932,7 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
#ifdef CONFIG_SMP
/* Catch up with the cfs_rq and remove our load when we leave */
__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq), &se->avg,
- se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se);
+ se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se, NULL);

cfs_rq->avg.load_avg =
max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f2b17ea..118f5ae 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -368,6 +368,8 @@ struct cfs_rq {
* CFS load tracking
*/
struct sched_avg avg;
+ u64 runnable_load_sum;
+ unsigned long runnable_load_avg;
#ifdef CONFIG_FAIR_GROUP_SCHED
unsigned long tg_load_avg_contrib;
#endif
--
2.1.4

2015-07-15 07:56:34

by Yuyang Du

[permalink] [raw]
Subject: [PATCH v10 7/7] sched: Clean up load average references

For cfs_rq, we have load.weight, runnable_load_avg, and load_avg. We
now start to clean up how they are used.

First, as group sched_entity already largely uses load_avg, we now expand
to use load_avg in all cases. Second, for CPU-wide load balancing, we
choose to use runnable_load_avg in all cases, which is the same as before
this series.

Signed-off-by: Yuyang Du <[email protected]>
---
kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++---------------
1 file changed, 29 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1bffe1..cd07138 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -690,6 +690,9 @@ void init_entity_runnable_average(struct sched_entity *se)
sa->util_sum = LOAD_AVG_MAX;
/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
}
+
+static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
+static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq);
#else
void init_entity_runnable_average(struct sched_entity *se)
{
@@ -2364,7 +2367,7 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
*/
tg_weight = atomic_long_read(&tg->load_avg);
tg_weight -= cfs_rq->tg_load_avg_contrib;
- tg_weight += cfs_rq->avg.load_avg;
+ tg_weight += cfs_rq_load_avg(cfs_rq);

return tg_weight;
}
@@ -2374,7 +2377,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
long tg_weight, load, shares;

tg_weight = calc_tg_weight(tg, cfs_rq);
- load = cfs_rq->avg.load_avg;
+ load = cfs_rq_load_avg(cfs_rq);

shares = (tg->shares * load);
if (tg_weight)
@@ -2799,6 +2802,16 @@ void idle_exit_fair(struct rq *this_rq)
{
}

+static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->runnable_load_avg;
+}
+
+static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->avg.load_avg;
+}
+
static int idle_balance(struct rq *this_rq);

#else /* CONFIG_SMP */
@@ -4273,6 +4286,12 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
sched_avg_update(this_rq);
}

+/* Used instead of source_load when we know the type == 0 */
+static unsigned long weighted_cpuload(const int cpu)
+{
+ return cfs_rq_runnable_load_avg(&cpu_rq(cpu)->cfs);
+}
+
#ifdef CONFIG_NO_HZ_COMMON
/*
* There is no sane way to deal with nohz on smp when using jiffies because the
@@ -4294,7 +4313,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
static void update_idle_cpu_load(struct rq *this_rq)
{
unsigned long curr_jiffies = READ_ONCE(jiffies);
- unsigned long load = this_rq->cfs.avg.load_avg;
+ unsigned long load = weighted_cpuload(cpu_of(this_rq));
unsigned long pending_updates;

/*
@@ -4340,7 +4359,7 @@ void update_cpu_load_nohz(void)
*/
void update_cpu_load_active(struct rq *this_rq)
{
- unsigned long load = this_rq->cfs.avg.load_avg;
+ unsigned long load = weighted_cpuload(cpu_of(this_rq));
/*
* See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
*/
@@ -4348,12 +4367,6 @@ void update_cpu_load_active(struct rq *this_rq)
__update_cpu_load(this_rq, load, 1);
}

-/* Used instead of source_load when we know the type == 0 */
-static unsigned long weighted_cpuload(const int cpu)
-{
- return cpu_rq(cpu)->cfs.avg.load_avg;
-}
-
/*
* Return a low guess at the load of a migration-source cpu weighted
* according to the scheduling class and "nice" value.
@@ -4401,7 +4414,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
unsigned long nr_running = READ_ONCE(rq->cfs.h_nr_running);
- unsigned long load_avg = rq->cfs.avg.load_avg;
+ unsigned long load_avg = weighted_cpuload(cpu);

if (nr_running)
return load_avg / nr_running;
@@ -4520,7 +4533,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
/*
* w = rw_i + @wl
*/
- w = se->my_q->avg.load_avg + wl;
+ w = cfs_rq_load_avg(se->my_q) + wl;

/*
* wl = S * s'_i; see (2)
@@ -5891,13 +5904,14 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
}

if (!se) {
- cfs_rq->h_load = cfs_rq->avg.load_avg;
+ cfs_rq->h_load = cfs_rq_load_avg(cfs_rq);
cfs_rq->last_h_load_update = now;
}

while ((se = cfs_rq->h_load_next) != NULL) {
load = cfs_rq->h_load;
- load = div64_ul(load * se->avg.load_avg, cfs_rq->avg.load_avg + 1);
+ load = div64_ul(load * se->avg.load_avg,
+ cfs_rq_load_avg(cfs_rq) + 1);
cfs_rq = group_cfs_rq(se);
cfs_rq->h_load = load;
cfs_rq->last_h_load_update = now;
@@ -5910,7 +5924,7 @@ static unsigned long task_h_load(struct task_struct *p)

update_cfs_rq_h_load(cfs_rq);
return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
- cfs_rq->avg.load_avg + 1);
+ cfs_rq_load_avg(cfs_rq) + 1);
}
#else
static inline void update_blocked_averages(int cpu)
--
2.1.4

2015-07-21 08:35:19

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH v10 6/7] sched: Provide runnable_load_avg back to cfs_rq

On Tue, Jul 21, 2015 at 09:08:07AM +0800, Boqun Feng wrote:
> Hi Yuyang,
>
> On Wed, Jul 15, 2015 at 08:04:41AM +0800, Yuyang Du wrote:
> > The cfs_rq's load_avg is composed of runnable_load_avg and blocked_load_avg.
> > Before this series, sometimes the runnable_load_avg is used, and sometimes
> > the load_avg is used. Completely replacing all uses of runnable_load_avg
> > with load_avg may be too big a leap, i.e., the blocked_load_avg is concerned
> > to result in overrated load. Therefore, we get runnable_load_avg back.
> >
> > The new cfs_rq's runnable_load_avg is improved to be updated with all of the
> > runnable sched_eneities at the same time, so the one sched_entity updated and
> > the others stale problem is solved.
> >
>
> How about tracking cfs_rq's blocked_load_avg instead of
> runnable_load_avg, because, AFAICS:
>
> cfs_rq->runnable_load_avg = se->avg.load_avg - cfs_rq->blocked_load_avg.

No, cfs_rq->runnable_load_avg = cfs_rq->avg.load_avg - cfs_rq->blocked_load_avg,
without rounding errors and the like.

> se is the corresponding sched_entity of cfs_rq. And when we need the
> runnable_load_avg, we just calculate by the expression above.
>
> This can be thought as a lazy way to update runnable_load_avg, and we
> don't need to modify __update_load_avg any more.

Not lazy at all, but adding (as of now) useless blocked_load_avg and an
extra subtraction.

Or did you forget blocked_load_avg also needs to be updated/decayed as
time elapses?

2015-07-21 01:08:22

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH v10 6/7] sched: Provide runnable_load_avg back to cfs_rq

Hi Yuyang,

On Wed, Jul 15, 2015 at 08:04:41AM +0800, Yuyang Du wrote:
> The cfs_rq's load_avg is composed of runnable_load_avg and blocked_load_avg.
> Before this series, sometimes the runnable_load_avg is used, and sometimes
> the load_avg is used. Completely replacing all uses of runnable_load_avg
> with load_avg may be too big a leap, i.e., the blocked_load_avg is concerned
> to result in overrated load. Therefore, we get runnable_load_avg back.
>
> The new cfs_rq's runnable_load_avg is improved to be updated with all of the
> runnable sched_eneities at the same time, so the one sched_entity updated and
> the others stale problem is solved.
>

How about tracking cfs_rq's blocked_load_avg instead of
runnable_load_avg, because, AFAICS:

cfs_rq->runnable_load_avg = se->avg.load_avg - cfs_rq->blocked_load_avg.

se is the corresponding sched_entity of cfs_rq. And when we need the
runnable_load_avg, we just calculate by the expression above.

This can be thought as a lazy way to update runnable_load_avg, and we
don't need to modify __update_load_avg any more.

Regards,
Boqun

2015-07-21 10:18:54

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH v10 6/7] sched: Provide runnable_load_avg back to cfs_rq

On Tue, Jul 21, 2015 at 08:44:01AM +0800, Yuyang Du wrote:
> On Tue, Jul 21, 2015 at 09:08:07AM +0800, Boqun Feng wrote:
> > Hi Yuyang,
> >
> > On Wed, Jul 15, 2015 at 08:04:41AM +0800, Yuyang Du wrote:
> > > The cfs_rq's load_avg is composed of runnable_load_avg and blocked_load_avg.
> > > Before this series, sometimes the runnable_load_avg is used, and sometimes
> > > the load_avg is used. Completely replacing all uses of runnable_load_avg
> > > with load_avg may be too big a leap, i.e., the blocked_load_avg is concerned
> > > to result in overrated load. Therefore, we get runnable_load_avg back.
> > >
> > > The new cfs_rq's runnable_load_avg is improved to be updated with all of the
> > > runnable sched_eneities at the same time, so the one sched_entity updated and
> > > the others stale problem is solved.
> > >
> >
> > How about tracking cfs_rq's blocked_load_avg instead of
> > runnable_load_avg, because, AFAICS:
> >
> > cfs_rq->runnable_load_avg = se->avg.load_avg - cfs_rq->blocked_load_avg.
>
> No, cfs_rq->runnable_load_avg = cfs_rq->avg.load_avg - cfs_rq->blocked_load_avg,
> without rounding errors and the like.
>

Oh, sorry.. yeah, you're right here.

> > se is the corresponding sched_entity of cfs_rq. And when we need the
> > runnable_load_avg, we just calculate by the expression above.
> >
> > This can be thought as a lazy way to update runnable_load_avg, and we
> > don't need to modify __update_load_avg any more.
>
> Not lazy at all, but adding (as of now) useless blocked_load_avg and an
> extra subtraction.

but we can remove runnable_load_avg tracking code in __update_load_avg,
as you do in this patch, right?

> Or did you forget blocked_load_avg also needs to be updated/decayed as
> time elapses?

I know we need to update or decay the blocked_load_avg, but we only need
to update and decay when 1) entity dequeued/enqueued 2) entity migrated
or 3) we need the runnable_load_avg value calcuated by blocked_load_avg,
right?

These are more rare than __update_load_avg called, right?

Regards,
Boqun


Attachments:
(No filename) (2.00 kB)
signature.asc (473.00 B)
Download all attachments

2015-07-21 10:30:02

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH v10 6/7] sched: Provide runnable_load_avg back to cfs_rq

On Tue, Jul 21, 2015 at 06:18:46PM +0800, Boqun Feng wrote:
> On Tue, Jul 21, 2015 at 08:44:01AM +0800, Yuyang Du wrote:
> > On Tue, Jul 21, 2015 at 09:08:07AM +0800, Boqun Feng wrote:
> > > Hi Yuyang,
> > >
> > > On Wed, Jul 15, 2015 at 08:04:41AM +0800, Yuyang Du wrote:
> > > > The cfs_rq's load_avg is composed of runnable_load_avg and blocked_load_avg.
> > > > Before this series, sometimes the runnable_load_avg is used, and sometimes
> > > > the load_avg is used. Completely replacing all uses of runnable_load_avg
> > > > with load_avg may be too big a leap, i.e., the blocked_load_avg is concerned
> > > > to result in overrated load. Therefore, we get runnable_load_avg back.
> > > >
> > > > The new cfs_rq's runnable_load_avg is improved to be updated with all of the
> > > > runnable sched_eneities at the same time, so the one sched_entity updated and
> > > > the others stale problem is solved.
> > > >
> > >
> > > How about tracking cfs_rq's blocked_load_avg instead of
> > > runnable_load_avg, because, AFAICS:
> > >
> > > cfs_rq->runnable_load_avg = se->avg.load_avg - cfs_rq->blocked_load_avg.
> >
> > No, cfs_rq->runnable_load_avg = cfs_rq->avg.load_avg - cfs_rq->blocked_load_avg,
> > without rounding errors and the like.
> >
>
> Oh, sorry.. yeah, you're right here.
>

The point is that you have already tracked the sum of runnable_load_avg
and blocked_load_avg in cfs_rq->avg.load_avg. If you're going to track
part of the sum, you'd better track the one that's updated less
frequently, right?

Anyway, this idea just comes into my mind. I wonder which is udpated
less frequently myself too. ;-) So I ask to see whether there is
something we can improve.

Regards,
Boqun


Attachments:
(No filename) (1.66 kB)
signature.asc (473.00 B)
Download all attachments

2015-07-22 02:20:02

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH v10 6/7] sched: Provide runnable_load_avg back to cfs_rq

On Wed, Jul 15, 2015 at 08:04:41AM +0800, Yuyang Du wrote:
> The cfs_rq's load_avg is composed of runnable_load_avg and blocked_load_avg.
> Before this series, sometimes the runnable_load_avg is used, and sometimes
> the load_avg is used. Completely replacing all uses of runnable_load_avg
> with load_avg may be too big a leap, i.e., the blocked_load_avg is concerned
> to result in overrated load. Therefore, we get runnable_load_avg back.
>
> The new cfs_rq's runnable_load_avg is improved to be updated with all of the
> runnable sched_eneities at the same time, so the one sched_entity updated and
> the others stale problem is solved.
>
> Signed-off-by: Yuyang Du <[email protected]>
> ---

<snip>

> +/* Remove the runnable load generated by se from cfs_rq's runnable load average */
> +static inline void
> +dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> + update_load_avg(se, 1);
> +

I think we need an update_cfs_rq_load_avg() here? Because the
runnable_load_avg may not be up to date when dequeue_entity_load_avg()
is called, right?

> + cfs_rq->runnable_load_avg =
> + max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
> + cfs_rq->runnable_load_sum =
> + max_t(s64, cfs_rq->runnable_load_sum - se->avg.load_sum, 0);
> +}
> +
> /*
> * Task first catches up with cfs_rq, and then subtract
> * itself from the cfs_rq (task must be off the queue now).

<snip>

> @@ -2982,7 +3015,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> * Update run-time statistics of the 'current'.
> */
> update_curr(cfs_rq);
> - update_load_avg(se, 1);
> + dequeue_entity_load_avg(cfs_rq, se);
>
> update_stats_dequeue(cfs_rq, se);
> if (flags & DEQUEUE_SLEEP) {

Thanks and Best Regards,
Boqun


Attachments:
(No filename) (1.74 kB)
signature.asc (473.00 B)
Download all attachments

2015-07-24 16:41:40

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v10 2/7] sched: Rewrite runnable load and utilization average tracking

Hi Yuyang,

On 15/07/15 01:04, Yuyang Du wrote:

[...]

> @@ -4674,7 +4487,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
> /*
> * w = rw_i + @wl
> */
> - w = se->my_q->load.weight + wl;
> + w = se->my_q->avg.load_avg + wl;
>
> /*
> * wl = S * s'_i; see (2)

There is a comment 'Per the above, wl is the new *se->load.weight*
value'. This should be replaced by *se->avg.load_avg*. Also the function
header explains the functionality of effective_load() based on weight
and not sched_avg::load_avg.

> @@ -4695,7 +4508,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
> /*
> * wl = dw_i = S * (s'_i - s_i); see (3)
> */
> - wl -= se->load.weight;
> + wl -= se->avg.load_avg;
>
> /*
> * Recursively apply this logic to all parent groups to compute
> @@ -4769,14 +4582,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> */
> if (sync) {
> tg = task_group(current);
> - weight = current->se.load.weight;
> + weight = current->se.avg.load_avg;
>
> this_load += effective_load(tg, this_cpu, -weight, -weight);
> load += effective_load(tg, prev_cpu, 0, -weight);
> }
>
> tg = task_group(p);
> - weight = p->se.load.weight;
> + weight = p->se.avg.load_avg;

You changed cfs_rq->load.weight to cfs_rq->avg.load_avg and
se->load.weight to se->avg.load_avg in effective_load() and
wake_affine() in v2.
I wasn't able to find explanation why you did this. I mean we still have
to maintain 'struct load_weight' on cfs_rq's and se's representing tg's.

-- Dietmar

[...]

2015-07-24 16:41:50

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v10 7/7] sched: Clean up load average references

On 15/07/15 01:04, Yuyang Du wrote:
> For cfs_rq, we have load.weight, runnable_load_avg, and load_avg. We
> now start to clean up how they are used.
>
> First, as group sched_entity already largely uses load_avg, we now expand
> to use load_avg in all cases.

You're talking about group se's or cfs_rq owned by the group se's
(se->my_q) here or both?

Just asking because both data structures (cfs_rq and se) have a 'struct
load_weight load' as well as 'struct sched_avg avg' member.

Second, for CPU-wide load balancing, we
> choose to use runnable_load_avg in all cases, which is the same as before
> this series.

With your patch-set there will be still the difference of
'cfs_rq->utilization_load_avg' and your 'cfs_rq->avg.util_avg' in the
sense that the former one does not contain the contribution of blocked se's.

The EAS patch-set adds blocked utilization contribution:
https://lkml.org/lkml/2015/7/7/915

The cfs_rq utilization is also used by the load-balancer code via
get_cpu_usage() so the blocked utilization contribution to
'cfs_rq->avg.util_avg' can change load-balancing as well.

Since it is not as heavily used as the cfs_rq->runnable_load_avg we
might not need to reintroduce cfs_rq->utilization_load_avg but at least
mention this here.

-- Dietmar

[...]

2015-07-27 02:34:29

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH v10 6/7] sched: Provide runnable_load_avg back to cfs_rq

Hi Boqun,

On Tue, Jul 21, 2015 at 06:29:56PM +0800, Boqun Feng wrote:
> The point is that you have already tracked the sum of runnable_load_avg
> and blocked_load_avg in cfs_rq->avg.load_avg. If you're going to track
> part of the sum, you'd better track the one that's updated less
> frequently, right?
>
> Anyway, this idea just comes into my mind. I wonder which is udpated
> less frequently myself too. ;-) So I ask to see whether there is
> something we can improve.

Actually, this is not the point.

1) blocked load is more "difficult" to track, hint, migrate.

2) r(t1) - b(t2) is not anything, hint, t1 != t2

2015-07-27 03:36:40

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH v10 6/7] sched: Provide runnable_load_avg back to cfs_rq

On Wed, Jul 22, 2015 at 10:19:54AM +0800, Boqun Feng wrote:
>
> > +/* Remove the runnable load generated by se from cfs_rq's runnable load average */
> > +static inline void
> > +dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > +{
> > + update_load_avg(se, 1);
> > +
>
> I think we need an update_cfs_rq_load_avg() here? Because the
> runnable_load_avg may not be up to date when dequeue_entity_load_avg()
> is called, right?

Not in update_load_avg()?

2015-07-27 03:47:37

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH v10 6/7] sched: Provide runnable_load_avg back to cfs_rq

On Mon, Jul 27, 2015 at 11:21:15AM +0800, Boqun Feng wrote:
> Hi Yuyang,
>
> On Mon, Jul 27, 2015 at 02:43:25AM +0800, Yuyang Du wrote:
> > Hi Boqun,
> >
> > On Tue, Jul 21, 2015 at 06:29:56PM +0800, Boqun Feng wrote:
> > > The point is that you have already tracked the sum of runnable_load_avg
> > > and blocked_load_avg in cfs_rq->avg.load_avg. If you're going to track
> > > part of the sum, you'd better track the one that's updated less
> > > frequently, right?
> > >
> > > Anyway, this idea just comes into my mind. I wonder which is udpated
> > > less frequently myself too. ;-) So I ask to see whether there is
> > > something we can improve.
> >
> > Actually, this is not the point.
> >

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > 1) blocked load is more "difficult" to track, hint, migrate.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

> > 2) r(t1) - b(t2) is not anything, hint, t1 != t2
>
> Please consider this patch below, which is not tested yet, just for
> discussion. This patch is based on 1-5 in your patchset and going to
> replace patch 6. Hope this could make my point clear.
>
> Thanks anyway for being patient with me ;-)

2015-07-27 04:15:32

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH v10 2/7] sched: Rewrite runnable load and utilization average tracking

Hi Dietmar,

On Fri, Jul 24, 2015 at 05:41:35PM +0100, Dietmar Eggemann wrote:
> Hi Yuyang,
>
> On 15/07/15 01:04, Yuyang Du wrote:
>
> [...]
>
> > @@ -4674,7 +4487,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
> > /*
> > * w = rw_i + @wl
> > */
> > - w = se->my_q->load.weight + wl;
> > + w = se->my_q->avg.load_avg + wl;
> >
> > /*
> > * wl = S * s'_i; see (2)
>
> There is a comment 'Per the above, wl is the new *se->load.weight*
> value'. This should be replaced by *se->avg.load_avg*. Also the function
> header explains the functionality of effective_load() based on weight
> and not sched_avg::load_avg.

I think it is already replaced when effective_load is called.

About load.weight vs. load_avg, see below.

> > @@ -4695,7 +4508,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
> > /*
> > * wl = dw_i = S * (s'_i - s_i); see (3)
> > */
> > - wl -= se->load.weight;
> > + wl -= se->avg.load_avg;
> >
> > /*
> > * Recursively apply this logic to all parent groups to compute
> > @@ -4769,14 +4582,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> > */
> > if (sync) {
> > tg = task_group(current);
> > - weight = current->se.load.weight;
> > + weight = current->se.avg.load_avg;
> >
> > this_load += effective_load(tg, this_cpu, -weight, -weight);
> > load += effective_load(tg, prev_cpu, 0, -weight);
> > }
> >
> > tg = task_group(p);
> > - weight = p->se.load.weight;
> > + weight = p->se.avg.load_avg;
>
> You changed cfs_rq->load.weight to cfs_rq->avg.load_avg and
> se->load.weight to se->avg.load_avg in effective_load() and
> wake_affine() in v2.
> I wasn't able to find explanation why you did this. I mean we still have
> to maintain 'struct load_weight' on cfs_rq's and se's representing tg's.

Yes, I might not have explained it specifically, but back then, it was
simply motivated/reasoned by consistently expressing the load with load_avg.

As of now, it is sort of the same, adding as I previously stated, as far
as group SE is concerned, we use load_avg, instread of runnable_load_avg
or load.weight.

As was also suggested by Morten, we need to revisit the bulk of the load
balancing code a lot, including rethinking about what to use: load.weight,
or runnable_load_avg, or load_avg. I think this patch series is just a
starter.

Thanks,
Yuyang

2015-07-27 04:22:26

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH v10 7/7] sched: Clean up load average references

Hi Dietmar,

On Fri, Jul 24, 2015 at 05:41:45PM +0100, Dietmar Eggemann wrote:
> On 15/07/15 01:04, Yuyang Du wrote:
> > For cfs_rq, we have load.weight, runnable_load_avg, and load_avg. We
> > now start to clean up how they are used.
> >
> > First, as group sched_entity already largely uses load_avg, we now expand
> > to use load_avg in all cases.
>
> You're talking about group se's or cfs_rq owned by the group se's
> (se->my_q) here or both?

Definitely, group SE, and if the cfs_rq owned by group SE is also concerned
with group SE, then both. I don't think this is very well calculated to be
optimal, but probably this is the right move I can think of now.

We need to revisit all of the codes before we can at least make a final call.

> Just asking because both data structures (cfs_rq and se) have a 'struct
> load_weight load' as well as 'struct sched_avg avg' member.
>
> Second, for CPU-wide load balancing, we
> > choose to use runnable_load_avg in all cases, which is the same as before
> > this series.
>
> With your patch-set there will be still the difference of
> 'cfs_rq->utilization_load_avg' and your 'cfs_rq->avg.util_avg' in the
> sense that the former one does not contain the contribution of blocked se's.
>
> The EAS patch-set adds blocked utilization contribution:
> https://lkml.org/lkml/2015/7/7/915
>
> The cfs_rq utilization is also used by the load-balancer code via
> get_cpu_usage() so the blocked utilization contribution to
> 'cfs_rq->avg.util_avg' can change load-balancing as well.
>
> Since it is not as heavily used as the cfs_rq->runnable_load_avg we
> might not need to reintroduce cfs_rq->utilization_load_avg but at least
> mention this here.
>

Yes, thanks.

2015-07-27 04:25:13

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH v10 6/7] sched: Provide runnable_load_avg back to cfs_rq

On Mon, Jul 27, 2015 at 12:04:20PM +0800, Boqun Feng wrote:
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > 1) blocked load is more "difficult" to track, hint, migrate.
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> I may not get your point here? Are you saying my patch fails to handle
> the migration or are you just telling me that blocked load tracking need
> to take migration into consideration?

Both, is it so difficult to get?

> If it's the latter one, I want to say that, with blocked load or not, we
> have to handle load_avg in migrations, so *adding* some code to handle
> blocked load is not a big deal.
>
> Please consider this piece of code in update_cfs_rq_load_avg(), which
> decays and updates blocked_load_avg.

At this point of time, you tell me why exactly you want to track the blocked?

2015-07-27 03:21:23

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH v10 6/7] sched: Provide runnable_load_avg back to cfs_rq

Hi Yuyang,

On Mon, Jul 27, 2015 at 02:43:25AM +0800, Yuyang Du wrote:
> Hi Boqun,
>
> On Tue, Jul 21, 2015 at 06:29:56PM +0800, Boqun Feng wrote:
> > The point is that you have already tracked the sum of runnable_load_avg
> > and blocked_load_avg in cfs_rq->avg.load_avg. If you're going to track
> > part of the sum, you'd better track the one that's updated less
> > frequently, right?
> >
> > Anyway, this idea just comes into my mind. I wonder which is udpated
> > less frequently myself too. ;-) So I ask to see whether there is
> > something we can improve.
>
> Actually, this is not the point.
>
> 1) blocked load is more "difficult" to track, hint, migrate.
>
> 2) r(t1) - b(t2) is not anything, hint, t1 != t2

Please consider this patch below, which is not tested yet, just for
discussion. This patch is based on 1-5 in your patchset and going to
replace patch 6. Hope this could make my point clear.

Thanks anyway for being patient with me ;-)

Regards,
Boqun

========================================================================

Subject: [PATCH] sched: lazy blocked load tracking

With this patch, cfs_rq_runnable_load_avg can be implemented as follow:

static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
{
u64 now = cfs_rq_clock_task(cfs_rq);
decay_cfs_rq_blocked_load(now, cfs_rq);

return max_t(long, cfs_rq->avg.load_avg - cfs_rq->blocked_load_avg, 0);
}

---
kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 4 ++++
2 files changed, 45 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e977074..76beb81 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2625,6 +2625,20 @@ static __always_inline int __update_load_avg(u64 now, int cpu,
return decayed;
}

+static inline u64 decay_cfs_rq_blocked_load(u64 now, struct cfs_rq *cfs_rq)
+{
+ u64 decays;
+
+ now = now >> 20;
+ decays = now - cfs_rq->last_blocked_load_decays;
+
+ cfs_rq->blocked_load_sum = decay_load(cfs_rq->blocked_load_sum, decays);
+ cfs_rq->blocked_load_avg = div_u64(cfs->blocked_load_sum, LOAD_AVG_MAX);
+ cfs_rq->last_blocked_load_update_time = now;
+
+ return decays;
+}
+
#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* Updating tg's load_avg is necessary before update_cfs_share (which is done)
@@ -2656,6 +2670,12 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
sa->load_avg = max_t(long, sa->load_avg - r, 0);
sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+
+ decay_cfs_rq_blocked_load(sa->last_update_time, cfs_rq);
+ cfs_rq->blocked_load_avg = max_t(long,
+ cfs_rq->blocked_load_avg - r, 0);
+ cfs_rq->blocked_load_sum = max_t(s64,
+ cfs_rq->blocked_load_avg - r * LOAD_AVG_MAX, 0);
}

if (atomic_long_read(&cfs_rq->removed_util_avg)) {
@@ -2719,11 +2739,32 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
cfs_rq->avg.util_avg += sa->util_avg;
cfs_rq->avg.util_sum += sa->util_sum;
}
+ else {
+ decay_cfs_rq_blocked_load(now, cfs_rq);
+
+ cfs_rq->blocked_load_avg = max_t(long,
+ cfs_rq->blocked_load_avg - sa->load_avg, 0);
+ cfs_rq->blocked_load_sum = max_t(long,
+ cfs_rq->blocked_load_sum - sa->load_sum, 0);
+ }

if (decayed || migrated)
update_tg_load_avg(cfs_rq, 0);
}

+static inline void
+dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ u64 now = cfs_rq_clock_task(cfs_rq);
+
+ update_load_avg(se, 1);
+ update_cfs_rq_load_avg(now, cfs_rq);
+ decay_cfs_rq_blocked_load(now, cfs_rq);
+
+ cfs_rq->blocked_load_sum += se->avg.load_sum;
+ cfs_rq->blocked_load_avg += se->avg.load_avg;
+}
+
/*
* Task first catches up with cfs_rq, and then subtract
* itself from the cfs_rq (task must be off the queue now).
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4d139e0..f570306 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -368,6 +368,10 @@ struct cfs_rq {
* CFS load tracking
*/
struct sched_avg avg;
+
+ u64 last_blocked_load_decays;
+ u64 blocked_load_sum;
+ unsigned long blocked_load_avg;
#ifdef CONFIG_FAIR_GROUP_SCHED
unsigned long tg_load_avg_contrib;
#endif
--
2.4.6

2015-07-27 03:30:10

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH v10 6/7] sched: Provide runnable_load_avg back to cfs_rq

On Mon, Jul 27, 2015 at 11:21:15AM +0800, Boqun Feng wrote:
> Hi Yuyang,
>
> On Mon, Jul 27, 2015 at 02:43:25AM +0800, Yuyang Du wrote:
> > Hi Boqun,
> >
> > On Tue, Jul 21, 2015 at 06:29:56PM +0800, Boqun Feng wrote:
> > > The point is that you have already tracked the sum of runnable_load_avg
> > > and blocked_load_avg in cfs_rq->avg.load_avg. If you're going to track
> > > part of the sum, you'd better track the one that's updated less
> > > frequently, right?
> > >
> > > Anyway, this idea just comes into my mind. I wonder which is udpated
> > > less frequently myself too. ;-) So I ask to see whether there is
> > > something we can improve.
> >
> > Actually, this is not the point.
> >
> > 1) blocked load is more "difficult" to track, hint, migrate.
> >
> > 2) r(t1) - b(t2) is not anything, hint, t1 != t2
>
> Please consider this patch below, which is not tested yet, just for
> discussion. This patch is based on 1-5 in your patchset and going to
> replace patch 6. Hope this could make my point clear.
>
> Thanks anyway for being patient with me ;-)
>
> Regards,
> Boqun
>
> ========================================================================
>
> Subject: [PATCH] sched: lazy blocked load tracking
>
> With this patch, cfs_rq_runnable_load_avg can be implemented as follow:
>
> static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
> {
> u64 now = cfs_rq_clock_task(cfs_rq);
> decay_cfs_rq_blocked_load(now, cfs_rq);
>
> return max_t(long, cfs_rq->avg.load_avg - cfs_rq->blocked_load_avg, 0);
> }
>
> ---
> kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 4 ++++
> 2 files changed, 45 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e977074..76beb81 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2625,6 +2625,20 @@ static __always_inline int __update_load_avg(u64 now, int cpu,
> return decayed;
> }
>
> +static inline u64 decay_cfs_rq_blocked_load(u64 now, struct cfs_rq *cfs_rq)
> +{
> + u64 decays;
> +
> + now = now >> 20;
> + decays = now - cfs_rq->last_blocked_load_decays;
> +
> + cfs_rq->blocked_load_sum = decay_load(cfs_rq->blocked_load_sum, decays);
> + cfs_rq->blocked_load_avg = div_u64(cfs->blocked_load_sum, LOAD_AVG_MAX);
> + cfs_rq->last_blocked_load_update_time = now;

Sorry for the typo, should be last_blocked_load_decays here ;-)

Regards,
Boqun


Attachments:
(No filename) (2.38 kB)
signature.asc (473.00 B)
Download all attachments

2015-07-27 04:04:28

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH v10 6/7] sched: Provide runnable_load_avg back to cfs_rq

Hi Yuyang,

On Mon, Jul 27, 2015 at 03:56:34AM +0800, Yuyang Du wrote:
> On Mon, Jul 27, 2015 at 11:21:15AM +0800, Boqun Feng wrote:
> > Hi Yuyang,
> >
> > On Mon, Jul 27, 2015 at 02:43:25AM +0800, Yuyang Du wrote:
> > > Hi Boqun,
> > >
> > > On Tue, Jul 21, 2015 at 06:29:56PM +0800, Boqun Feng wrote:
> > > > The point is that you have already tracked the sum of runnable_load_avg
> > > > and blocked_load_avg in cfs_rq->avg.load_avg. If you're going to track
> > > > part of the sum, you'd better track the one that's updated less
> > > > frequently, right?
> > > >
> > > > Anyway, this idea just comes into my mind. I wonder which is udpated
> > > > less frequently myself too. ;-) So I ask to see whether there is
> > > > something we can improve.
> > >
> > > Actually, this is not the point.
> > >
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > 1) blocked load is more "difficult" to track, hint, migrate.
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I may not get your point here? Are you saying my patch fails to handle
the migration or are you just telling me that blocked load tracking need
to take migration into consideration?

If it's the latter one, I want to say that, with blocked load or not, we
have to handle load_avg in migrations, so *adding* some code to handle
blocked load is not a big deal.

Please consider this piece of code in update_cfs_rq_load_avg(), which
decays and updates blocked_load_avg.

@@ -2656,6 +2670,12 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
sa->load_avg = max_t(long, sa->load_avg - r, 0);
sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+
+ decay_cfs_rq_blocked_load(sa->last_update_time, cfs_rq);
+ cfs_rq->blocked_load_avg = max_t(long,
+ cfs_rq->blocked_load_avg - r, 0);
+ cfs_rq->blocked_load_sum = max_t(s64,
+ cfs_rq->blocked_load_avg - r * LOAD_AVG_MAX, 0);
}

Regards,
Boqun


Attachments:
(No filename) (1.97 kB)
signature.asc (473.00 B)
Download all attachments

2015-07-27 05:16:29

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH v10 6/7] sched: Provide runnable_load_avg back to cfs_rq

Hi Yuyang,

On Mon, Jul 27, 2015 at 04:34:09AM +0800, Yuyang Du wrote:
> On Mon, Jul 27, 2015 at 12:04:20PM +0800, Boqun Feng wrote:
> > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > > 1) blocked load is more "difficult" to track, hint, migrate.
> > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > I may not get your point here? Are you saying my patch fails to handle
> > the migration or are you just telling me that blocked load tracking need
> > to take migration into consideration?
>
> Both, is it so difficult to get?
>

Hmm.. I will appreciate more if you comment on my patch to point out
where is wrong ;-)

> > If it's the latter one, I want to say that, with blocked load or not, we
> > have to handle load_avg in migrations, so *adding* some code to handle
> > blocked load is not a big deal.
> >
> > Please consider this piece of code in update_cfs_rq_load_avg(), which
> > decays and updates blocked_load_avg.
>
> At this point of time, you tell me why exactly you want to track the blocked?

I want to track the blocked load because you want to track runnable load
in your patch, and as you've already tracked load_avg, which is the sum
of blocked load and runnable load, so I wonder whether tracking blocked
load is *another* way to track runnable load. Because if tracking
blocked load costs less than tracking runnable load, it's of course
better to track blocked load and calculate runnable load on demand
rather than track runnable load directly.

Yes, I do need to decay and update blocked_load_avg in
update_cfs_rq_load_avg() if there is a *non-zero* value of
remove_load_avg, but:

1. if no entity is migrated and no entity is dequeued(blocked), I
need to do nothing but tracking runnable load directly still
needs to update runnable load, for example, in entity_tick().
2. if no entity is migrated and a entity is dequeued(blocked), what
I need to do is similar as tracking runnable load directly does.

and of course:

3. if a entity is migrated, I do need to do more than tracking
runnable load directly.

So,
For #1 situations, tracking blocked load wins
For #2 situations, tie
For #3 situations, tracking runnable load wins

And which are more rare in the system, #1 or #3?


I write that patch to see how much we need to track blocked load
*instead* of runnable load, and I don't see that costs a lot. So I
basically want to know, is my patch of tracking blocked wrong? If not,
does that cost more than or nearly equal to tracking runnable load
directly? If not, why not track blocked load instead of tracking
runnable load?

However, I admit all the questions should be answered by real
benchmarks. I just want to see whether you have thought the same
questions as I and could give me a quick answer.

But I think an simple and directly answer is that "because we need
runnable load so we track it", code simplicity wins! if your current
answer is that, I'm OK with it, and will do some benchmark myself to see
whether it's worth to track blocked load rather than runnable load
directly. Again, I still hope a quick and convinced answer from you.
Thank you ;-)

Regards,
Boqun


Attachments:
(No filename) (3.10 kB)
signature.asc (473.00 B)
Download all attachments
Subject: [tip:sched/core] sched/fair: Remove rq's runnable avg

Commit-ID: cd126afe838d7ea9b971cdea087fd498a7293c7f
Gitweb: http://git.kernel.org/tip/cd126afe838d7ea9b971cdea087fd498a7293c7f
Author: Yuyang Du <[email protected]>
AuthorDate: Wed, 15 Jul 2015 08:04:36 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 3 Aug 2015 12:21:28 +0200

sched/fair: Remove rq's runnable avg

The current rq->avg is not used at all since its merge into the kernel,
and the code is in the scheduler's hot path, so remove it.

Tested-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Yuyang Du <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/debug.c | 7 +------
kernel/sched/fair.c | 25 ++++---------------------
kernel/sched/sched.h | 2 --
3 files changed, 5 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4222ec5..363b7e8 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -68,13 +68,8 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
#define PN(F) \
SEQ_printf(m, " .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)F))

- if (!se) {
- struct sched_avg *avg = &cpu_rq(cpu)->avg;
- P(avg->runnable_avg_sum);
- P(avg->avg_period);
+ if (!se)
return;
- }
-

PN(se->exec_start);
PN(se->vruntime);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea23f9f..90292c67 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2724,19 +2724,12 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
}
}

-static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
-{
- __update_entity_runnable_avg(rq_clock_task(rq), cpu_of(rq), &rq->avg,
- runnable, runnable);
- __update_tg_runnable_avg(&rq->avg, &rq->cfs);
-}
#else /* CONFIG_FAIR_GROUP_SCHED */
static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
int force_update) {}
static inline void __update_tg_runnable_avg(struct sched_avg *sa,
struct cfs_rq *cfs_rq) {}
static inline void __update_group_entity_contrib(struct sched_entity *se) {}
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
#endif /* CONFIG_FAIR_GROUP_SCHED */

static inline void __update_task_entity_contrib(struct sched_entity *se)
@@ -2940,7 +2933,6 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
*/
void idle_enter_fair(struct rq *this_rq)
{
- update_rq_runnable_avg(this_rq, 1);
}

/*
@@ -2950,7 +2942,6 @@ void idle_enter_fair(struct rq *this_rq)
*/
void idle_exit_fair(struct rq *this_rq)
{
- update_rq_runnable_avg(this_rq, 0);
}

static int idle_balance(struct rq *this_rq);
@@ -2959,7 +2950,6 @@ static int idle_balance(struct rq *this_rq);

static inline void update_entity_load_avg(struct sched_entity *se,
int update_cfs_rq) {}
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
struct sched_entity *se,
int wakeup) {}
@@ -4258,10 +4248,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_entity_load_avg(se, 1);
}

- if (!se) {
- update_rq_runnable_avg(rq, rq->nr_running);
+ if (!se)
add_nr_running(rq, 1);
- }
+
hrtick_update(rq);
}

@@ -4319,10 +4308,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_entity_load_avg(se, 1);
}

- if (!se) {
+ if (!se)
sub_nr_running(rq, 1);
- update_rq_runnable_avg(rq, 1);
- }
+
hrtick_update(rq);
}

@@ -6005,9 +5993,6 @@ static void __update_blocked_averages_cpu(struct task_group *tg, int cpu)
*/
if (!se->avg.runnable_avg_sum && !cfs_rq->nr_running)
list_del_leaf_cfs_rq(cfs_rq);
- } else {
- struct rq *rq = rq_of(cfs_rq);
- update_rq_runnable_avg(rq, rq->nr_running);
}
}

@@ -7988,8 +7973,6 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

if (numabalancing_enabled)
task_tick_numa(rq, curr);
-
- update_rq_runnable_avg(rq, 1);
}

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 84d4879..e13210c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -595,8 +595,6 @@ struct rq {
#ifdef CONFIG_FAIR_GROUP_SCHED
/* list of leaf cfs_rq on this cpu: */
struct list_head leaf_cfs_rq_list;
-
- struct sched_avg avg;
#endif /* CONFIG_FAIR_GROUP_SCHED */

/*

Subject: [tip:sched/core] sched/fair: Rewrite runnable load and utilization average tracking

Commit-ID: 9d89c257dfb9c51a532d69397f6eed75e5168c35
Gitweb: http://git.kernel.org/tip/9d89c257dfb9c51a532d69397f6eed75e5168c35
Author: Yuyang Du <[email protected]>
AuthorDate: Wed, 15 Jul 2015 08:04:37 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 3 Aug 2015 12:21:29 +0200

sched/fair: Rewrite runnable load and utilization average tracking

The idea of runnable load average (let runnable time contribute to weight)
was proposed by Paul Turner and Ben Segall, and it is still followed by
this rewrite. This rewrite aims to solve the following issues:

1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is
updated at the granularity of an entity at a time, which results in the
cfs_rq's load average is stale or partially updated: at any time, only
one entity is up to date, all other entities are effectively lagging
behind. This is undesirable.

To illustrate, if we have n runnable entities in the cfs_rq, as time
elapses, they certainly become outdated:

t0: cfs_rq { e1_old, e2_old, ..., en_old }

and when we update:

t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old }

t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old }

...

We solve this by combining all runnable entities' load averages together
in cfs_rq's avg, and update the cfs_rq's avg as a whole. This is based
on the fact that if we regard the update as a function, then:

w * update(e) = update(w * e) and

update(e1) + update(e2) = update(e1 + e2), then

w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2)

therefore, by this rewrite, we have an entirely updated cfs_rq at the
time we update it:

t1: update cfs_rq { e1_new, e2_new, ..., en_new }

t2: update cfs_rq { e1_new, e2_new, ..., en_new }

...

2. cfs_rq's load average is different between top rq->cfs_rq and other
task_group's per CPU cfs_rqs in whether or not blocked_load_average
contributes to the load.

The basic idea behind runnable load average (the same for utilization)
is that the blocked state is taken into account as opposed to only
accounting for the currently runnable state. Therefore, the average
should include both the runnable/running and blocked load averages.
This rewrite does that.

In addition, we also combine runnable/running and blocked averages
of all entities into the cfs_rq's average, and update it together at
once. This is based on the fact that:

update(runnable) + update(blocked) = update(runnable + blocked)

This significantly reduces the code as we don't need to separately
maintain/update runnable/running load and blocked load.

3. How task_group entities' share is calculated is complex and imprecise.

We reduce the complexity in this rewrite to allow a very simple rule:
the task_group's load_avg is aggregated from its per CPU cfs_rqs's
load_avgs. Then group entity's weight is simply proportional to its
own cfs_rq's load_avg / task_group's load_avg. To illustrate,

if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then,

task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then

cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share

To sum up, this rewrite in principle is equivalent to the current one, but
fixes the issues described above. Turns out, it significantly reduces the
code complexity and hence increases clarity and efficiency. In addition,
the new averages are more smooth/continuous (no spurious spikes and valleys)
and updated more consistently and quickly to reflect the load dynamics.

As a result, we have less load tracking overhead, better performance,
and especially better power efficiency due to more balanced load.

Signed-off-by: Yuyang Du <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 41 ++--
kernel/sched/core.c | 3 -
kernel/sched/debug.c | 41 ++--
kernel/sched/fair.c | 630 ++++++++++++++++----------------------------------
kernel/sched/sched.h | 28 +--
5 files changed, 249 insertions(+), 494 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9c14465..44dca5b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1175,29 +1175,24 @@ struct load_weight {
u32 inv_weight;
};

+/*
+ * The load_avg/util_avg accumulates an infinite geometric series.
+ * 1) load_avg factors the amount of time that a sched_entity is
+ * runnable on a rq into its weight. For cfs_rq, it is the aggregated
+ * such weights of all runnable and blocked sched_entities.
+ * 2) util_avg factors frequency scaling into the amount of time
+ * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
+ * For cfs_rq, it is the aggregated such times of all runnable and
+ * blocked sched_entities.
+ * The 64 bit load_sum can:
+ * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with
+ * the highest weight (=88761) always runnable, we should not overflow
+ * 2) for entity, support any load.weight always runnable
+ */
struct sched_avg {
- u64 last_runnable_update;
- s64 decay_count;
- /*
- * utilization_avg_contrib describes the amount of time that a
- * sched_entity is running on a CPU. It is based on running_avg_sum
- * and is scaled in the range [0..SCHED_LOAD_SCALE].
- * load_avg_contrib described the amount of time that a sched_entity
- * is runnable on a rq. It is based on both runnable_avg_sum and the
- * weight of the task.
- */
- unsigned long load_avg_contrib, utilization_avg_contrib;
- /*
- * These sums represent an infinite geometric series and so are bound
- * above by 1024/(1-y). Thus we only need a u32 to store them for all
- * choices of y < 1-2^(-32)*1024.
- * running_avg_sum reflects the time that the sched_entity is
- * effectively running on the CPU.
- * runnable_avg_sum represents the amount of time a sched_entity is on
- * a runqueue which includes the running time that is monitored by
- * running_avg_sum.
- */
- u32 runnable_avg_sum, avg_period, running_avg_sum;
+ u64 last_update_time, load_sum;
+ u32 util_sum, period_contrib;
+ unsigned long load_avg, util_avg;
};

#ifdef CONFIG_SCHEDSTATS
@@ -1263,7 +1258,7 @@ struct sched_entity {
#endif

#ifdef CONFIG_SMP
- /* Per-entity load-tracking */
+ /* Per entity load average tracking */
struct sched_avg avg;
#endif
};
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5fad2b..3981526 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2020,9 +2020,6 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.prev_sum_exec_runtime = 0;
p->se.nr_migrations = 0;
p->se.vruntime = 0;
-#ifdef CONFIG_SMP
- p->se.avg.decay_count = 0;
-#endif
INIT_LIST_HEAD(&p->se.group_node);

#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 363b7e8..74f276f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -88,12 +88,8 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
#endif
P(se->load.weight);
#ifdef CONFIG_SMP
- P(se->avg.runnable_avg_sum);
- P(se->avg.running_avg_sum);
- P(se->avg.avg_period);
- P(se->avg.load_avg_contrib);
- P(se->avg.utilization_avg_contrib);
- P(se->avg.decay_count);
+ P(se->avg.load_avg);
+ P(se->avg.util_avg);
#endif
#undef PN
#undef P
@@ -209,21 +205,19 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SEQ_printf(m, " .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
SEQ_printf(m, " .%-30s: %ld\n", "load", cfs_rq->load.weight);
#ifdef CONFIG_SMP
- SEQ_printf(m, " .%-30s: %ld\n", "runnable_load_avg",
- cfs_rq->runnable_load_avg);
- SEQ_printf(m, " .%-30s: %ld\n", "blocked_load_avg",
- cfs_rq->blocked_load_avg);
- SEQ_printf(m, " .%-30s: %ld\n", "utilization_load_avg",
- cfs_rq->utilization_load_avg);
+ SEQ_printf(m, " .%-30s: %lu\n", "load_avg",
+ cfs_rq->avg.load_avg);
+ SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
+ cfs_rq->avg.util_avg);
+ SEQ_printf(m, " .%-30s: %ld\n", "removed_load_avg",
+ atomic_long_read(&cfs_rq->removed_load_avg));
+ SEQ_printf(m, " .%-30s: %ld\n", "removed_util_avg",
+ atomic_long_read(&cfs_rq->removed_util_avg));
#ifdef CONFIG_FAIR_GROUP_SCHED
- SEQ_printf(m, " .%-30s: %ld\n", "tg_load_contrib",
- cfs_rq->tg_load_contrib);
- SEQ_printf(m, " .%-30s: %d\n", "tg_runnable_contrib",
- cfs_rq->tg_runnable_contrib);
+ SEQ_printf(m, " .%-30s: %lu\n", "tg_load_avg_contrib",
+ cfs_rq->tg_load_avg_contrib);
SEQ_printf(m, " .%-30s: %ld\n", "tg_load_avg",
atomic_long_read(&cfs_rq->tg->load_avg));
- SEQ_printf(m, " .%-30s: %d\n", "tg->runnable_avg",
- atomic_read(&cfs_rq->tg->runnable_avg));
#endif
#endif
#ifdef CONFIG_CFS_BANDWIDTH
@@ -631,12 +625,11 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)

P(se.load.weight);
#ifdef CONFIG_SMP
- P(se.avg.runnable_avg_sum);
- P(se.avg.running_avg_sum);
- P(se.avg.avg_period);
- P(se.avg.load_avg_contrib);
- P(se.avg.utilization_avg_contrib);
- P(se.avg.decay_count);
+ P(se.avg.load_sum);
+ P(se.avg.util_sum);
+ P(se.avg.load_avg);
+ P(se.avg.util_avg);
+ P(se.avg.last_update_time);
#endif
P(policy);
P(prio);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 90292c67..01ffa95 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -283,9 +283,6 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
return grp->my_q;
}

-static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
- int force_update);
-
static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
{
if (!cfs_rq->on_list) {
@@ -305,8 +302,6 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
}

cfs_rq->on_list = 1;
- /* We should have no load, but we need to update last_decay. */
- update_cfs_rq_blocked_load(cfs_rq, 0);
}
}

@@ -664,19 +659,31 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
static int select_idle_sibling(struct task_struct *p, int cpu);
static unsigned long task_h_load(struct task_struct *p);

-static inline void __update_task_entity_contrib(struct sched_entity *se);
-static inline void __update_task_entity_utilization(struct sched_entity *se);
+/*
+ * We choose a half-life close to 1 scheduling period.
+ * Note: The tables below are dependent on this value.
+ */
+#define LOAD_AVG_PERIOD 32
+#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
+#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_MAX_AVG */

/* Give new task start runnable values to heavy its load in infant time */
void init_task_runnable_average(struct task_struct *p)
{
- u32 slice;
+ struct sched_avg *sa = &p->se.avg;

- slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
- p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice;
- p->se.avg.avg_period = slice;
- __update_task_entity_contrib(&p->se);
- __update_task_entity_utilization(&p->se);
+ sa->last_update_time = 0;
+ /*
+ * sched_avg's period_contrib should be strictly less then 1024, so
+ * we give it 1023 to make sure it is almost a period (1024us), and
+ * will definitely be update (after enqueue).
+ */
+ sa->period_contrib = 1023;
+ sa->load_avg = scale_load_down(p->se.load.weight);
+ sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
+ sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
+ sa->util_sum = LOAD_AVG_MAX;
+ /* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
}
#else
void init_task_runnable_average(struct task_struct *p)
@@ -1698,8 +1705,8 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
delta = runtime - p->last_sum_exec_runtime;
*period = now - p->last_task_numa_placement;
} else {
- delta = p->se.avg.runnable_avg_sum;
- *period = p->se.avg.avg_period;
+ delta = p->se.avg.load_sum / p->se.load.weight;
+ *period = LOAD_AVG_MAX;
}

p->last_sum_exec_runtime = runtime;
@@ -2347,13 +2354,13 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
long tg_weight;

/*
- * Use this CPU's actual weight instead of the last load_contribution
- * to gain a more accurate current total weight. See
- * __update_cfs_rq_tg_load_contrib().
+ * Use this CPU's real-time load instead of the last load contribution
+ * as the updating of the contribution is delayed, and we will use the
+ * the real-time load to calc the share. See update_tg_load_avg().
*/
tg_weight = atomic_long_read(&tg->load_avg);
- tg_weight -= cfs_rq->tg_load_contrib;
- tg_weight += cfs_rq->load.weight;
+ tg_weight -= cfs_rq->tg_load_avg_contrib;
+ tg_weight += cfs_rq->avg.load_avg;

return tg_weight;
}
@@ -2363,7 +2370,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
long tg_weight, load, shares;

tg_weight = calc_tg_weight(tg, cfs_rq);
- load = cfs_rq->load.weight;
+ load = cfs_rq->avg.load_avg;

shares = (tg->shares * load);
if (tg_weight)
@@ -2425,14 +2432,6 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
#endif /* CONFIG_FAIR_GROUP_SCHED */

#ifdef CONFIG_SMP
-/*
- * We choose a half-life close to 1 scheduling period.
- * Note: The tables below are dependent on this value.
- */
-#define LOAD_AVG_PERIOD 32
-#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
-#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_MAX_AVG */
-
/* Precomputed fixed inverse multiplies for multiplication by y^n */
static const u32 runnable_avg_yN_inv[] = {
0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6,
@@ -2481,9 +2480,8 @@ static __always_inline u64 decay_load(u64 val, u64 n)
local_n %= LOAD_AVG_PERIOD;
}

- val *= runnable_avg_yN_inv[local_n];
- /* We don't use SRR here since we always want to round down. */
- return val >> 32;
+ val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
+ return val;
}

/*
@@ -2542,23 +2540,22 @@ static u32 __compute_runnable_contrib(u64 n)
* load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
* = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
*/
-static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
- struct sched_avg *sa,
- int runnable,
- int running)
+static __always_inline int
+__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
+ unsigned long weight, int running)
{
u64 delta, periods;
- u32 runnable_contrib;
+ u32 contrib;
int delta_w, decayed = 0;
unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);

- delta = now - sa->last_runnable_update;
+ delta = now - sa->last_update_time;
/*
* This should only happen when time goes backwards, which it
* unfortunately does during sched clock init when we swap over to TSC.
*/
if ((s64)delta < 0) {
- sa->last_runnable_update = now;
+ sa->last_update_time = now;
return 0;
}

@@ -2569,26 +2566,26 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
delta >>= 10;
if (!delta)
return 0;
- sa->last_runnable_update = now;
+ sa->last_update_time = now;

/* delta_w is the amount already accumulated against our next period */
- delta_w = sa->avg_period % 1024;
+ delta_w = sa->period_contrib;
if (delta + delta_w >= 1024) {
- /* period roll-over */
decayed = 1;

+ /* how much left for next period will start over, we don't know yet */
+ sa->period_contrib = 0;
+
/*
* Now that we know we're crossing a period boundary, figure
* out how much from delta we need to complete the current
* period and accrue it.
*/
delta_w = 1024 - delta_w;
- if (runnable)
- sa->runnable_avg_sum += delta_w;
+ if (weight)
+ sa->load_sum += weight * delta_w;
if (running)
- sa->running_avg_sum += delta_w * scale_freq
- >> SCHED_CAPACITY_SHIFT;
- sa->avg_period += delta_w;
+ sa->util_sum += delta_w * scale_freq >> SCHED_CAPACITY_SHIFT;

delta -= delta_w;

@@ -2596,334 +2593,156 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
periods = delta / 1024;
delta %= 1024;

- sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
- periods + 1);
- sa->running_avg_sum = decay_load(sa->running_avg_sum,
- periods + 1);
- sa->avg_period = decay_load(sa->avg_period,
- periods + 1);
+ sa->load_sum = decay_load(sa->load_sum, periods + 1);
+ sa->util_sum = decay_load((u64)(sa->util_sum), periods + 1);

/* Efficiently calculate \sum (1..n_period) 1024*y^i */
- runnable_contrib = __compute_runnable_contrib(periods);
- if (runnable)
- sa->runnable_avg_sum += runnable_contrib;
+ contrib = __compute_runnable_contrib(periods);
+ if (weight)
+ sa->load_sum += weight * contrib;
if (running)
- sa->running_avg_sum += runnable_contrib * scale_freq
- >> SCHED_CAPACITY_SHIFT;
- sa->avg_period += runnable_contrib;
+ sa->util_sum += contrib * scale_freq >> SCHED_CAPACITY_SHIFT;
}

/* Remainder of delta accrued against u_0` */
- if (runnable)
- sa->runnable_avg_sum += delta;
+ if (weight)
+ sa->load_sum += weight * delta;
if (running)
- sa->running_avg_sum += delta * scale_freq
- >> SCHED_CAPACITY_SHIFT;
- sa->avg_period += delta;
-
- return decayed;
-}
-
-/* Synchronize an entity's decay with its parenting cfs_rq.*/
-static inline u64 __synchronize_entity_decay(struct sched_entity *se)
-{
- struct cfs_rq *cfs_rq = cfs_rq_of(se);
- u64 decays = atomic64_read(&cfs_rq->decay_counter);
+ sa->util_sum += delta * scale_freq >> SCHED_CAPACITY_SHIFT;

- decays -= se->avg.decay_count;
- se->avg.decay_count = 0;
- if (!decays)
- return 0;
+ sa->period_contrib += delta;

- se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
- se->avg.utilization_avg_contrib =
- decay_load(se->avg.utilization_avg_contrib, decays);
+ if (decayed) {
+ sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
+ sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;
+ }

- return decays;
+ return decayed;
}

#ifdef CONFIG_FAIR_GROUP_SCHED
-static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
- int force_update)
-{
- struct task_group *tg = cfs_rq->tg;
- long tg_contrib;
-
- tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
- tg_contrib -= cfs_rq->tg_load_contrib;
-
- if (!tg_contrib)
- return;
-
- if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
- atomic_long_add(tg_contrib, &tg->load_avg);
- cfs_rq->tg_load_contrib += tg_contrib;
- }
-}
-
/*
- * Aggregate cfs_rq runnable averages into an equivalent task_group
- * representation for computing load contributions.
+ * Updating tg's load_avg is necessary before update_cfs_share (which is done)
+ * and effective_load (which is not done because it is too costly).
*/
-static inline void __update_tg_runnable_avg(struct sched_avg *sa,
- struct cfs_rq *cfs_rq)
+static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
{
- struct task_group *tg = cfs_rq->tg;
- long contrib;
-
- /* The fraction of a cpu used by this cfs_rq */
- contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
- sa->avg_period + 1);
- contrib -= cfs_rq->tg_runnable_contrib;
+ long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

- if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
- atomic_add(contrib, &tg->runnable_avg);
- cfs_rq->tg_runnable_contrib += contrib;
- }
-}
-
-static inline void __update_group_entity_contrib(struct sched_entity *se)
-{
- struct cfs_rq *cfs_rq = group_cfs_rq(se);
- struct task_group *tg = cfs_rq->tg;
- int runnable_avg;
-
- u64 contrib;
-
- contrib = cfs_rq->tg_load_contrib * tg->shares;
- se->avg.load_avg_contrib = div_u64(contrib,
- atomic_long_read(&tg->load_avg) + 1);
-
- /*
- * For group entities we need to compute a correction term in the case
- * that they are consuming <1 cpu so that we would contribute the same
- * load as a task of equal weight.
- *
- * Explicitly co-ordinating this measurement would be expensive, but
- * fortunately the sum of each cpus contribution forms a usable
- * lower-bound on the true value.
- *
- * Consider the aggregate of 2 contributions. Either they are disjoint
- * (and the sum represents true value) or they are disjoint and we are
- * understating by the aggregate of their overlap.
- *
- * Extending this to N cpus, for a given overlap, the maximum amount we
- * understand is then n_i(n_i+1)/2 * w_i where n_i is the number of
- * cpus that overlap for this interval and w_i is the interval width.
- *
- * On a small machine; the first term is well-bounded which bounds the
- * total error since w_i is a subset of the period. Whereas on a
- * larger machine, while this first term can be larger, if w_i is the
- * of consequential size guaranteed to see n_i*w_i quickly converge to
- * our upper bound of 1-cpu.
- */
- runnable_avg = atomic_read(&tg->runnable_avg);
- if (runnable_avg < NICE_0_LOAD) {
- se->avg.load_avg_contrib *= runnable_avg;
- se->avg.load_avg_contrib >>= NICE_0_SHIFT;
+ if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
+ atomic_long_add(delta, &cfs_rq->tg->load_avg);
+ cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
}
}

#else /* CONFIG_FAIR_GROUP_SCHED */
-static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
- int force_update) {}
-static inline void __update_tg_runnable_avg(struct sched_avg *sa,
- struct cfs_rq *cfs_rq) {}
-static inline void __update_group_entity_contrib(struct sched_entity *se) {}
+static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) {}
#endif /* CONFIG_FAIR_GROUP_SCHED */

-static inline void __update_task_entity_contrib(struct sched_entity *se)
-{
- u32 contrib;
-
- /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
- contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
- contrib /= (se->avg.avg_period + 1);
- se->avg.load_avg_contrib = scale_load(contrib);
-}
+static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);

-/* Compute the current contribution to load_avg by se, return any delta */
-static long __update_entity_load_avg_contrib(struct sched_entity *se)
+/* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
+static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
{
- long old_contrib = se->avg.load_avg_contrib;
+ int decayed;
+ struct sched_avg *sa = &cfs_rq->avg;

- if (entity_is_task(se)) {
- __update_task_entity_contrib(se);
- } else {
- __update_tg_runnable_avg(&se->avg, group_cfs_rq(se));
- __update_group_entity_contrib(se);
+ if (atomic_long_read(&cfs_rq->removed_load_avg)) {
+ long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
+ sa->load_avg = max_t(long, sa->load_avg - r, 0);
+ sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
}

- return se->avg.load_avg_contrib - old_contrib;
-}
-
-
-static inline void __update_task_entity_utilization(struct sched_entity *se)
-{
- u32 contrib;
-
- /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
- contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
- contrib /= (se->avg.avg_period + 1);
- se->avg.utilization_avg_contrib = scale_load(contrib);
-}
+ if (atomic_long_read(&cfs_rq->removed_util_avg)) {
+ long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
+ sa->util_avg = max_t(long, sa->util_avg - r, 0);
+ sa->util_sum = max_t(s32, sa->util_sum -
+ ((r * LOAD_AVG_MAX) >> SCHED_LOAD_SHIFT), 0);
+ }

-static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
-{
- long old_contrib = se->avg.utilization_avg_contrib;
+ decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
+ scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL);

- if (entity_is_task(se))
- __update_task_entity_utilization(se);
- else
- se->avg.utilization_avg_contrib =
- group_cfs_rq(se)->utilization_load_avg;
-
- return se->avg.utilization_avg_contrib - old_contrib;
-}
+#ifndef CONFIG_64BIT
+ smp_wmb();
+ cfs_rq->load_last_update_time_copy = sa->last_update_time;
+#endif

-static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
- long load_contrib)
-{
- if (likely(load_contrib < cfs_rq->blocked_load_avg))
- cfs_rq->blocked_load_avg -= load_contrib;
- else
- cfs_rq->blocked_load_avg = 0;
+ return decayed;
}

-static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
-
-/* Update a sched_entity's runnable average */
-static inline void update_entity_load_avg(struct sched_entity *se,
- int update_cfs_rq)
+/* Update task and its cfs_rq load average */
+static inline void update_load_avg(struct sched_entity *se, int update_tg)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
- long contrib_delta, utilization_delta;
int cpu = cpu_of(rq_of(cfs_rq));
- u64 now;
+ u64 now = cfs_rq_clock_task(cfs_rq);

/*
- * For a group entity we need to use their owned cfs_rq_clock_task() in
- * case they are the parent of a throttled hierarchy.
+ * Track task load average for carrying it to new CPU after migrated, and
+ * track group sched_entity load average for task_h_load calc in migration
*/
- if (entity_is_task(se))
- now = cfs_rq_clock_task(cfs_rq);
- else
- now = cfs_rq_clock_task(group_cfs_rq(se));
+ __update_load_avg(now, cpu, &se->avg,
+ se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se);

- if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq,
- cfs_rq->curr == se))
- return;
-
- contrib_delta = __update_entity_load_avg_contrib(se);
- utilization_delta = __update_entity_utilization_avg_contrib(se);
-
- if (!update_cfs_rq)
- return;
-
- if (se->on_rq) {
- cfs_rq->runnable_load_avg += contrib_delta;
- cfs_rq->utilization_load_avg += utilization_delta;
- } else {
- subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
- }
+ if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
+ update_tg_load_avg(cfs_rq, 0);
}

-/*
- * Decay the load contributed by all blocked children and account this so that
- * their contribution may appropriately discounted when they wake up.
- */
-static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
+/* Add the load generated by se into cfs_rq's load average */
+static inline void
+enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- u64 now = cfs_rq_clock_task(cfs_rq) >> 20;
- u64 decays;
-
- decays = now - cfs_rq->last_decay;
- if (!decays && !force_update)
- return;
+ struct sched_avg *sa = &se->avg;
+ u64 now = cfs_rq_clock_task(cfs_rq);
+ int migrated = 0, decayed;

- if (atomic_long_read(&cfs_rq->removed_load)) {
- unsigned long removed_load;
- removed_load = atomic_long_xchg(&cfs_rq->removed_load, 0);
- subtract_blocked_load_contrib(cfs_rq, removed_load);
+ if (sa->last_update_time == 0) {
+ sa->last_update_time = now;
+ migrated = 1;
}
-
- if (decays) {
- cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
- decays);
- atomic64_add(decays, &cfs_rq->decay_counter);
- cfs_rq->last_decay = now;
+ else {
+ __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
+ se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se);
}

- __update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
-}
+ decayed = update_cfs_rq_load_avg(now, cfs_rq);

-/* Add the load generated by se into cfs_rq's child load-average */
-static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
- struct sched_entity *se,
- int wakeup)
-{
- /*
- * We track migrations using entity decay_count <= 0, on a wake-up
- * migration we use a negative decay count to track the remote decays
- * accumulated while sleeping.
- *
- * Newly forked tasks are enqueued with se->avg.decay_count == 0, they
- * are seen by enqueue_entity_load_avg() as a migration with an already
- * constructed load_avg_contrib.
- */
- if (unlikely(se->avg.decay_count <= 0)) {
- se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
- if (se->avg.decay_count) {
- /*
- * In a wake-up migration we have to approximate the
- * time sleeping. This is because we can't synchronize
- * clock_task between the two cpus, and it is not
- * guaranteed to be read-safe. Instead, we can
- * approximate this using our carried decays, which are
- * explicitly atomically readable.
- */
- se->avg.last_runnable_update -= (-se->avg.decay_count)
- << 20;
- update_entity_load_avg(se, 0);
- /* Indicate that we're now synchronized and on-rq */
- se->avg.decay_count = 0;
- }
- wakeup = 0;
- } else {
- __synchronize_entity_decay(se);
+ if (migrated) {
+ cfs_rq->avg.load_avg += sa->load_avg;
+ cfs_rq->avg.load_sum += sa->load_sum;
+ cfs_rq->avg.util_avg += sa->util_avg;
+ cfs_rq->avg.util_sum += sa->util_sum;
}

- /* migrated tasks did not contribute to our blocked load */
- if (wakeup) {
- subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
- update_entity_load_avg(se, 0);
- }
-
- cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
- cfs_rq->utilization_load_avg += se->avg.utilization_avg_contrib;
- /* we force update consideration on load-balancer moves */
- update_cfs_rq_blocked_load(cfs_rq, !wakeup);
+ if (decayed || migrated)
+ update_tg_load_avg(cfs_rq, 0);
}

/*
- * Remove se's load from this cfs_rq child load-average, if the entity is
- * transitioning to a blocked state we track its projected decay using
- * blocked_load_avg.
+ * Task first catches up with cfs_rq, and then subtract
+ * itself from the cfs_rq (task must be off the queue now).
*/
-static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
- struct sched_entity *se,
- int sleep)
+void remove_entity_load_avg(struct sched_entity *se)
{
- update_entity_load_avg(se, 1);
- /* we force update consideration on load-balancer moves */
- update_cfs_rq_blocked_load(cfs_rq, !sleep);
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ u64 last_update_time;
+
+#ifndef CONFIG_64BIT
+ u64 last_update_time_copy;

- cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
- cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
- if (sleep) {
- cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
- se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
- } /* migrations, e.g. sleep=0 leave decay_count == 0 */
+ do {
+ last_update_time_copy = cfs_rq->load_last_update_time_copy;
+ smp_rmb();
+ last_update_time = cfs_rq->avg.last_update_time;
+ } while (last_update_time != last_update_time_copy);
+#else
+ last_update_time = cfs_rq->avg.last_update_time;
+#endif
+
+ __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0);
+ atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
+ atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
}

/*
@@ -2948,16 +2767,10 @@ static int idle_balance(struct rq *this_rq);

#else /* CONFIG_SMP */

-static inline void update_entity_load_avg(struct sched_entity *se,
- int update_cfs_rq) {}
-static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
- struct sched_entity *se,
- int wakeup) {}
-static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
- struct sched_entity *se,
- int sleep) {}
-static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
- int force_update) {}
+static inline void update_load_avg(struct sched_entity *se, int update_tg) {}
+static inline void
+enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+static inline void remove_entity_load_avg(struct sched_entity *se) {}

static inline int idle_balance(struct rq *rq)
{
@@ -3089,7 +2902,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
- enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
+ enqueue_entity_load_avg(cfs_rq, se);
account_entity_enqueue(cfs_rq, se);
update_cfs_shares(cfs_rq);

@@ -3164,7 +2977,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
- dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
+ update_load_avg(se, 1);

update_stats_dequeue(cfs_rq, se);
if (flags & DEQUEUE_SLEEP) {
@@ -3254,7 +3067,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
update_stats_wait_end(cfs_rq, se);
__dequeue_entity(cfs_rq, se);
- update_entity_load_avg(se, 1);
+ update_load_avg(se, 1);
}

update_stats_curr_start(cfs_rq, se);
@@ -3354,7 +3167,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
/* Put 'current' back into the tree. */
__enqueue_entity(cfs_rq, prev);
/* in !on_rq case, update occurred at dequeue */
- update_entity_load_avg(prev, 1);
+ update_load_avg(prev, 0);
}
cfs_rq->curr = NULL;
}
@@ -3370,8 +3183,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
/*
* Ensure that runnable average is periodically updated.
*/
- update_entity_load_avg(curr, 1);
- update_cfs_rq_blocked_load(cfs_rq, 1);
+ update_load_avg(curr, 1);
update_cfs_shares(cfs_rq);

#ifdef CONFIG_SCHED_HRTICK
@@ -4244,8 +4056,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_throttled(cfs_rq))
break;

+ update_load_avg(se, 1);
update_cfs_shares(cfs_rq);
- update_entity_load_avg(se, 1);
}

if (!se)
@@ -4304,8 +4116,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_throttled(cfs_rq))
break;

+ update_load_avg(se, 1);
update_cfs_shares(cfs_rq);
- update_entity_load_avg(se, 1);
}

if (!se)
@@ -4444,7 +4256,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
static void update_idle_cpu_load(struct rq *this_rq)
{
unsigned long curr_jiffies = READ_ONCE(jiffies);
- unsigned long load = this_rq->cfs.runnable_load_avg;
+ unsigned long load = this_rq->cfs.avg.load_avg;
unsigned long pending_updates;

/*
@@ -4490,7 +4302,7 @@ void update_cpu_load_nohz(void)
*/
void update_cpu_load_active(struct rq *this_rq)
{
- unsigned long load = this_rq->cfs.runnable_load_avg;
+ unsigned long load = this_rq->cfs.avg.load_avg;
/*
* See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
*/
@@ -4501,7 +4313,7 @@ void update_cpu_load_active(struct rq *this_rq)
/* Used instead of source_load when we know the type == 0 */
static unsigned long weighted_cpuload(const int cpu)
{
- return cpu_rq(cpu)->cfs.runnable_load_avg;
+ return cpu_rq(cpu)->cfs.avg.load_avg;
}

/*
@@ -4551,7 +4363,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
unsigned long nr_running = READ_ONCE(rq->cfs.h_nr_running);
- unsigned long load_avg = rq->cfs.runnable_load_avg;
+ unsigned long load_avg = rq->cfs.avg.load_avg;

if (nr_running)
return load_avg / nr_running;
@@ -4670,7 +4482,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
/*
* w = rw_i + @wl
*/
- w = se->my_q->load.weight + wl;
+ w = se->my_q->avg.load_avg + wl;

/*
* wl = S * s'_i; see (2)
@@ -4691,7 +4503,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
/*
* wl = dw_i = S * (s'_i - s_i); see (3)
*/
- wl -= se->load.weight;
+ wl -= se->avg.load_avg;

/*
* Recursively apply this logic to all parent groups to compute
@@ -4761,14 +4573,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
*/
if (sync) {
tg = task_group(current);
- weight = current->se.load.weight;
+ weight = current->se.avg.load_avg;

this_load += effective_load(tg, this_cpu, -weight, -weight);
load += effective_load(tg, prev_cpu, 0, -weight);
}

tg = task_group(p);
- weight = p->se.load.weight;
+ weight = p->se.avg.load_avg;

/*
* In low-load situations, where prev_cpu is idle and this_cpu is idle
@@ -4961,12 +4773,12 @@ done:
* tasks. The unit of the return value must be the one of capacity so we can
* compare the usage with the capacity of the CPU that is available for CFS
* task (ie cpu_capacity).
- * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
+ * cfs.avg.util_avg is the sum of running time of runnable tasks on a
* CPU. It represents the amount of utilization of a CPU in the range
* [0..SCHED_LOAD_SCALE]. The usage of a CPU can't be higher than the full
* capacity of the CPU because it's about the running time on this CPU.
- * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
- * because of unfortunate rounding in avg_period and running_load_avg or just
+ * Nevertheless, cfs.avg.util_avg can be higher than SCHED_LOAD_SCALE
+ * because of unfortunate rounding in util_avg or just
* after migrating tasks until the average stabilizes with the new running
* time. So we need to check that the usage stays into the range
* [0..cpu_capacity_orig] and cap if necessary.
@@ -4975,7 +4787,7 @@ done:
*/
static int get_cpu_usage(int cpu)
{
- unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+ unsigned long usage = cpu_rq(cpu)->cfs.avg.util_avg;
unsigned long capacity = capacity_orig_of(cpu);

if (usage >= SCHED_LOAD_SCALE)
@@ -5084,26 +4896,22 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
* previous cpu. However, the caller only guarantees p->pi_lock is held; no
* other assumptions, including the state of rq->lock, should be made.
*/
-static void
-migrate_task_rq_fair(struct task_struct *p, int next_cpu)
+static void migrate_task_rq_fair(struct task_struct *p, int next_cpu)
{
- struct sched_entity *se = &p->se;
- struct cfs_rq *cfs_rq = cfs_rq_of(se);
-
/*
- * Load tracking: accumulate removed load so that it can be processed
- * when we next update owning cfs_rq under rq->lock. Tasks contribute
- * to blocked load iff they have a positive decay-count. It can never
- * be negative here since on-rq tasks have decay-count == 0.
+ * We are supposed to update the task to "current" time, then its up to date
+ * and ready to go to new CPU/cfs_rq. But we have difficulty in getting
+ * what current time is, so simply throw away the out-of-date time. This
+ * will result in the wakee task is less decayed, but giving the wakee more
+ * load sounds not bad.
*/
- if (se->avg.decay_count) {
- se->avg.decay_count = -__synchronize_entity_decay(se);
- atomic_long_add(se->avg.load_avg_contrib,
- &cfs_rq->removed_load);
- }
+ remove_entity_load_avg(&p->se);
+
+ /* Tell new CPU we are migrated */
+ p->se.avg.last_update_time = 0;

/* We have migrated, no longer consider this task hot */
- se->exec_start = 0;
+ p->se.exec_start = 0;
}
#endif /* CONFIG_SMP */

@@ -5966,36 +5774,6 @@ static void attach_tasks(struct lb_env *env)
}

#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
- * update tg->load_weight by folding this cpu's load_avg
- */
-static void __update_blocked_averages_cpu(struct task_group *tg, int cpu)
-{
- struct sched_entity *se = tg->se[cpu];
- struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
-
- /* throttled entities do not contribute to load */
- if (throttled_hierarchy(cfs_rq))
- return;
-
- update_cfs_rq_blocked_load(cfs_rq, 1);
-
- if (se) {
- update_entity_load_avg(se, 1);
- /*
- * We pivot on our runnable average having decayed to zero for
- * list removal. This generally implies that all our children
- * have also been removed (modulo rounding error or bandwidth
- * control); however, such cases are rare and we can fix these
- * at enqueue.
- *
- * TODO: fix up out-of-order children on enqueue.
- */
- if (!se->avg.runnable_avg_sum && !cfs_rq->nr_running)
- list_del_leaf_cfs_rq(cfs_rq);
- }
-}
-
static void update_blocked_averages(int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -6004,19 +5782,19 @@ static void update_blocked_averages(int cpu)

raw_spin_lock_irqsave(&rq->lock, flags);
update_rq_clock(rq);
+
/*
* Iterates the task_group tree in a bottom up fashion, see
* list_add_leaf_cfs_rq() for details.
*/
for_each_leaf_cfs_rq(rq, cfs_rq) {
- /*
- * Note: We may want to consider periodically releasing
- * rq->lock about these updates so that creating many task
- * groups does not result in continually extending hold time.
- */
- __update_blocked_averages_cpu(cfs_rq->tg, rq->cpu);
- }
+ /* throttled entities do not contribute to load */
+ if (throttled_hierarchy(cfs_rq))
+ continue;

+ if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq))
+ update_tg_load_avg(cfs_rq, 0);
+ }
raw_spin_unlock_irqrestore(&rq->lock, flags);
}

@@ -6044,14 +5822,13 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
}

if (!se) {
- cfs_rq->h_load = cfs_rq->runnable_load_avg;
+ cfs_rq->h_load = cfs_rq->avg.load_avg;
cfs_rq->last_h_load_update = now;
}

while ((se = cfs_rq->h_load_next) != NULL) {
load = cfs_rq->h_load;
- load = div64_ul(load * se->avg.load_avg_contrib,
- cfs_rq->runnable_load_avg + 1);
+ load = div64_ul(load * se->avg.load_avg, cfs_rq->avg.load_avg + 1);
cfs_rq = group_cfs_rq(se);
cfs_rq->h_load = load;
cfs_rq->last_h_load_update = now;
@@ -6063,8 +5840,8 @@ static unsigned long task_h_load(struct task_struct *p)
struct cfs_rq *cfs_rq = task_cfs_rq(p);

update_cfs_rq_h_load(cfs_rq);
- return div64_ul(p->se.avg.load_avg_contrib * cfs_rq->h_load,
- cfs_rq->runnable_load_avg + 1);
+ return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
+ cfs_rq->avg.load_avg + 1);
}
#else
static inline void update_blocked_averages(int cpu)
@@ -6073,7 +5850,7 @@ static inline void update_blocked_averages(int cpu)

static unsigned long task_h_load(struct task_struct *p)
{
- return p->se.avg.load_avg_contrib;
+ return p->se.avg.load_avg;
}
#endif

@@ -8071,15 +7848,18 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
}

#ifdef CONFIG_SMP
- /*
- * Remove our load from contribution when we leave sched_fair
- * and ensure we don't carry in an old decay_count if we
- * switch back.
- */
- if (se->avg.decay_count) {
- __synchronize_entity_decay(se);
- subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
- }
+ /* Catch up with the cfs_rq and remove our load when we leave */
+ __update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq), &se->avg,
+ se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se);
+
+ cfs_rq->avg.load_avg =
+ max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
+ cfs_rq->avg.load_sum =
+ max_t(s64, cfs_rq->avg.load_sum - se->avg.load_sum, 0);
+ cfs_rq->avg.util_avg =
+ max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
+ cfs_rq->avg.util_sum =
+ max_t(s32, cfs_rq->avg.util_sum - se->avg.util_sum, 0);
#endif
}

@@ -8136,8 +7916,8 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif
#ifdef CONFIG_SMP
- atomic64_set(&cfs_rq->decay_counter, 1);
- atomic_long_set(&cfs_rq->removed_load, 0);
+ atomic_long_set(&cfs_rq->removed_load_avg, 0);
+ atomic_long_set(&cfs_rq->removed_util_avg, 0);
#endif
}

@@ -8182,14 +7962,14 @@ static void task_move_group_fair(struct task_struct *p, int queued)
if (!queued) {
cfs_rq = cfs_rq_of(se);
se->vruntime += cfs_rq->min_vruntime;
+
#ifdef CONFIG_SMP
- /*
- * migrate_task_rq_fair() will have removed our previous
- * contribution, but we must synchronize for ongoing future
- * decay.
- */
- se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
- cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+ /* Virtually synchronize task with its new cfs_rq */
+ p->se.avg.last_update_time = cfs_rq->avg.last_update_time;
+ cfs_rq->avg.load_avg += p->se.avg.load_avg;
+ cfs_rq->avg.load_sum += p->se.avg.load_sum;
+ cfs_rq->avg.util_avg += p->se.avg.util_avg;
+ cfs_rq->avg.util_sum += p->se.avg.util_sum;
#endif
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e13210c..dcde941 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -245,7 +245,6 @@ struct task_group {

#ifdef CONFIG_SMP
atomic_long_t load_avg;
- atomic_t runnable_avg;
#endif
#endif

@@ -366,27 +365,18 @@ struct cfs_rq {

#ifdef CONFIG_SMP
/*
- * CFS Load tracking
- * Under CFS, load is tracked on a per-entity basis and aggregated up.
- * This allows for the description of both thread and group usage (in
- * the FAIR_GROUP_SCHED case).
- * runnable_load_avg is the sum of the load_avg_contrib of the
- * sched_entities on the rq.
- * blocked_load_avg is similar to runnable_load_avg except that its
- * the blocked sched_entities on the rq.
- * utilization_load_avg is the sum of the average running time of the
- * sched_entities on the rq.
+ * CFS load tracking
*/
- unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
- atomic64_t decay_counter;
- u64 last_decay;
- atomic_long_t removed_load;
-
+ struct sched_avg avg;
#ifdef CONFIG_FAIR_GROUP_SCHED
- /* Required to track per-cpu representation of a task_group */
- u32 tg_runnable_contrib;
- unsigned long tg_load_contrib;
+ unsigned long tg_load_avg_contrib;
+#endif
+ atomic_long_t removed_load_avg, removed_util_avg;
+#ifndef CONFIG_64BIT
+ u64 load_last_update_time_copy;
+#endif

+#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* h_load = weight * f(tg)
*

Subject: [tip:sched/core] sched/fair: Implement update_blocked_averages() for CONFIG_FAIR_GROUP_SCHED=n

Commit-ID: 6c1d47c0827304949e0eb9479f4d587f226fac8b
Gitweb: http://git.kernel.org/tip/6c1d47c0827304949e0eb9479f4d587f226fac8b
Author: Vincent Guittot <[email protected]>
AuthorDate: Wed, 15 Jul 2015 08:04:38 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 3 Aug 2015 12:24:28 +0200

sched/fair: Implement update_blocked_averages() for CONFIG_FAIR_GROUP_SCHED=n

The load and the utilization of idle CPUs must be updated periodically in
order to decay the blocked part.

If CONFIG_FAIR_GROUP_SCHED is not set, the load and util of idle cpus
are not decayed and stay at the values set before becoming idle.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Yuyang Du <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
[ Fixed up the SOB chain. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01ffa95..e4b80c6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5846,6 +5846,14 @@ static unsigned long task_h_load(struct task_struct *p)
#else
static inline void update_blocked_averages(int cpu)
{
+ struct rq *rq = cpu_rq(cpu);
+ struct cfs_rq *cfs_rq = &rq->cfs;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ update_rq_clock(rq);
+ update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
}

static unsigned long task_h_load(struct task_struct *p)

Subject: [tip:sched/core] sched/fair: Init cfs_rq' s sched_entity load average

Commit-ID: 540247fb5ddf6d2364f90387fa1f8f428d15e683
Gitweb: http://git.kernel.org/tip/540247fb5ddf6d2364f90387fa1f8f428d15e683
Author: Yuyang Du <[email protected]>
AuthorDate: Wed, 15 Jul 2015 08:04:39 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 3 Aug 2015 12:24:29 +0200

sched/fair: Init cfs_rq's sched_entity load average

The runnable load and utilization averages of cfs_rq's sched_entity
were not initiated. Like done to a task, give new cfs_rq' sched_entity
start values to heavy its load in infant time.

Signed-off-by: Yuyang Du <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 11 ++++++-----
kernel/sched/sched.h | 2 +-
3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3981526..5ca9ae0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2304,7 +2304,7 @@ void wake_up_new_task(struct task_struct *p)
#endif

/* Initialize new task's runnable average */
- init_task_runnable_average(p);
+ init_entity_runnable_average(&p->se);
rq = __task_rq_lock(p);
activate_task(rq, p, 0);
p->on_rq = TASK_ON_RQ_QUEUED;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4b80c6..f636db0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -667,10 +667,10 @@ static unsigned long task_h_load(struct task_struct *p);
#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_MAX_AVG */

-/* Give new task start runnable values to heavy its load in infant time */
-void init_task_runnable_average(struct task_struct *p)
+/* Give new sched_entity start runnable values to heavy its load in infant time */
+void init_entity_runnable_average(struct sched_entity *se)
{
- struct sched_avg *sa = &p->se.avg;
+ struct sched_avg *sa = &se->avg;

sa->last_update_time = 0;
/*
@@ -679,14 +679,14 @@ void init_task_runnable_average(struct task_struct *p)
* will definitely be update (after enqueue).
*/
sa->period_contrib = 1023;
- sa->load_avg = scale_load_down(p->se.load.weight);
+ sa->load_avg = scale_load_down(se->load.weight);
sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
sa->util_sum = LOAD_AVG_MAX;
/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
}
#else
-void init_task_runnable_average(struct task_struct *p)
+void init_entity_runnable_average(struct sched_entity *se)
{
}
#endif
@@ -8029,6 +8029,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)

init_cfs_rq(cfs_rq);
init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
+ init_entity_runnable_average(se);
}

return 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dcde941..4d139e0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1307,7 +1307,7 @@ extern void init_dl_task_timer(struct sched_dl_entity *dl_se);

unsigned long to_ratio(u64 period, u64 runtime);

-extern void init_task_runnable_average(struct task_struct *p);
+extern void init_entity_runnable_average(struct sched_entity *se);

static inline void add_nr_running(struct rq *rq, unsigned count)
{

Subject: [tip:sched/core] sched/fair: Remove task and group entity load when they are dead

Commit-ID: 1269557889b477e3e43ab99a21035ddf8f7cea4d
Gitweb: http://git.kernel.org/tip/1269557889b477e3e43ab99a21035ddf8f7cea4d
Author: Yuyang Du <[email protected]>
AuthorDate: Wed, 15 Jul 2015 08:04:40 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 3 Aug 2015 12:24:30 +0200

sched/fair: Remove task and group entity load when they are dead

When task exits or group is destroyed, the entity's load should be
removed from its parent cfs_rq's load. Otherwise, it will take time
for the parent cfs_rq to decay the dead entity's load to 0, which
is not desired.

Signed-off-by: Yuyang Du <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f636db0..5532bf3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4913,6 +4913,11 @@ static void migrate_task_rq_fair(struct task_struct *p, int next_cpu)
/* We have migrated, no longer consider this task hot */
p->se.exec_start = 0;
}
+
+static void task_dead_fair(struct task_struct *p)
+{
+ remove_entity_load_avg(&p->se);
+}
#endif /* CONFIG_SMP */

static unsigned long
@@ -7991,8 +7996,11 @@ void free_fair_sched_group(struct task_group *tg)
for_each_possible_cpu(i) {
if (tg->cfs_rq)
kfree(tg->cfs_rq[i]);
- if (tg->se)
+ if (tg->se) {
+ if (tg->se[i])
+ remove_entity_load_avg(tg->se[i]);
kfree(tg->se[i]);
+ }
}

kfree(tg->cfs_rq);
@@ -8179,6 +8187,7 @@ const struct sched_class fair_sched_class = {
.rq_offline = rq_offline_fair,

.task_waking = task_waking_fair,
+ .task_dead = task_dead_fair,
#endif

.set_curr_task = set_curr_task_fair,

Subject: [tip:sched/core] sched/fair: Provide runnable_load_avg back to cfs_rq

Commit-ID: 139622343ef31941effc6de6a5a9320371a00e62
Gitweb: http://git.kernel.org/tip/139622343ef31941effc6de6a5a9320371a00e62
Author: Yuyang Du <[email protected]>
AuthorDate: Wed, 15 Jul 2015 08:04:41 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 3 Aug 2015 12:24:31 +0200

sched/fair: Provide runnable_load_avg back to cfs_rq

The cfs_rq's load_avg is composed of runnable_load_avg and blocked_load_avg.
Before this series, sometimes the runnable_load_avg is used, and sometimes
the load_avg is used. Completely replacing all uses of runnable_load_avg
with load_avg may be too big a leap, i.e., the blocked_load_avg is concerned
to result in overrated load. Therefore, we get runnable_load_avg back.

The new cfs_rq's runnable_load_avg is improved to be updated with all of the
runnable sched_eneities at the same time, so the one sched_entity updated and
the others stale problem is solved.

Signed-off-by: Yuyang Du <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/debug.c | 2 ++
kernel/sched/fair.c | 55 ++++++++++++++++++++++++++++++++++++++++++----------
kernel/sched/sched.h | 2 ++
3 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 74f276f..6415117 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -207,6 +207,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
#ifdef CONFIG_SMP
SEQ_printf(m, " .%-30s: %lu\n", "load_avg",
cfs_rq->avg.load_avg);
+ SEQ_printf(m, " .%-30s: %lu\n", "runnable_load_avg",
+ cfs_rq->runnable_load_avg);
SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
cfs_rq->avg.util_avg);
SEQ_printf(m, " .%-30s: %ld\n", "removed_load_avg",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5532bf3..1a878d5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2542,7 +2542,7 @@ static u32 __compute_runnable_contrib(u64 n)
*/
static __always_inline int
__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
- unsigned long weight, int running)
+ unsigned long weight, int running, struct cfs_rq *cfs_rq)
{
u64 delta, periods;
u32 contrib;
@@ -2582,8 +2582,11 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
* period and accrue it.
*/
delta_w = 1024 - delta_w;
- if (weight)
+ if (weight) {
sa->load_sum += weight * delta_w;
+ if (cfs_rq)
+ cfs_rq->runnable_load_sum += weight * delta_w;
+ }
if (running)
sa->util_sum += delta_w * scale_freq >> SCHED_CAPACITY_SHIFT;

@@ -2594,19 +2597,29 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
delta %= 1024;

sa->load_sum = decay_load(sa->load_sum, periods + 1);
+ if (cfs_rq) {
+ cfs_rq->runnable_load_sum =
+ decay_load(cfs_rq->runnable_load_sum, periods + 1);
+ }
sa->util_sum = decay_load((u64)(sa->util_sum), periods + 1);

/* Efficiently calculate \sum (1..n_period) 1024*y^i */
contrib = __compute_runnable_contrib(periods);
- if (weight)
+ if (weight) {
sa->load_sum += weight * contrib;
+ if (cfs_rq)
+ cfs_rq->runnable_load_sum += weight * contrib;
+ }
if (running)
sa->util_sum += contrib * scale_freq >> SCHED_CAPACITY_SHIFT;
}

/* Remainder of delta accrued against u_0` */
- if (weight)
+ if (weight) {
sa->load_sum += weight * delta;
+ if (cfs_rq)
+ cfs_rq->runnable_load_sum += weight * delta;
+ }
if (running)
sa->util_sum += delta * scale_freq >> SCHED_CAPACITY_SHIFT;

@@ -2614,6 +2627,10 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,

if (decayed) {
sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
+ if (cfs_rq) {
+ cfs_rq->runnable_load_avg =
+ div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
+ }
sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;
}

@@ -2661,7 +2678,7 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
}

decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
- scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL);
+ scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);

#ifndef CONFIG_64BIT
smp_wmb();
@@ -2683,7 +2700,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
* track group sched_entity load average for task_h_load calc in migration
*/
__update_load_avg(now, cpu, &se->avg,
- se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se);
+ se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se, NULL);

if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
update_tg_load_avg(cfs_rq, 0);
@@ -2703,11 +2720,15 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
}
else {
__update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
- se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se);
+ se->on_rq * scale_load_down(se->load.weight),
+ cfs_rq->curr == se, NULL);
}

decayed = update_cfs_rq_load_avg(now, cfs_rq);

+ cfs_rq->runnable_load_avg += sa->load_avg;
+ cfs_rq->runnable_load_sum += sa->load_sum;
+
if (migrated) {
cfs_rq->avg.load_avg += sa->load_avg;
cfs_rq->avg.load_sum += sa->load_sum;
@@ -2719,6 +2740,18 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
update_tg_load_avg(cfs_rq, 0);
}

+/* Remove the runnable load generated by se from cfs_rq's runnable load average */
+static inline void
+dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ update_load_avg(se, 1);
+
+ cfs_rq->runnable_load_avg =
+ max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
+ cfs_rq->runnable_load_sum =
+ max_t(s64, cfs_rq->runnable_load_sum - se->avg.load_sum, 0);
+}
+
/*
* Task first catches up with cfs_rq, and then subtract
* itself from the cfs_rq (task must be off the queue now).
@@ -2740,7 +2773,7 @@ void remove_entity_load_avg(struct sched_entity *se)
last_update_time = cfs_rq->avg.last_update_time;
#endif

- __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0);
+ __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
}
@@ -2770,6 +2803,8 @@ static int idle_balance(struct rq *this_rq);
static inline void update_load_avg(struct sched_entity *se, int update_tg) {}
static inline void
enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+static inline void
+dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
static inline void remove_entity_load_avg(struct sched_entity *se) {}

static inline int idle_balance(struct rq *rq)
@@ -2977,7 +3012,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
- update_load_avg(se, 1);
+ dequeue_entity_load_avg(cfs_rq, se);

update_stats_dequeue(cfs_rq, se);
if (flags & DEQUEUE_SLEEP) {
@@ -7863,7 +7898,7 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
#ifdef CONFIG_SMP
/* Catch up with the cfs_rq and remove our load when we leave */
__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq), &se->avg,
- se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se);
+ se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se, NULL);

cfs_rq->avg.load_avg =
max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4d139e0..ab0b05c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -368,6 +368,8 @@ struct cfs_rq {
* CFS load tracking
*/
struct sched_avg avg;
+ u64 runnable_load_sum;
+ unsigned long runnable_load_avg;
#ifdef CONFIG_FAIR_GROUP_SCHED
unsigned long tg_load_avg_contrib;
#endif

Subject: [tip:sched/core] sched/fair: Clean up load average references

Commit-ID: 7ea241afbf4924c58d41078599f7a32ba49fb985
Gitweb: http://git.kernel.org/tip/7ea241afbf4924c58d41078599f7a32ba49fb985
Author: Yuyang Du <[email protected]>
AuthorDate: Wed, 15 Jul 2015 08:04:42 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 3 Aug 2015 12:24:32 +0200

sched/fair: Clean up load average references

For cfs_rq, we have load.weight, runnable_load_avg, and load_avg.
Clean up how they are used:

- First, as group sched_entity already largely uses load_avg, we now expand
to use load_avg in all cases.

- Second, for CPU-wide load balancing, we choose to use runnable_load_avg
in all cases, which is the same as before this series.

Signed-off-by: Yuyang Du <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++---------------
1 file changed, 29 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1a878d5..858b94a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -685,6 +685,9 @@ void init_entity_runnable_average(struct sched_entity *se)
sa->util_sum = LOAD_AVG_MAX;
/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
}
+
+static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
+static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq);
#else
void init_entity_runnable_average(struct sched_entity *se)
{
@@ -2360,7 +2363,7 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
*/
tg_weight = atomic_long_read(&tg->load_avg);
tg_weight -= cfs_rq->tg_load_avg_contrib;
- tg_weight += cfs_rq->avg.load_avg;
+ tg_weight += cfs_rq_load_avg(cfs_rq);

return tg_weight;
}
@@ -2370,7 +2373,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
long tg_weight, load, shares;

tg_weight = calc_tg_weight(tg, cfs_rq);
- load = cfs_rq->avg.load_avg;
+ load = cfs_rq_load_avg(cfs_rq);

shares = (tg->shares * load);
if (tg_weight)
@@ -2796,6 +2799,16 @@ void idle_exit_fair(struct rq *this_rq)
{
}

+static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->runnable_load_avg;
+}
+
+static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->avg.load_avg;
+}
+
static int idle_balance(struct rq *this_rq);

#else /* CONFIG_SMP */
@@ -4270,6 +4283,12 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
sched_avg_update(this_rq);
}

+/* Used instead of source_load when we know the type == 0 */
+static unsigned long weighted_cpuload(const int cpu)
+{
+ return cfs_rq_runnable_load_avg(&cpu_rq(cpu)->cfs);
+}
+
#ifdef CONFIG_NO_HZ_COMMON
/*
* There is no sane way to deal with nohz on smp when using jiffies because the
@@ -4291,7 +4310,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
static void update_idle_cpu_load(struct rq *this_rq)
{
unsigned long curr_jiffies = READ_ONCE(jiffies);
- unsigned long load = this_rq->cfs.avg.load_avg;
+ unsigned long load = weighted_cpuload(cpu_of(this_rq));
unsigned long pending_updates;

/*
@@ -4337,7 +4356,7 @@ void update_cpu_load_nohz(void)
*/
void update_cpu_load_active(struct rq *this_rq)
{
- unsigned long load = this_rq->cfs.avg.load_avg;
+ unsigned long load = weighted_cpuload(cpu_of(this_rq));
/*
* See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
*/
@@ -4345,12 +4364,6 @@ void update_cpu_load_active(struct rq *this_rq)
__update_cpu_load(this_rq, load, 1);
}

-/* Used instead of source_load when we know the type == 0 */
-static unsigned long weighted_cpuload(const int cpu)
-{
- return cpu_rq(cpu)->cfs.avg.load_avg;
-}
-
/*
* Return a low guess at the load of a migration-source cpu weighted
* according to the scheduling class and "nice" value.
@@ -4398,7 +4411,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
unsigned long nr_running = READ_ONCE(rq->cfs.h_nr_running);
- unsigned long load_avg = rq->cfs.avg.load_avg;
+ unsigned long load_avg = weighted_cpuload(cpu);

if (nr_running)
return load_avg / nr_running;
@@ -4517,7 +4530,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
/*
* w = rw_i + @wl
*/
- w = se->my_q->avg.load_avg + wl;
+ w = cfs_rq_load_avg(se->my_q) + wl;

/*
* wl = S * s'_i; see (2)
@@ -5862,13 +5875,14 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
}

if (!se) {
- cfs_rq->h_load = cfs_rq->avg.load_avg;
+ cfs_rq->h_load = cfs_rq_load_avg(cfs_rq);
cfs_rq->last_h_load_update = now;
}

while ((se = cfs_rq->h_load_next) != NULL) {
load = cfs_rq->h_load;
- load = div64_ul(load * se->avg.load_avg, cfs_rq->avg.load_avg + 1);
+ load = div64_ul(load * se->avg.load_avg,
+ cfs_rq_load_avg(cfs_rq) + 1);
cfs_rq = group_cfs_rq(se);
cfs_rq->h_load = load;
cfs_rq->last_h_load_update = now;
@@ -5881,7 +5895,7 @@ static unsigned long task_h_load(struct task_struct *p)

update_cfs_rq_h_load(cfs_rq);
return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
- cfs_rq->avg.load_avg + 1);
+ cfs_rq_load_avg(cfs_rq) + 1);
}
#else
static inline void update_blocked_averages(int cpu)