Hi all,
The attached series is an RFC on implementing load tracking at the entity
level instead of the cfs_rq level. This results in a bottom-up load computation
in which entities contribute to their parent's load, as opposed to the current
top-down approach in which the parent averages its children. In particular,
this allows us to correctly migrate load with its accompanying entities and
provides the necessary inputs for intelligent load-balancing and
power-management.
It was previously well tested and stable, but that was on v3.1; there have
been some fairly extensive changes in the wake-up path since then, so apologies
if anything was broken in the rebase. Note also that, since this is an RFC on
the approach, I have not yet de-linted the various CONFIG combinations for
introduced compiler errors.
Background:
----------
We currently track the load average at the parenting cfs_rq level. We divide
execution into a series of consecutive "windows", w_i. Within each window we
track:
  \Sum load_i * t_i, where \Sum t_i = w and each load_i is a disjoint load level.
The load average is then \Sum w_j / 2^j.
This works reasonably well, but there are a few problems:
1) Since decay occurs at window boundaries there are 'skews':
At a window boundary the 'foreground' time has a bias against the
time immediately preceding it (as a result of the folding division),
e.g.  __xx_|_yyyy___ vs __xx_yyyy_|___ (where '|' is a window boundary).
The accounting here is 2x + 4y/2 or 2x + 4y, depending on which side of the
window boundary your load lands on.
2) Everything within a window 'w' is accounted equally; we only fold at
the boundaries. This actually means you can't set 'w' large enough to
really have meaningful coverage of the sched period without throwing
decay out the window. But then, with w/2 << sched_period (currently
w/2 = 5ms), the average ends up having fairly drastic swings even with
stable loads.
(Note: Before we even looked at migrating to per-entity tracking, we evaluated
not folding the foreground window into the considered average until it was
"complete"; this represented a measurable improvement in stability under
predictable load.)
3) Since the load sum average is per-cfs_rq and not per-entity, when a task
entity migrates we lose its contribution to the load-sum and effectively
double count it while its former sum decays.
New approach:
-------------
Instead of tracking load on a per-cfs_rq basis we do it on a per-sched_entity
basis. The load sum for a cfs_rq is then just the sum of its children's load
averages. The immediately obvious nice property is that when we migrate thread
A from cpu1 --> cpu2, we carry its load with it, leaving the global load sum
unmodified.
The 'windows' above are replaced with more fine-grained tracking, that is (for
an entity j):
  runnable_j = \Sum u_i * y^i ,  load_avg_j = runnable_j * weight_j [*]
Where u_i is the usage in the i-th most recent ~1ms period and y is chosen such
that y^k = 1/2 for k ~= sched_period (we currently choose k = 32). This means
that load tracked 1 sched_period ago contributes about 50% as much as current
execution.
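To make the accumulation concrete, here is a minimal stand-alone C sketch
(illustrative only, not part of the series; it reuses the 4008/4096
fixed-point approximation of y that one of the patches below introduces):

#include <stdio.h>
#include <stdint.h>

/* multiply val by y^n, with y ~= 4008/4096 so that y^32 ~= 1/2 */
static uint64_t decay(uint64_t val, int n)
{
	while (n-- && val) {
		val *= 4008;
		val >>= 12;
	}
	return val;
}

int main(void)
{
	/* u_i: portion (out of 1024us) the entity was runnable in each period */
	unsigned int usage[] = { 1024, 1024, 512, 0, 0, 256, 1024, 768 };
	uint64_t runnable_sum = 0, period_sum = 0;
	unsigned int i;

	for (i = 0; i < sizeof(usage) / sizeof(usage[0]); i++) {
		/* a new period rolls over: age the existing history by y */
		runnable_sum = decay(runnable_sum, 1) + usage[i];
		period_sum = decay(period_sum, 1) + 1024;
	}

	/* load_avg_j = weight_j * runnable_sum / period_sum */
	printf("runnable fraction ~= %.3f\n",
	       (double)runnable_sum / period_sum);
	return 0;
}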
Now, the real challenge in tracking on an entity basis is handling blocked
entities. Obviously, runnable entities will be updated naturally as we iterate
through them, but there could be O(many) blocked entities, so we can't afford
to iterate through them and update their tracking.
That our decay for a unit period is exponential introduces a particularly nice
property here:
We can separate the load contributed to a cfs_rq into blocked and runnable.
Runnable load is just the sum of all load_avg_j above, maintained by the
entities themselves as they run and self-update (when they update their
load_avg they update the cumulative sum also).
Blocked load then looks like:
  blocked_load = \Sum load_avg_j, where load_avg_j = weight_j * \Sum u[j]_n * y^n
To account for an entirely idle period we then only need to multiply the
blocked sum by y.
This ends up being orders of magnitude more accurate than the current
tracking scheme, even with the old shares_window massively dilated to
better capture a stable load state. It's also obviously stable under
migration.
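A sketch of that idea (again illustrative user-space C, not the kernel code;
decay() is the helper from the sketch above): rather than touching every
sleeper, we decay the aggregate once per elapsed period and remember how many
decays have occurred, so a waking entity can first catch its own contribution
up and then subtract it:

#include <stdint.h>

uint64_t decay(uint64_t val, int n);	/* multiplies val by y^n, as above */

struct cfs_rq_sketch {
	uint64_t blocked_load_avg;	/* \Sum decayed load_avg_j of sleepers */
	uint64_t decay_counter;		/* total periods decayed so far */
};

struct se_sketch {
	uint64_t load_avg_contrib;
	uint64_t decay_count;		/* decay_counter recorded at sleep time */
};

static void account_idle_periods(struct cfs_rq_sketch *rq, int periods)
{
	/* one multiply covers every blocked child at once */
	rq->blocked_load_avg = decay(rq->blocked_load_avg, periods);
	rq->decay_counter += periods;
}

static void wake(struct cfs_rq_sketch *rq, struct se_sketch *se)
{
	/* apply the decays this entity slept through, then remove it */
	int missed = rq->decay_counter - se->decay_count;

	se->load_avg_contrib = decay(se->load_avg_contrib, missed);
	rq->blocked_load_avg -= se->load_avg_contrib;
}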
As referenced above, this also allows us to potentially improve decisions
within the load-balancer, for both distribution and power-management.
Example: consider 1x80% task and 2x40% tasks on a 2-core machine. It's
currently a bit of a gamble as to whether you get an {AB, B} or {A, BB} split,
since they have equal weight (assume 1024). With per-task tracking we can
actually consider them at their contributed weight and see a stable
~{800, {400, 400}} load split. Likewise, within balance_tasks we can consider
the load migrated to be that actually contributed.
Thanks,
- Paul
---
Ben Segall (1):
sched: maintain per-rq runnable averages
Paul Turner (13):
sched: track the runnable average on a per-task entity basis
sched: aggregate load contributed by task entities on parenting cfs_rq
sched: maintain the load contribution of blocked entities
sched: account for blocked load waking back up
sched: aggregate total task_group load
sched: compute load contribution by a group entity
sched: normalize tg load contributions against runnable time
sched: maintain runnable averages across throttled periods
sched: replace update_shares weight distribution with per-entity computation
sched: refactor update_shares_cpu() -> update_blocked_avgs()
sched: update_cfs_shares at period edge
sched: make __update_entity_runnable_avg() fast
sched: implement usage tracking
include/linux/sched.h | 11 +
kernel/sched/core.c | 3
kernel/sched/debug.c | 37 ++-
kernel/sched/fair.c | 714 ++++++++++++++++++++++++++++++++++++++-----------
kernel/sched/sched.h | 29 +-
5 files changed, 611 insertions(+), 183 deletions(-)
Maintain a global running sum of the average load seen on each cfs_rq belonging
to each task group so that it may be used in calculating an appropriate
shares:weight distribution.
Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Ben Segall <[email protected]>
---
kernel/sched/debug.c | 4 ++++
kernel/sched/fair.c | 17 +++++++++++++++++
kernel/sched/sched.h | 2 ++
3 files changed, 23 insertions(+), 0 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index a638d9d..f6227c0 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -228,6 +228,10 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->runnable_load_avg);
SEQ_printf(m, " .%-30s: %lld\n", "blocked_load_avg",
cfs_rq->blocked_load_avg);
+ SEQ_printf(m, " .%-30s: %ld\n", "tg_load_avg",
+ atomic64_read(&cfs_rq->tg->load_avg));
+ SEQ_printf(m, " .%-30s: %lld\n", "tg_load_contrib",
+ cfs_rq->tg_load_contrib);
#endif
print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c9a8f6d..7771003 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1111,6 +1111,21 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
return se->avg.load_avg_contrib - old_contrib;
}
+static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
+ int force_update)
+{
+ struct task_group *tg = cfs_rq->tg;
+ s64 tg_contrib;
+
+ tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
+ tg_contrib -= cfs_rq->tg_load_contrib;
+
+ if (force_update || abs64(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
+ atomic64_add(tg_contrib, &tg->load_avg);
+ cfs_rq->tg_load_contrib += tg_contrib;
+ }
+}
+
static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
long load_contrib)
{
@@ -1166,6 +1181,8 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
atomic64_add(decays, &cfs_rq->decay_counter);
cfs_rq->last_decay = now;
}
+
+ __update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
}
static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f45b49..17f99e7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -116,6 +116,7 @@ struct task_group {
unsigned long shares;
atomic_t load_weight;
+ atomic64_t load_avg;
#endif
#ifdef CONFIG_RT_GROUP_SCHED
@@ -272,6 +273,7 @@ struct cfs_rq {
unsigned long load_contribution;
u64 runnable_load_avg, blocked_load_avg;
+ u64 tg_load_contrib;
atomic64_t decay_counter, removed_load;
u64 last_decay;
#endif /* CONFIG_SMP */
Instead of tracking the average of the load parented by a cfs_rq, we can track
entity load directly, with the load for a given cfs_rq then being the sum of
its children's loads.
To do this we represent the historical contribution to runnable average within
each trailing 1024us period of execution as the coefficients of a geometric
series.
We can express this for a given task t as:
  runnable_sum(t) = \Sum u_i * y^i ,
  load(t) = weight_t * runnable_sum(t) / (\Sum 1024 * y^i)
Where u_i is the usage in the i-th most recent 1024us period (approximately
1ms) and y is chosen such that y^k = 1/2. We currently choose k to be 32, which
roughly translates to about a sched_period.
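As a rough, illustrative sanity check of the above: a task that has been
continuously runnable has runnable_sum(t) = \Sum 1024 * y^i, so load(t) ~=
weight_t; a task that ran continuously until ~32ms (one half-life) ago and has
been blocked since retains only the tail \Sum_{i>=32} 1024 * y^i, roughly half
the full series, so load(t) ~= weight_t / 2.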
Signed-off-by: Paul Turner <[email protected]>
---
include/linux/sched.h | 7 +++
kernel/sched/debug.c | 2 +
kernel/sched/fair.c | 114 +++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 123 insertions(+), 0 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a5be381..91599c8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1156,6 +1156,11 @@ struct load_weight {
unsigned long weight, inv_weight;
};
+struct sched_avg {
+ u64 runnable_avg_sum, runnable_avg_period;
+ u64 last_runnable_update;
+};
+
#ifdef CONFIG_SCHEDSTATS
struct sched_statistics {
u64 wait_start;
@@ -1215,6 +1220,8 @@ struct sched_entity {
struct cfs_rq *cfs_rq;
/* rq "owned" by this entity/group: */
struct cfs_rq *my_q;
+
+ struct sched_avg avg;
#endif
};
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 09acaa1..d89db32 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -85,6 +85,8 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
P(se->statistics.wait_count);
#endif
P(se->load.weight);
+ P(se->avg.runnable_avg_sum);
+ P(se->avg.runnable_avg_period);
#undef PN
#undef P
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e77a6b..a570e9c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -988,6 +988,108 @@ static inline void update_entity_shares_tick(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+/*
+ * Approximate:
+ * val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
+ */
+static __always_inline u64 decay_load(u64 val, int n)
+{
+ for (;n && val;n--) {
+ val *= 4008;
+ val >>= 12;
+ }
+
+ return val;
+}
+
+/* We can represent the historical contribution to runnable average as the
+ * coefficients of a geometric series. To do this we sub-divide our runnable
+ * history into segments of approximately 1ms (1024us); label the segment that
+ * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
+ *
+ * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
+ *          p0            p1           p2
+ * (now) (~1ms ago) (~2ms ago)
+ *
+ * Let u_i denote the fraction of p_i that the entity was runnable.
+ *
+ * We can then represent historical load-average using u_i as the
+ * coefficients of a geometric series:
+ * u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
+ * (Taking the sum over the equivalently decayed period)
+ *
+ * We choose y to give a half-life of approximately one scheduling period:
+ * y^32 = 0.5
+ * This means that the contribution to load ~32ms ago will be weighted
+ * approximately half as much as the contribution to load within the last ms.
+ *
+ * When a period "rolls over" and we have new u_0`, we can multiply the
+ * previous sum again by y to update:
+ * load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
+ *            = u_0` + u_1*y + u_2*y^2 + ...  [re-labeling u_i --> u_{i+1}]
+ */
+static __always_inline int __update_entity_runnable_avg(u64 now,
+ struct sched_avg *sa,
+ int runnable)
+{
+ u64 delta;
+ int delta_w, decayed = 0;
+
+ delta = now - sa->last_runnable_update;
+ if((s64)delta < 0) {
+ sa->last_runnable_update = now;
+ return 0;
+ }
+
+ /*
+ * Use 1024us as the unit of measurement since it's a reasonable
+ * approximation of 1ms and fast to compute.
+ */
+ delta >>= 10;
+ if (!delta)
+ return 0;
+ sa->last_runnable_update = now;
+
+ delta_w = sa->runnable_avg_period % 1024;
+ if (delta + delta_w >= 1024) {
+ /* period roll-over */
+ decayed = 1;
+
+ delta_w = 1024 - delta_w;
+ BUG_ON(delta_w > delta);
+ do {
+ if (runnable)
+ sa->runnable_avg_sum += delta_w;
+ sa->runnable_avg_period += delta_w;
+
+ sa->runnable_avg_sum =
+ decay_load(sa->runnable_avg_sum, 1);
+ sa->runnable_avg_period =
+ decay_load(sa->runnable_avg_period, 1);
+
+ delta -= delta_w;
+ delta_w = 1024;
+ } while (delta >= 1024);
+ }
+
+ if (runnable)
+ sa->runnable_avg_sum += delta;
+ sa->runnable_avg_period += delta;
+
+ return decayed;
+}
+
+/* Update a sched_entity's runnable average */
+static inline void update_entity_load_avg(struct sched_entity *se)
+{
+ __update_entity_runnable_avg(rq_of(cfs_rq_of(se))->clock_task, &se->avg,
+ se->on_rq);
+}
+#else
+static inline void update_entity_load_avg(struct sched_entity *se) {}
+#endif
+
static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
#ifdef CONFIG_SCHEDSTATS
@@ -1112,6 +1214,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
*/
update_curr(cfs_rq);
update_cfs_load(cfs_rq, 0);
+ update_entity_load_avg(se);
account_entity_enqueue(cfs_rq, se);
update_cfs_shares(cfs_rq);
@@ -1186,6 +1289,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
+ update_entity_load_avg(se);
update_stats_dequeue(cfs_rq, se);
if (flags & DEQUEUE_SLEEP) {
@@ -1355,6 +1459,8 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
update_stats_wait_start(cfs_rq, prev);
/* Put 'current' back into the tree. */
__enqueue_entity(cfs_rq, prev);
+ /* in !on_rq case, update occurred at dequeue */
+ update_entity_load_avg(prev);
}
cfs_rq->curr = NULL;
}
@@ -1367,6 +1473,14 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
*/
update_curr(cfs_rq);
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+ /*
+ * Ensure that runnable average is periodically updated.
+ */
+ if (likely(curr->avg.last_runnable_update)) {
+ update_entity_load_avg(curr);
+ }
+#endif
/*
* Update share accounting for long-running entities.
*/
We are currently maintaining:
runnable_load(cfs_rq) = \Sum task_load(t)
For all runnable children t of cfs_rq. While this can be naturally updated for
tasks in a runnable state (as they are scheduled), it does not account for
the load contributed by blocked task entities.
This can be solved by introducing a separate accounting for blocked load:
blocked_load(cfs_rq) = \Sum runnable(b) * weight(b)
Obviously we do not want to iterate over all blocked entities to account for
their decay, we instead observe that:
runnable_load(t) = \Sum u_i * y^i
and that to account for an additional idle period we only need to compute:
y*runnable_load(t).
This means that we can decay all blocked entities at once by evaluating:
blocked_load(cfs_rq)` = y * blocked_load(cfs_rq)
Finally we maintain a decay counter so that when a sleeping entity re-awakens
we can determine how much of its load should be removed from the blocked sum.
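To illustrate why the single multiplication is exact: if entities b1 and b2
block with contributions L1 and L2, then after one idle period each would
individually decay to y*L1 and y*L2, and y*L1 + y*L2 = y*(L1 + L2) =
y*blocked_load(cfs_rq). The per-entity decay counter delta then records how
many such multiplications a sleeper missed, so on wake-up it can decay its own
contribution by the same factor before that contribution is subtracted from
the blocked sum.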
Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Ben Segall <[email protected]>
---
include/linux/sched.h | 2 -
kernel/sched/core.c | 3 +
kernel/sched/debug.c | 3 +
kernel/sched/fair.c | 119 ++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 4 +-
5 files changed, 116 insertions(+), 15 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f2999f0..70eae51 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1158,7 +1158,7 @@ struct load_weight {
struct sched_avg {
u64 runnable_avg_sum, runnable_avg_period;
- u64 last_runnable_update;
+ u64 last_runnable_update, decay_count;
unsigned long load_avg_contrib;
};
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d7c4322..79c3e31 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1683,6 +1683,9 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+ p->se.avg.decay_count = 0;
+#endif
#ifdef CONFIG_SCHEDSTATS
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
#endif
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 5a55d26..a638d9d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -94,6 +94,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
P(se->avg.runnable_avg_sum);
P(se->avg.runnable_avg_period);
P(se->avg.load_avg_contrib);
+ P(se->avg.decay_count);
#undef PN
#undef P
}
@@ -225,6 +226,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
atomic_read(&cfs_rq->tg->load_weight));
SEQ_printf(m, " .%-30s: %lld\n", "runnable_load_avg",
cfs_rq->runnable_load_avg);
+ SEQ_printf(m, " .%-30s: %lld\n", "blocked_load_avg",
+ cfs_rq->blocked_load_avg);
#endif
print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcdad5d..cc4ec4b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1080,6 +1080,20 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
return decayed;
}
+/* Synchronize an entity's decay with its parenting cfs_rq. */
+static inline void __synchronize_entity_decay(struct sched_entity *se)
+{
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ u64 decays = atomic64_read(&cfs_rq->decay_counter);
+
+ decays -= se->avg.decay_count;
+ if (!decays)
+ return;
+
+ se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
+ se->avg.decay_count += decays;
+}
+
/* Compute the current contribution to load_avg by se, return any delta */
static long __update_entity_load_avg_contrib(struct sched_entity *se)
{
@@ -1095,8 +1109,18 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
return se->avg.load_avg_contrib - old_contrib;
}
+static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
+ long load_contrib)
+{
+ if (likely(load_contrib < cfs_rq->blocked_load_avg))
+ cfs_rq->blocked_load_avg -= load_contrib;
+ else
+ cfs_rq->blocked_load_avg = 0;
+}
+
/* Update a sched_entity's runnable average */
-static inline void update_entity_load_avg(struct sched_entity *se)
+static inline void update_entity_load_avg(struct sched_entity *se,
+ int update_cfs_rq)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
long contrib_delta;
@@ -1106,8 +1130,34 @@ static inline void update_entity_load_avg(struct sched_entity *se)
return;
contrib_delta = __update_entity_load_avg_contrib(se);
+
+ if (!update_cfs_rq)
+ return;
+
if (se->on_rq)
cfs_rq->runnable_load_avg += contrib_delta;
+ else
+ subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
+}
+
+/*
+ * Decay the load contributed by all blocked children and account this so
+ * that their contribution may be appropriately discounted when they wake up.
+ */
+static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq)
+{
+ u64 now = rq_of(cfs_rq)->clock_task >> 20;
+ u64 decays;
+
+ decays = now - cfs_rq->last_decay;
+ if (!decays)
+ return;
+
+ cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
+ decays);
+ atomic64_add(decays, &cfs_rq->decay_counter);
+
+ cfs_rq->last_decay = now;
}
static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
@@ -1117,27 +1167,47 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
/* Add the load generated by se into cfs_rq's child load-average */
static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
- struct sched_entity *se)
+ struct sched_entity *se,
+ int wakeup)
{
- update_entity_load_avg(se);
+ __synchronize_entity_decay(se);
+
+ if (wakeup)
+ subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+
+ update_entity_load_avg(se, 0);
cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+ update_cfs_rq_blocked_load(cfs_rq);
}
-/* Remove se's load from this cfs_rq child load-average */
+/*
+ * Remove se's load from this cfs_rq child load-average, if the entity is
+ * transitioning to a blocked state we track its projected decay using
+ * blocked_load_avg.
+ */
static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
- struct sched_entity *se)
+ struct sched_entity *se,
+ int sleep)
{
- update_entity_load_avg(se);
+ update_entity_load_avg(se, 1);
+
cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+ if (sleep) {
+ cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+ se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
+ }
}
#else
-static inline void update_entity_load_avg(struct sched_entity *se) {}
+static inline void update_entity_load_avg(struct sched_entity *se,
+ int update_cfs_rq) {}
static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
- struct sched_entity *se {}
+ struct sched_entity *se,
+ int wakeup) {}
static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
- struct sched_entity *se) {}
+ struct sched_entity *se,
+ int sleep) {}
#endif
static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -1264,7 +1334,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
*/
update_curr(cfs_rq);
update_cfs_load(cfs_rq, 0);
- enqueue_entity_load_avg(se);
+ enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
account_entity_enqueue(cfs_rq, se);
update_cfs_shares(cfs_rq);
@@ -1339,7 +1409,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
- dequeue_entity_load_avg(cfs_rq, se);
+ dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
update_stats_dequeue(cfs_rq, se);
if (flags & DEQUEUE_SLEEP) {
@@ -1510,7 +1580,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
/* Put 'current' back into the tree. */
__enqueue_entity(cfs_rq, prev);
/* in !on_rq case, update occurred at dequeue */
- update_entity_load_avg(prev);
+ update_entity_load_avg(prev, 1);
}
cfs_rq->curr = NULL;
}
@@ -1528,9 +1598,11 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
* Ensure that runnable average is periodically updated.
*/
if (likely(curr->avg.last_runnable_update)) {
- update_entity_load_avg(curr);
+ update_entity_load_avg(curr, 1);
+ update_cfs_rq_blocked_load(cfs_rq);
}
#endif
+
/*
* Update share accounting for long-running entities.
*/
@@ -2388,6 +2460,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_load(cfs_rq, 0);
update_cfs_shares(cfs_rq);
+ update_entity_load_avg(se, 1);
}
if (!se) {
@@ -2449,6 +2522,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_load(cfs_rq, 0);
update_cfs_shares(cfs_rq);
+ update_entity_load_avg(se, 1);
}
if (!se) {
@@ -3473,6 +3547,7 @@ static int update_shares_cpu(struct task_group *tg, int cpu)
update_rq_clock(rq);
update_cfs_load(cfs_rq, 1);
+ update_cfs_rq_blocked_load(cfs_rq);
/*
* We need to update shares after updating tg->load_weight in
@@ -5478,6 +5553,21 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
place_entity(cfs_rq, se, 0);
se->vruntime -= cfs_rq->min_vruntime;
}
+
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+ /*
+ * Remove our load from contribution when we leave sched_fair
+ * and ensure we don't carry in an old decay_count if we
+ * switch back.
+ */
+ if (p->se.avg.decay_count) {
+ struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
+ __synchronize_entity_decay(&p->se);
+ subtract_blocked_load_contrib(cfs_rq,
+ p->se.avg.load_avg_contrib);
+ p->se.avg.decay_count = 0;
+ }
+#endif
}
/*
@@ -5525,6 +5615,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#ifndef CONFIG_64BIT
cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+ atomic64_set(&cfs_rq->decay_counter, 1);
+#endif
}
#ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 77a3427..2c19c26 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -271,7 +271,9 @@ struct cfs_rq {
unsigned long load_contribution;
- u64 runnable_load_avg;
+ u64 runnable_load_avg, blocked_load_avg;
+ atomic64_t decay_counter;
+ u64 last_decay;
#endif /* CONFIG_SMP */
#ifdef CONFIG_CFS_BANDWIDTH
int runtime_enabled;
When a running entity blocks we migrate its tracked load to
cfs_rq->blocked_load_avg. In the sleep case this occurs while holding
rq->lock and so is a natural transition. Wake-ups, however, are potentially
asynchronous in the presence of migration and so special care must be taken.
We use an atomic counter to track such migrated load, taking care to match this
with the previously introduced decay counters so that we don't migrate too much
load.
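A minimal sketch of the accumulate-then-fold pattern used here (illustrative
user-space C with C11 atomics standing in for atomic64_t; names are not the
kernel's):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t removed_load;	/* per-cfs_rq in the series */
static uint64_t blocked_load_avg;	/* only touched under rq->lock */

/* migration path: may run without holding the source rq->lock */
static void remove_load_async(uint64_t contrib)
{
	atomic_fetch_add(&removed_load, contrib);
}

/* next blocked-load update on the source cfs_rq, under rq->lock */
static void fold_removed_load(void)
{
	uint64_t removed = atomic_exchange(&removed_load, 0);

	if (removed <= blocked_load_avg)
		blocked_load_avg -= removed;
	else
		blocked_load_avg = 0;	/* guard against over-subtraction */
}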
Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Ben Segall <[email protected]>
---
include/linux/sched.h | 2 +
kernel/sched/fair.c | 98 +++++++++++++++++++++++++++++++++++++++++--------
kernel/sched/sched.h | 7 +++-
3 files changed, 90 insertions(+), 17 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 70eae51..09b8c45 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1160,6 +1160,8 @@ struct sched_avg {
u64 runnable_avg_sum, runnable_avg_period;
u64 last_runnable_update, decay_count;
unsigned long load_avg_contrib;
+
+ int contributes_blocked_load;
};
#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cc4ec4b..c9a8f6d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1081,17 +1081,19 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
}
 /* Synchronize an entity's decay with its parenting cfs_rq. */
-static inline void __synchronize_entity_decay(struct sched_entity *se)
+static inline u64 __synchronize_entity_decay(struct sched_entity *se)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
u64 decays = atomic64_read(&cfs_rq->decay_counter);
decays -= se->avg.decay_count;
if (!decays)
- return;
+ return 0;
se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
se->avg.decay_count += decays;
+
+ return decays;
}
/* Compute the current contribution to load_avg by se, return any delta */
@@ -1144,20 +1146,26 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 * Decay the load contributed by all blocked children and account this so
 * that their contribution may be appropriately discounted when they wake up.
*/
-static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq)
+static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
{
u64 now = rq_of(cfs_rq)->clock_task >> 20;
u64 decays;
decays = now - cfs_rq->last_decay;
- if (!decays)
+ if (!decays && !force_update)
return;
- cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
- decays);
- atomic64_add(decays, &cfs_rq->decay_counter);
+ if (atomic64_read(&cfs_rq->removed_load)) {
+ u64 removed_load = atomic64_xchg(&cfs_rq->removed_load, 0);
+ subtract_blocked_load_contrib(cfs_rq, removed_load);
+ }
- cfs_rq->last_decay = now;
+ if (decays) {
+ cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
+ decays);
+ atomic64_add(decays, &cfs_rq->decay_counter);
+ cfs_rq->last_decay = now;
+ }
}
static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
@@ -1170,14 +1178,34 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
struct sched_entity *se,
int wakeup)
{
- __synchronize_entity_decay(se);
+ /* we track migrations using entity decay_count == 0 */
+ if (unlikely(se->avg.decay_count <= 0)) {
+ se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
+ if (se->avg.decay_count) {
+ /*
+ * In a wake-up migration we have to approximate the
+ * time sleeping.
+ */
+ se->avg.last_runnable_update -= (-se->avg.decay_count)
+ << 20;
+ update_entity_load_avg(se, 0);
+ }
+ se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
+ wakeup = 0;
+ } else {
+ __synchronize_entity_decay(se);
+ }
- if (wakeup)
+ /* migrated tasks did not contribute to our blocked load */
+ if (wakeup) {
+ se->avg.contributes_blocked_load = 0;
subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+ update_entity_load_avg(se, 0);
+ }
- update_entity_load_avg(se, 0);
cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
- update_cfs_rq_blocked_load(cfs_rq);
+ /* we force update consideration on load-balancer moves */
+ update_cfs_rq_blocked_load(cfs_rq, !wakeup);
}
/*
@@ -1190,14 +1218,38 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
int sleep)
{
update_entity_load_avg(se, 1);
+ /* we force update consideration on load-balancer moves */
+ update_cfs_rq_blocked_load(cfs_rq, !sleep);
cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
if (sleep) {
+ se->avg.contributes_blocked_load = 1;
cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
+ } else {
+ se->avg.contributes_blocked_load = 0;
+ se->avg.decay_count = 0;
}
}
+/*
+ * Accumulate removed load so that it can be processed when we next update
+ * owning cfs_rq under rq->lock.
+ */
+inline void remove_task_load_avg_async(struct task_struct *p)
+{
+ struct sched_entity *se = &p->se;
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+ if (!se->avg.contributes_blocked_load)
+ return;
+
+ se->avg.contributes_blocked_load = 0;
+ __synchronize_entity_decay(se);
+ se->avg.decay_count = 0;
+ atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
+}
+
#else
 static inline void update_entity_load_avg(struct sched_entity *se,
int update_cfs_rq) {}
@@ -1208,6 +1260,7 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
struct sched_entity *se,
int sleep) {}
+inline void remove_task_load_avg_async(struct task_struct *p) {}
#endif
static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -1599,7 +1652,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
*/
if (likely(curr->avg.last_runnable_update)) {
update_entity_load_avg(curr, 1);
- update_cfs_rq_blocked_load(cfs_rq);
+ update_cfs_rq_blocked_load(cfs_rq, 1);
}
#endif
@@ -3547,7 +3600,7 @@ static int update_shares_cpu(struct task_group *tg, int cpu)
update_rq_clock(rq);
update_cfs_load(cfs_rq, 1);
- update_cfs_rq_blocked_load(cfs_rq);
+ update_cfs_rq_blocked_load(cfs_rq, 1);
/*
* We need to update shares after updating tg->load_weight in
@@ -5617,12 +5670,14 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#endif
#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
atomic64_set(&cfs_rq->decay_counter, 1);
+ atomic64_set(&cfs_rq->removed_load, 0);
#endif
}
#ifdef CONFIG_FAIR_GROUP_SCHED
static void task_move_group_fair(struct task_struct *p, int on_rq)
{
+ struct cfs_rq *cfs_rq;
/*
* If the task was not on the rq at the time of this cgroup movement
* it must have been asleep, sleeping tasks keep their ->vruntime
@@ -5654,8 +5709,19 @@ static void task_move_group_fair(struct task_struct *p, int on_rq)
if (!on_rq)
p->se.vruntime -= cfs_rq_of(&p->se)->min_vruntime;
set_task_rq(p, task_cpu(p));
- if (!on_rq)
- p->se.vruntime += cfs_rq_of(&p->se)->min_vruntime;
+ if (!on_rq) {
+ cfs_rq = cfs_rq_of(&p->se);
+ p->se.vruntime += cfs_rq->min_vruntime;
+#ifdef CONFIG_SMP
+ /*
+	 * set_task_rq() will have removed our previous contribution,
+ * but we must synchronize explicitly against further decay
+ * here.
+ */
+ p->se.avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
+ cfs_rq->blocked_load_avg += p->se.avg.load_avg_contrib;
+#endif
+ }
}
void free_fair_sched_group(struct task_group *tg)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2c19c26..9f45b49 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -272,7 +272,7 @@ struct cfs_rq {
unsigned long load_contribution;
u64 runnable_load_avg, blocked_load_avg;
- atomic64_t decay_counter;
+ atomic64_t decay_counter, removed_load;
u64 last_decay;
#endif /* CONFIG_SMP */
#ifdef CONFIG_CFS_BANDWIDTH
@@ -570,6 +570,8 @@ static inline struct task_group *task_group(struct task_struct *p)
return autogroup_task_group(p, tg);
}
+inline void remove_task_load_avg_async(struct task_struct *p);
+
/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
{
@@ -578,6 +580,9 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
#endif
#ifdef CONFIG_FAIR_GROUP_SCHED
+ /* if we're migrating a sleeping task we need to remove its load */
+ remove_task_load_avg_async(p);
+
p->se.cfs_rq = tg->cfs_rq[cpu];
p->se.parent = tg->se[cpu];
#endif
__update_entity_runnable_avg() forms the core of maintaining an entity's
runnable load average. In this function we charge the accumulated run-time
since the last update and handle the appropriate decay. In some cases, e.g. a
waking task, this time interval may be much larger than our period unit.
Fortunately we can exploit some properties of our series to perform decay for a
blocked update in constant time and account the contribution for a running
update in essentially-constant* time.
[*]: Any running entity should be performing updates at the tick, which
gives us a soft limit of 1 jiffy between updates, and we can compute up to a
32-jiffy update in a single pass.
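As an illustrative check on the constant-time version below:
runnable_avg_yN_inv[n] ~= y^n * 2^32, so decay_load(1024, 1) ~=
(1024 * 0xfa83b2db) >> 32 ~= 1002 (which also matches runnable_avg_yN_sum[1]),
while for n = 36 the shift handles whole half-lives (1024 >> (36/32) = 512)
and the table supplies the residual y^(36%32) = y^4 ~= 0.917, giving roughly
469; the old loop needed 36 dependent multiply/shift iterations to approximate
the same value.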
Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Ben Segall <[email protected]>
---
kernel/sched/fair.c | 100 ++++++++++++++++++++++++++++++++++++++++++---------
1 files changed, 82 insertions(+), 18 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9e2c9a4..ad524bb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -896,17 +896,77 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
/*
+ * We choose a half-life close to 1 scheduling period.
+ * Note: The tables below are dependent on this value.
+ */
+#define LOAD_AVG_PERIOD 32
+
+/* Precomputed fixed inverse multiplies for multiplication by y^n */
+const static u32 runnable_avg_yN_inv[] = {
+ 0xffffffff,0xfa83b2db,0xf5257d15,0xefe4b99b,0xeac0c6e7,0xe5b906e7,
+ 0xe0ccdeec,0xdbfbb797,0xd744fcca,0xd2a81d91,0xce248c15,0xc9b9bd86,
+ 0xc5672a11,0xc12c4cca,0xbd08a39f,0xb8fbaf47,0xb504f333,0xb123f581,
+ 0xad583eea,0xa9a15ab4,0xa5fed6a9,0xa2704303,0x9ef53260,0x9b8d39b9,
+ 0x9837f051,0x94f4efa8,0x91c3d373,0x8ea4398b,0x8b95c1e3,0x88980e80,
+ 0x85aac367,0x82cd8698,
+};
+
+/* Precomputed \Sum y^k { 1<=k<=n } */
+const static u32 runnable_avg_yN_sum[] = {
+ 0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
+ 9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
+ 17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
+};
+
+/*
* Approximate:
* val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
*/
static __always_inline u64 decay_load(u64 val, int n)
{
- for (;n && val;n--) {
- val *= 4008;
- val >>= 12;
+ if (!n)
+ return val;
+
+ /*
+ * As y^PERIOD = 1/2, we can combine
+	 *    y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
+	 * With a look-up table which covers y^n (n < PERIOD)
+ *
+ * To achieve constant time decay_load.
+ */
+ if (unlikely(n >= LOAD_AVG_PERIOD)) {
+ val >>= n/LOAD_AVG_PERIOD;
+ n %= LOAD_AVG_PERIOD;
}
- return val;
+ val *= runnable_avg_yN_inv[n];
+ return SRR(val, 32);
+}
+
+/*
+ * For updates fully spanning n periods, the contribution to runnable
+ * average will be: \Sum 1024*y^n
+ *
+ * We can compute this reasonably efficiently by combining:
+ * y^PERIOD = 1/2 with precomputed \Sum 1024*y^n {for n <PERIOD}
+ */
+static u32 __compute_runnable_contrib(int n)
+{
+ u32 contrib = 0;
+
+ if (likely(n<=LOAD_AVG_PERIOD))
+ return runnable_avg_yN_sum[n];
+
+ /* Compute \Sum k^n combining precomputed values for k^i, \Sum k^j */
+ do {
+ contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
+ contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];
+
+ n -= LOAD_AVG_PERIOD;
+ } while (n>LOAD_AVG_PERIOD);
+
+ contrib = decay_load(contrib, n);
+ return contrib + runnable_avg_yN_sum[n];
}
/* We can represent the historical contribution to runnable average as the
@@ -940,6 +1000,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
int runnable)
{
u64 delta;
+ u32 periods, runnable_contrib;
int delta_w, decayed = 0;
delta = now - sa->last_runnable_update;
@@ -963,20 +1024,23 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
decayed = 1;
delta_w = 1024 - delta_w;
- BUG_ON(delta_w > delta);
- do {
- if (runnable)
- sa->runnable_avg_sum += delta_w;
- sa->runnable_avg_period += delta_w;
-
- sa->runnable_avg_sum =
- decay_load(sa->runnable_avg_sum, 1);
- sa->runnable_avg_period =
- decay_load(sa->runnable_avg_period, 1);
-
- delta -= delta_w;
- delta_w = 1024;
- } while (delta >= 1024);
+ if (runnable)
+ sa->runnable_avg_sum += delta_w;
+ sa->runnable_avg_period += delta_w;
+
+ delta -= delta_w;
+ periods = delta / 1024;
+ delta %= 1024;
+
+ sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
+ periods + 1);
+ sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
+ periods + 1);
+
+ runnable_contrib = __compute_runnable_contrib(periods);
+ if (runnable)
+ sa->runnable_avg_sum += runnable_contrib;
+ sa->runnable_avg_period += runnable_contrib;
}
if (runnable)
Now that our measurement intervals are small (~1ms), we can amortize the
posting of update_shares() to roughly once per period overflow. This is a
large cost saving for frequently switching tasks.
Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Ben Segall <[email protected]>
---
kernel/sched/fair.c | 18 ++++++++++--------
1 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b6ee7a4..9e2c9a4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1141,6 +1141,7 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
}
__update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
+ update_cfs_shares(cfs_rq);
}
static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
@@ -1362,9 +1363,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
- enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
account_entity_enqueue(cfs_rq, se);
- update_cfs_shares(cfs_rq);
+ enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
if (flags & ENQUEUE_WAKEUP) {
place_entity(cfs_rq, se, 0);
@@ -1437,7 +1437,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
- dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
update_stats_dequeue(cfs_rq, se);
if (flags & DEQUEUE_SLEEP) {
@@ -1457,8 +1456,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
- se->on_rq = 0;
account_entity_dequeue(cfs_rq, se);
+ dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
/*
* Normalize the entity after updating the min_vruntime because the
@@ -1472,7 +1471,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
return_cfs_rq_runtime(cfs_rq);
update_min_vruntime(cfs_rq);
- update_cfs_shares(cfs_rq);
+ se->on_rq = 0;
}
/*
@@ -2488,8 +2487,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_throttled(cfs_rq))
break;
- update_cfs_shares(cfs_rq);
update_entity_load_avg(se, 1);
+ update_cfs_rq_blocked_load(cfs_rq, 0);
}
if (!se) {
@@ -2549,8 +2548,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_throttled(cfs_rq))
break;
- update_cfs_shares(cfs_rq);
update_entity_load_avg(se, 1);
+ update_cfs_rq_blocked_load(cfs_rq, 0);
}
if (!se) {
@@ -5836,8 +5835,11 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
se = tg->se[i];
/* Propagate contribution to hierarchy */
raw_spin_lock_irqsave(&rq->lock, flags);
- for_each_sched_entity(se)
+ for_each_sched_entity(se) {
update_cfs_shares(group_cfs_rq(se));
+ /* update contribution to parent */
+ update_entity_load_avg(se, 1);
+ }
raw_spin_unlock_irqrestore(&rq->lock, flags);
}
With the framework for runnable tracking now fully in place, per-entity usage
tracking is a simple and low-overhead addition.
Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Ben Segall <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/debug.c | 3 +++
kernel/sched/fair.c | 29 +++++++++++++++++++++++------
kernel/sched/sched.h | 4 ++--
4 files changed, 29 insertions(+), 8 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 09b8c45..209185f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1159,6 +1159,7 @@ struct load_weight {
struct sched_avg {
u64 runnable_avg_sum, runnable_avg_period;
u64 last_runnable_update, decay_count;
+ u32 usage_avg_sum;
unsigned long load_avg_contrib;
int contributes_blocked_load;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 0911ec6..4d39069 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -93,6 +93,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
P(se->load.weight);
P(se->avg.runnable_avg_sum);
P(se->avg.runnable_avg_period);
+ P(se->avg.usage_avg_sum);
P(se->avg.load_avg_contrib);
P(se->avg.decay_count);
#undef PN
@@ -228,6 +229,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->tg_runnable_contrib);
SEQ_printf(m, " .%-30s: %d\n", "tg->runnable_avg",
atomic_read(&cfs_rq->tg->runnable_avg));
+ SEQ_printf(m, " .%-30s: %d\n", "tg->usage_avg",
+ atomic_read(&cfs_rq->tg->usage_avg));
#endif
print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ad524bb..222c2c9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -997,7 +997,8 @@ static u32 __compute_runnable_contrib(int n)
*/
static __always_inline int __update_entity_runnable_avg(u64 now,
struct sched_avg *sa,
- int runnable)
+ int runnable,
+ int running)
{
u64 delta;
u32 periods, runnable_contrib;
@@ -1026,6 +1027,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
delta_w = 1024 - delta_w;
if (runnable)
sa->runnable_avg_sum += delta_w;
+ if (running)
+ sa->usage_avg_sum += delta_w;
sa->runnable_avg_period += delta_w;
delta -= delta_w;
@@ -1036,15 +1039,20 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
periods + 1);
sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
periods + 1);
+ sa->usage_avg_sum = decay_load(sa->usage_avg_sum, periods + 1);
runnable_contrib = __compute_runnable_contrib(periods);
if (runnable)
sa->runnable_avg_sum += runnable_contrib;
+ if (running)
+ sa->usage_avg_sum += runnable_contrib;
sa->runnable_avg_period += runnable_contrib;
}
if (runnable)
sa->runnable_avg_sum += delta;
+ if (running)
+ sa->usage_avg_sum += delta;
sa->runnable_avg_period += delta;
return decayed;
@@ -1081,14 +1089,21 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
struct cfs_rq *cfs_rq)
{
struct task_group *tg = cfs_rq->tg;
- long contrib;
+ long contrib, usage_contrib;
contrib = (sa->runnable_avg_sum << 12) / (sa->runnable_avg_period + 1);
contrib -= cfs_rq->tg_runnable_contrib;
- if (abs(contrib) > cfs_rq->tg_runnable_contrib/64) {
+ usage_contrib = (sa->usage_avg_sum << 12) / (sa->runnable_avg_period + 1);
+ usage_contrib -= cfs_rq->tg_usage_contrib;
+
+ if ((abs(contrib) > cfs_rq->tg_runnable_contrib/64) ||
+ (abs(usage_contrib) > cfs_rq->tg_usage_contrib/64)) {
atomic_add(contrib, &tg->runnable_avg);
cfs_rq->tg_runnable_contrib += contrib;
+
+ atomic_add(usage_contrib, &tg->usage_avg);
+ cfs_rq->tg_usage_contrib += usage_contrib;
}
}
@@ -1164,8 +1179,8 @@ static inline void update_entity_load_avg(struct sched_entity *se,
struct cfs_rq *cfs_rq = cfs_rq_of(se);
long contrib_delta;
- if(!__update_entity_runnable_avg(rq_of(cfs_rq)->clock_task, &se->avg,
- se->on_rq))
+ if(!__update_entity_runnable_avg(cfs_rq_clock_task(cfs_rq), &se->avg,
+ se->on_rq, cfs_rq->curr == se))
return;
contrib_delta = __update_entity_load_avg_contrib(se);
@@ -1210,7 +1225,8 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
{
- __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
+ __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable,
+ runnable);
__update_tg_runnable_avg(&rq->avg, &rq->cfs);
}
@@ -1590,6 +1606,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
update_stats_wait_end(cfs_rq, se);
__dequeue_entity(cfs_rq, se);
+ update_entity_load_avg(se, 1);
}
update_stats_curr_start(cfs_rq, se);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 77f3297..9602a47 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -117,7 +117,7 @@ struct task_group {
atomic_t load_weight;
atomic64_t load_avg;
- atomic_t runnable_avg;
+ atomic_t runnable_avg, usage_avg;
#endif
#ifdef CONFIG_RT_GROUP_SCHED
@@ -260,7 +260,7 @@ struct cfs_rq {
*/
unsigned long h_load;
- u32 tg_runnable_contrib;
+ u32 tg_runnable_contrib, tg_usage_contrib;
u64 runnable_load_avg, blocked_load_avg;
u64 tg_load_contrib;
atomic64_t decay_counter, removed_load;
Entities of equal weight should receive equitable distribution of cpu time.
This is challenging in the case of a task_group's shares as execution may be
occurring on multiple cpus simultaneously.
To handle this we divide up the shares into weights proportionate with the load
on each cfs_rq. This does not, however, account for the fact that the sum of
the parts may be less than one cpu, and so we need to normalize:
  load(tg) = min(runnable_avg(tg), 1) * tg->shares
Where runnable_avg is the aggregate fraction of time in which the task_group
had runnable children.
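Illustrative numbers (not from the changelog): a group with tg->shares = 1024
whose members are, in aggregate, runnable 50% of the time has runnable_avg =
0.5 (2048 on the 2^12 scale used by the code), so it contributes ~512, the
same as a single task of weight 1024 running 50% of the time, whereas without
the correction it would contribute the full 1024.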
Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Ben Segall <[email protected]>
---
kernel/sched/debug.c | 4 ++++
kernel/sched/fair.c | 34 ++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 2 ++
3 files changed, 40 insertions(+), 0 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f6227c0..8d87796 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -232,6 +232,10 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
atomic64_read(&cfs_rq->tg->load_avg));
SEQ_printf(m, " .%-30s: %lld\n", "tg_load_contrib",
cfs_rq->tg_load_contrib);
+ SEQ_printf(m, " .%-30s: %d\n", "tg_runnable_contrib",
+ cfs_rq->tg_runnable_contrib);
+ SEQ_printf(m, " .%-30s: %d\n", "tg->runnable_avg",
+ atomic_read(&cfs_rq->tg->runnable_avg));
#endif
print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d8af5e..803c622 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1103,13 +1103,45 @@ static inline void __update_task_entity_contrib(struct sched_entity *se)
se->avg.runnable_avg_period + 1);
}
+/*
+ * Aggregate cfs_rq runnable averages into an equivalent task_group
+ * representation for computing load contributions.
+ */
+static inline void __update_tg_runnable_avg(struct sched_avg *sa,
+ struct cfs_rq *cfs_rq)
+{
+ struct task_group *tg = cfs_rq->tg;
+ long contrib;
+
+ contrib = (sa->runnable_avg_sum << 12) / (sa->runnable_avg_period + 1);
+ contrib -= cfs_rq->tg_runnable_contrib;
+
+ if (abs(contrib) > cfs_rq->tg_runnable_contrib/64) {
+ atomic_add(contrib, &tg->runnable_avg);
+ cfs_rq->tg_runnable_contrib += contrib;
+ }
+}
+
static inline void __update_group_entity_contrib(struct sched_entity *se)
{
struct cfs_rq *cfs_rq = group_cfs_rq(se);
struct task_group *tg = cfs_rq->tg;
+ int runnable_avg;
se->avg.load_avg_contrib = (cfs_rq->tg_load_contrib * tg->shares);
se->avg.load_avg_contrib /= atomic64_read(&tg->load_avg) + 1;
+
+ /*
+ * Unlike a task-entity, a group entity may be using >=1 cpu globally.
+ * However, in the case that it's using <1 cpu we need to form a
+ * correction term so that we contribute the same load as a task of
+ * equal weight. (Global runnable time is taken as a fraction over 2^12.)
+ */
+ runnable_avg = atomic_read(&tg->runnable_avg);
+ if (runnable_avg < (1<<12)) {
+ se->avg.load_avg_contrib *= runnable_avg;
+ se->avg.load_avg_contrib /= (1<<12);
+ }
}
/* Compute the current contribution to load_avg by se, return any delta */
@@ -1122,6 +1154,7 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
} else {
if (!se->on_rq)
__synchronize_entity_decay(se);
+ __update_tg_runnable_avg(&se->avg, group_cfs_rq(se));
__update_group_entity_contrib(se);
}
@@ -1205,6 +1238,7 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
{
__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
+ __update_tg_runnable_avg(&rq->avg, &rq->cfs);
}
/* Add the load generated by se into cfs_rq's child load-average */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 17f99e7..57cc227 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -117,6 +117,7 @@ struct task_group {
atomic_t load_weight;
atomic64_t load_avg;
+ atomic_t runnable_avg;
#endif
#ifdef CONFIG_RT_GROUP_SCHED
@@ -272,6 +273,7 @@ struct cfs_rq {
unsigned long load_contribution;
+ u32 tg_runnable_contrib;
u64 runnable_load_avg, blocked_load_avg;
u64 tg_load_contrib;
atomic64_t decay_counter, removed_load;
Unlike task entities, which have a fixed weight, group entities instead own a
fraction of their parenting task_group's shares as their contributed weight.
Compute this fraction so that we can correctly account for hierarchies and
shared entity nodes.
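For example (illustrative numbers): if tg->shares = 2048 and tg->load_avg
totals 3072 across all cpus, of which this cfs_rq's tg_load_contrib is 1024,
then the group entity on this cpu contributes roughly 2048 * 1024 / 3072 ~= 682,
i.e. one third of the group's shares, matching its share of the group's load.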
Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Ben Segall <[email protected]>
---
kernel/sched/fair.c | 29 +++++++++++++++++++++++------
1 files changed, 23 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7771003..6d8af5e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1096,17 +1096,34 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
return decays;
}
+static inline void __update_task_entity_contrib(struct sched_entity *se)
+{
+ se->avg.load_avg_contrib = div64_u64(se->avg.runnable_avg_sum *
+ se->load.weight,
+ se->avg.runnable_avg_period + 1);
+}
+
+static inline void __update_group_entity_contrib(struct sched_entity *se)
+{
+ struct cfs_rq *cfs_rq = group_cfs_rq(se);
+ struct task_group *tg = cfs_rq->tg;
+
+ se->avg.load_avg_contrib = (cfs_rq->tg_load_contrib * tg->shares);
+ se->avg.load_avg_contrib /= atomic64_read(&tg->load_avg) + 1;
+}
+
/* Compute the current contribution to load_avg by se, return any delta */
static long __update_entity_load_avg_contrib(struct sched_entity *se)
{
long old_contrib = se->avg.load_avg_contrib;
- if (!entity_is_task(se))
- return 0;
-
- se->avg.load_avg_contrib = div64_u64(se->avg.runnable_avg_sum *
- se->load.weight,
- se->avg.runnable_avg_period + 1);
+ if (entity_is_task(se)) {
+ __update_task_entity_contrib(se);
+ } else {
+ if (!se->on_rq)
+ __synchronize_entity_decay(se);
+ __update_group_entity_contrib(se);
+ }
return se->avg.load_avg_contrib - old_contrib;
}
With bandwidth control, tracked entities may cease execution according to
user-specified bandwidth limits. Charging this time as either throttled or
blocked, however, is incorrect and would falsely skew in either direction.
What we actually want is for any throttled periods to be "invisible" to
load-tracking as they are removed from the system for that interval and
contribute normally otherwise.
Do this by moderating the progression of time to omit any periods in which the
entity belonged to a throttled hierarchy.
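For example (illustrative): if a hierarchy spends 30ms of a 100ms wall-clock
interval throttled, cfs_rq_clock_task() below advances only 70ms over that
interval (and stands still at throttled_clock_task while the throttle is in
effect), so the blocked-load decay never observes "idle" periods that are
really just enforcement of the bandwidth limit.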
Signed-off-by: Paul Turner <[email protected]>
---
kernel/sched/fair.c | 33 +++++++++++++++++++++++++++------
kernel/sched/sched.h | 3 ++-
2 files changed, 29 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 803c622..71c7410 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1185,6 +1185,8 @@ static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
cfs_rq->blocked_load_avg = 0;
}
+static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
+
/* Update a sched_entity's runnable average */
static inline void update_entity_load_avg(struct sched_entity *se,
int update_cfs_rq)
@@ -1213,7 +1215,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
*/
static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
{
- u64 now = rq_of(cfs_rq)->clock_task >> 20;
+ u64 now = cfs_rq_clock_task(cfs_rq) >> 20;
u64 decays;
decays = now - cfs_rq->last_decay;
@@ -1820,6 +1822,15 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
return &tg->cfs_bandwidth;
}
+/* rq->clock_task normalized against any time this cfs_rq has spent throttled */
+static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
+{
+ if (unlikely(cfs_rq->throttle_count))
+ return cfs_rq->throttled_clock_task;
+
+ return rq_of(cfs_rq)->clock_task - cfs_rq->throttled_clock_task_time;
+}
+
/* returns 0 on failure to allocate runtime */
static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
@@ -1970,6 +1981,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
cfs_rq->load_stamp += delta;
cfs_rq->load_last += delta;
+ /* adjust cfs_rq_clock_task() */
+ cfs_rq->throttled_clock_task_time += rq->clock_task -
+ cfs_rq->throttled_clock_task;
+
/* update entity weight now that we are on_rq again */
update_cfs_shares(cfs_rq);
}
@@ -1984,8 +1999,10 @@ static int tg_throttle_down(struct task_group *tg, void *data)
struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
/* group is entering throttled state, record last load */
- if (!cfs_rq->throttle_count)
+ if (!cfs_rq->throttle_count) {
update_cfs_load(cfs_rq, 0);
+ cfs_rq->throttled_clock_task = rq->clock_task;
+ }
cfs_rq->throttle_count++;
return 0;
@@ -2000,7 +2017,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
- /* account load preceding throttle */
+ /* freeze hierarchy runnable averages while throttled */
rcu_read_lock();
walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
rcu_read_unlock();
@@ -2024,7 +2041,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
rq->nr_running -= task_delta;
cfs_rq->throttled = 1;
- cfs_rq->throttled_timestamp = rq->clock;
+ cfs_rq->throttled_clock = rq->clock;
raw_spin_lock(&cfs_b->lock);
list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
raw_spin_unlock(&cfs_b->lock);
@@ -2042,10 +2059,9 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
cfs_rq->throttled = 0;
raw_spin_lock(&cfs_b->lock);
- cfs_b->throttled_time += rq->clock - cfs_rq->throttled_timestamp;
+ cfs_b->throttled_time += rq->clock - cfs_rq->throttled_clock;
list_del_rcu(&cfs_rq->throttled_list);
raw_spin_unlock(&cfs_b->lock);
- cfs_rq->throttled_timestamp = 0;
update_rq_clock(rq);
/* update hierarchical throttle state */
@@ -2445,6 +2461,11 @@ void unthrottle_offline_cfs_rqs(struct rq *rq)
}
#else /* CONFIG_CFS_BANDWIDTH */
+static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
+{
+ return rq_of(cfs_rq)->clock_task;
+}
+
static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
unsigned long delta_exec) {}
static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 57cc227..a823ca4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -284,7 +284,8 @@ struct cfs_rq {
u64 runtime_expires;
s64 runtime_remaining;
- u64 throttled_timestamp;
+ u64 throttled_clock, throttled_clock_task;
+ u64 throttled_clock_task_time;
int throttled, throttle_count;
struct list_head throttled_list;
#endif /* CONFIG_CFS_BANDWIDTH */
Now that running entities maintain their own load-averages the work we must do
in update_shares() is largely restricted to the periodic decay of blocked
entities. This allows us to be a little less pessimistic regarding our
occupancy on rq->lock and the associated rq->clock updates required.
We also no longer use the instantaneous load to calculate group-entity
load-weight. This was previously required as a compromise, since migrated load
would remain in the parenting average while it decayed; however, it leads to
over-allocation of shares.
Signed-off-by: Paul Turner <[email protected]>
---
kernel/sched/fair.c | 57 ++++++++++++++++++++++++++++-----------------------
1 files changed, 31 insertions(+), 26 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 46b3f98..5b6ee7a4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -823,7 +823,7 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
*/
tg_weight = atomic64_read(&tg->load_avg);
tg_weight -= cfs_rq->tg_load_contrib;
- tg_weight += cfs_rq->load.weight;
+ tg_weight += cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
return tg_weight;
}
@@ -833,7 +833,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
long tg_weight, load, shares;
tg_weight = calc_tg_weight(tg, cfs_rq);
- load = cfs_rq->load.weight;
+ load = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
shares = (tg->shares * load);
if (tg_weight)
@@ -3559,21 +3559,16 @@ out:
/*
* update tg->load_weight by folding this cpu's load_avg
*/
-static int update_shares_cpu(struct task_group *tg, int cpu)
+static void __update_blocked_averages_cpu(struct task_group *tg, int cpu)
{
- struct sched_entity *se;
- struct cfs_rq *cfs_rq;
- unsigned long flags;
+ struct sched_entity *se = tg->se[cpu];
+ struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
struct rq *rq;
+ /* throttled entities do not contribute to load */
+ if (throttled_hierarchy(cfs_rq))
+ return;
- rq = cpu_rq(cpu);
- se = tg->se[cpu];
- cfs_rq = tg->cfs_rq[cpu];
-
- raw_spin_lock_irqsave(&rq->lock, flags);
-
- update_rq_clock(rq);
update_cfs_rq_blocked_load(cfs_rq, 1);
if (se) {
@@ -3586,29 +3581,39 @@ static int update_shares_cpu(struct task_group *tg, int cpu)
else
list_del_leaf_cfs_rq(cfs_rq);
}
-
- raw_spin_unlock_irqrestore(&rq->lock, flags);
-
- return 0;
}
-static void update_shares(int cpu)
+static void update_blocked_averages(int cpu)
{
- struct cfs_rq *cfs_rq;
struct rq *rq = cpu_rq(cpu);
+ struct cfs_rq *cfs_rq;
+
+ unsigned long flags;
+ int num_updates = 0;
rcu_read_lock();
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ update_rq_clock(rq);
/*
* Iterates the task_group tree in a bottom up fashion, see
* list_add_leaf_cfs_rq() for details.
*/
for_each_leaf_cfs_rq(rq, cfs_rq) {
- /* throttled entities do not contribute to load */
- if (throttled_hierarchy(cfs_rq))
- continue;
+ __update_blocked_averages_cpu(cfs_rq->tg, rq->cpu);
- update_shares_cpu(cfs_rq->tg, cpu);
+ /*
+ * Periodically release the lock so that a cfs_rq with many
+ * children cannot hold it for an arbitrary period of time.
+ */
+ if (num_updates++ % 20 == 0) {
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+ cpu_relax();
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ update_rq_clock(rq);
+ }
}
+
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
rcu_read_unlock();
}
@@ -3689,7 +3694,7 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
return max_load_move - rem_load_move;
}
#else
-static inline void update_shares(int cpu)
+static inline void update_blocked_averages(int cpu)
{
}
@@ -4917,7 +4922,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
*/
raw_spin_unlock(&this_rq->lock);
- update_shares(this_cpu);
+ update_blocked_averages(this_cpu);
rcu_read_lock();
for_each_domain(this_cpu, sd) {
unsigned long interval;
@@ -5258,7 +5263,7 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
int update_next_balance = 0;
int need_serialize;
- update_shares(cpu);
+ update_blocked_averages(cpu);
rcu_read_lock();
for_each_domain(cpu, sd) {
Now that the machinery is in place to compute contributed load in a bottom-up
fashion, replace the shares distribution code within update_shares()
accordingly.
Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Ben Segall <[email protected]>
---
kernel/sched/debug.c | 8 ---
kernel/sched/fair.c | 150 ++++++--------------------------------------------
kernel/sched/sched.h | 13 ----
3 files changed, 18 insertions(+), 153 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8d87796..0911ec6 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -216,14 +216,6 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SEQ_printf(m, " .%-30s: %ld\n", "load", cfs_rq->load.weight);
#ifdef CONFIG_FAIR_GROUP_SCHED
#ifdef CONFIG_SMP
- SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "load_avg",
- SPLIT_NS(cfs_rq->load_avg));
- SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "load_period",
- SPLIT_NS(cfs_rq->load_period));
- SEQ_printf(m, " .%-30s: %ld\n", "load_contrib",
- cfs_rq->load_contribution);
- SEQ_printf(m, " .%-30s: %d\n", "load_tg",
- atomic_read(&cfs_rq->tg->load_weight));
SEQ_printf(m, " .%-30s: %lld\n", "runnable_load_avg",
cfs_rq->runnable_load_avg);
SEQ_printf(m, " .%-30s: %lld\n", "blocked_load_avg",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 71c7410..46b3f98 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -655,9 +655,6 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
return calc_delta_fair(sched_slice(cfs_rq, se), se);
}
-static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update);
-static void update_cfs_shares(struct cfs_rq *cfs_rq);
-
/*
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
@@ -677,10 +674,6 @@ __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
curr->vruntime += delta_exec_weighted;
update_min_vruntime(cfs_rq);
-
-#if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED
- cfs_rq->load_unacc_exec_time += delta_exec;
-#endif
}
static void update_curr(struct cfs_rq *cfs_rq)
@@ -818,72 +811,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
}
#ifdef CONFIG_FAIR_GROUP_SCHED
-/* we need this in update_cfs_load and load-balance functions below */
-static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
# ifdef CONFIG_SMP
-static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
- int global_update)
-{
- struct task_group *tg = cfs_rq->tg;
- long load_avg;
-
- load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
- load_avg -= cfs_rq->load_contribution;
-
- if (global_update || abs(load_avg) > cfs_rq->load_contribution / 8) {
- atomic_add(load_avg, &tg->load_weight);
- cfs_rq->load_contribution += load_avg;
- }
-}
-
-static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
-{
- u64 period = sysctl_sched_shares_window;
- u64 now, delta;
- unsigned long load = cfs_rq->load.weight;
-
- if (cfs_rq->tg == &root_task_group || throttled_hierarchy(cfs_rq))
- return;
-
- now = rq_of(cfs_rq)->clock_task;
- delta = now - cfs_rq->load_stamp;
-
- /* truncate load history at 4 idle periods */
- if (cfs_rq->load_stamp > cfs_rq->load_last &&
- now - cfs_rq->load_last > 4 * period) {
- cfs_rq->load_period = 0;
- cfs_rq->load_avg = 0;
- delta = period - 1;
- }
-
- cfs_rq->load_stamp = now;
- cfs_rq->load_unacc_exec_time = 0;
- cfs_rq->load_period += delta;
- if (load) {
- cfs_rq->load_last = now;
- cfs_rq->load_avg += delta * load;
- }
-
- /* consider updating load contribution on each fold or truncate */
- if (global_update || cfs_rq->load_period > period
- || !cfs_rq->load_period)
- update_cfs_rq_load_contribution(cfs_rq, global_update);
-
- while (cfs_rq->load_period > period) {
- /*
- * Inline assembly required to prevent the compiler
- * optimising this loop into a divmod call.
- * See __iter_div_u64_rem() for another example of this.
- */
- asm("" : "+rm" (cfs_rq->load_period));
- cfs_rq->load_period /= 2;
- cfs_rq->load_avg /= 2;
- }
-
- if (!cfs_rq->curr && !cfs_rq->nr_running && !cfs_rq->load_avg)
- list_del_leaf_cfs_rq(cfs_rq);
-}
-
static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
{
long tg_weight;
@@ -893,8 +821,8 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
* to gain a more accurate current total weight. See
* update_cfs_rq_load_contribution().
*/
- tg_weight = atomic_read(&tg->load_weight);
- tg_weight -= cfs_rq->load_contribution;
+ tg_weight = atomic64_read(&tg->load_avg);
+ tg_weight -= cfs_rq->tg_load_contrib;
tg_weight += cfs_rq->load.weight;
return tg_weight;
@@ -918,27 +846,11 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
return shares;
}
-
-static void update_entity_shares_tick(struct cfs_rq *cfs_rq)
-{
- if (cfs_rq->load_unacc_exec_time > sysctl_sched_shares_window) {
- update_cfs_load(cfs_rq, 0);
- update_cfs_shares(cfs_rq);
- }
-}
# else /* CONFIG_SMP */
-static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
-{
-}
-
static inline long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
{
return tg->shares;
}
-
-static inline void update_entity_shares_tick(struct cfs_rq *cfs_rq)
-{
-}
# endif /* CONFIG_SMP */
static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
unsigned long weight)
@@ -956,6 +868,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
account_entity_enqueue(cfs_rq, se);
}
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
+
static void update_cfs_shares(struct cfs_rq *cfs_rq)
{
struct task_group *tg;
@@ -975,17 +889,9 @@ static void update_cfs_shares(struct cfs_rq *cfs_rq)
reweight_entity(cfs_rq_of(se), se, shares);
}
#else /* CONFIG_FAIR_GROUP_SCHED */
-static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
-{
-}
-
static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
{
}
-
-static inline void update_entity_shares_tick(struct cfs_rq *cfs_rq)
-{
-}
#endif /* CONFIG_FAIR_GROUP_SCHED */
#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
@@ -1456,7 +1362,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
- update_cfs_load(cfs_rq, 0);
enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
account_entity_enqueue(cfs_rq, se);
update_cfs_shares(cfs_rq);
@@ -1553,7 +1458,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
se->on_rq = 0;
- update_cfs_load(cfs_rq, 0);
account_entity_dequeue(cfs_rq, se);
/*
@@ -1726,11 +1630,6 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
}
#endif
- /*
- * Update share accounting for long-running entities.
- */
- update_entity_shares_tick(cfs_rq);
-
#ifdef CONFIG_SCHED_HRTICK
/*
* queued ticks are scheduled to match the slice, so don't bother
@@ -1975,18 +1874,9 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
cfs_rq->throttle_count--;
#ifdef CONFIG_SMP
if (!cfs_rq->throttle_count) {
- u64 delta = rq->clock_task - cfs_rq->load_stamp;
-
- /* leaving throttled state, advance shares averaging windows */
- cfs_rq->load_stamp += delta;
- cfs_rq->load_last += delta;
-
/* adjust cfs_rq_clock_task() */
cfs_rq->throttled_clock_task_time += rq->clock_task -
cfs_rq->throttled_clock_task;
-
- /* update entity weight now that we are on_rq again */
- update_cfs_shares(cfs_rq);
}
#endif
@@ -1998,11 +1888,9 @@ static int tg_throttle_down(struct task_group *tg, void *data)
struct rq *rq = data;
struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
- /* group is entering throttled state, record last load */
- if (!cfs_rq->throttle_count) {
- update_cfs_load(cfs_rq, 0);
+ /* group is entering throttled state, stop time */
+ if (!cfs_rq->throttle_count)
cfs_rq->throttled_clock_task = rq->clock_task;
- }
cfs_rq->throttle_count++;
return 0;
@@ -2600,7 +2488,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_throttled(cfs_rq))
break;
- update_cfs_load(cfs_rq, 0);
update_cfs_shares(cfs_rq);
update_entity_load_avg(se, 1);
}
@@ -2662,7 +2549,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_throttled(cfs_rq))
break;
- update_cfs_load(cfs_rq, 0);
update_cfs_shares(cfs_rq);
update_entity_load_avg(se, 1);
}
@@ -3675,27 +3561,31 @@ out:
*/
static int update_shares_cpu(struct task_group *tg, int cpu)
{
+ struct sched_entity *se;
struct cfs_rq *cfs_rq;
unsigned long flags;
struct rq *rq;
- if (!tg->se[cpu])
- return 0;
rq = cpu_rq(cpu);
+ se = tg->se[cpu];
cfs_rq = tg->cfs_rq[cpu];
raw_spin_lock_irqsave(&rq->lock, flags);
update_rq_clock(rq);
- update_cfs_load(cfs_rq, 1);
update_cfs_rq_blocked_load(cfs_rq, 1);
- /*
- * We need to update shares after updating tg->load_weight in
- * order to adjust the weight of groups with long running tasks.
- */
- update_cfs_shares(cfs_rq);
+ if (se) {
+ /*
+ * We can pivot on the runnable average decaying to zero for list
+ * removal since the parent average will always be >= child.
+ */
+ if (se->avg.runnable_avg_sum)
+ update_cfs_shares(cfs_rq);
+ else
+ list_del_leaf_cfs_rq(cfs_rq);
+ }
raw_spin_unlock_irqrestore(&rq->lock, flags);
@@ -5895,10 +5785,6 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
cfs_rq->tg = tg;
cfs_rq->rq = rq;
-#ifdef CONFIG_SMP
- /* allow initial update_cfs_load() to truncate */
- cfs_rq->load_stamp = 1;
-#endif
init_cfs_rq_runtime(cfs_rq);
tg->cfs_rq[cpu] = cfs_rq;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a823ca4..77f3297 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -260,19 +260,6 @@ struct cfs_rq {
*/
unsigned long h_load;
- /*
- * Maintaining per-cpu shares distribution for group scheduling
- *
- * load_stamp is the last time we updated the load average
- * load_last is the last time we updated the load average and saw load
- * load_unacc_exec_time is currently unaccounted execution time
- */
- u64 load_avg;
- u64 load_period;
- u64 load_stamp, load_last, load_unacc_exec_time;
-
- unsigned long load_contribution;
-
u32 tg_runnable_contrib;
u64 runnable_load_avg, blocked_load_avg;
u64 tg_load_contrib;
From: Ben Segall <[email protected]>
Since runqueues do not have a corresponding sched_entity we instead embed a
sched_avg structure directly.
Signed-off-by: Ben Segall <[email protected]>
Signed-off-by: Paul Turner <[email protected]>
---
kernel/sched/debug.c | 10 ++++++++--
kernel/sched/fair.c | 18 ++++++++++++++++--
kernel/sched/sched.h | 2 ++
3 files changed, 26 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index d89db32..43e3162 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -61,14 +61,20 @@ static unsigned long nsec_low(unsigned long long nsec)
static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group *tg)
{
struct sched_entity *se = tg->se[cpu];
- if (!se)
- return;
#define P(F) \
SEQ_printf(m, " .%-30s: %lld\n", #F, (long long)F)
#define PN(F) \
SEQ_printf(m, " .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)F))
+ if (!se) {
+ struct sched_avg *avg = &cpu_rq(cpu)->avg;
+ P(avg->runnable_avg_sum);
+ P(avg->runnable_avg_period);
+ return;
+ }
+
+
PN(se->exec_start);
PN(se->vruntime);
PN(se->sum_exec_runtime);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a570e9c..8fa199f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1086,8 +1086,14 @@ static inline void update_entity_load_avg(struct sched_entity *se)
__update_entity_runnable_avg(rq_of(cfs_rq_of(se))->clock_task, &se->avg,
se->on_rq);
}
+
+static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
+{
+ __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
+}
#else
static inline void update_entity_load_avg(struct sched_entity *se) {}
+static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
#endif
static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -2340,8 +2346,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_shares(cfs_rq);
}
- if (!se)
+ if (!se) {
+ update_rq_runnable_avg(rq, rq->nr_running);
inc_nr_running(rq);
+ }
hrtick_update(rq);
}
@@ -2399,8 +2407,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_shares(cfs_rq);
}
- if (!se)
+ if (!se) {
dec_nr_running(rq);
+ update_rq_runnable_avg(rq, 1);
+ }
hrtick_update(rq);
}
@@ -4749,6 +4759,8 @@ void idle_balance(int this_cpu, struct rq *this_rq)
if (this_rq->avg_idle < sysctl_sched_migration_cost)
return;
+ update_rq_runnable_avg(this_rq, 1);
+
/*
* Drop the rq->lock, but keep IRQ/preempt disabled.
*/
@@ -5328,6 +5340,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
cfs_rq = cfs_rq_of(se);
entity_tick(cfs_rq, se, queued);
}
+
+ update_rq_runnable_avg(rq, 1);
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8a2c768..42b6df6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -473,6 +473,8 @@ struct rq {
#ifdef CONFIG_SMP
struct llist_head wake_list;
#endif
+
+ struct sched_avg avg;
};
static inline int cpu_of(struct rq *rq)
For a given task t, we can compute its contribution to load as:
task_load(t) = runnable_avg(t) * weight(t)
On a parenting cfs_rq we can then aggregate
runnable_load(cfs_rq) = \Sum task_load(t), for all runnable children t
Maintain this bottom-up, with task entities adding their contributed load to
the parenting cfs_rq's sum. When a task entity's load changes we add the same
delta to the maintained sum.
Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Ben Segall <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/debug.c | 3 +++
kernel/sched/fair.c | 52 +++++++++++++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 2 ++
4 files changed, 54 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 91599c8..f2999f0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1159,6 +1159,7 @@ struct load_weight {
struct sched_avg {
u64 runnable_avg_sum, runnable_avg_period;
u64 last_runnable_update;
+ unsigned long load_avg_contrib;
};
#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 43e3162..5a55d26 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -93,6 +93,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
P(se->load.weight);
P(se->avg.runnable_avg_sum);
P(se->avg.runnable_avg_period);
+ P(se->avg.load_avg_contrib);
#undef PN
#undef P
}
@@ -222,6 +223,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->load_contribution);
SEQ_printf(m, " .%-30s: %d\n", "load_tg",
atomic_read(&cfs_rq->tg->load_weight));
+ SEQ_printf(m, " .%-30s: %lld\n", "runnable_load_avg",
+ cfs_rq->runnable_load_avg);
#endif
print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8fa199f..bcdad5d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1080,20 +1080,64 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
return decayed;
}
+/* Compute the current contribution to load_avg by se, return any delta */
+static long __update_entity_load_avg_contrib(struct sched_entity *se)
+{
+ long old_contrib = se->avg.load_avg_contrib;
+
+ if (!entity_is_task(se))
+ return 0;
+
+ se->avg.load_avg_contrib = div64_u64(se->avg.runnable_avg_sum *
+ se->load.weight,
+ se->avg.runnable_avg_period + 1);
+
+ return se->avg.load_avg_contrib - old_contrib;
+}
+
/* Update a sched_entity's runnable average */
static inline void update_entity_load_avg(struct sched_entity *se)
{
- __update_entity_runnable_avg(rq_of(cfs_rq_of(se))->clock_task, &se->avg,
- se->on_rq);
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ long contrib_delta;
+
+ if(!__update_entity_runnable_avg(rq_of(cfs_rq)->clock_task, &se->avg,
+ se->on_rq))
+ return;
+
+ contrib_delta = __update_entity_load_avg_contrib(se);
+ if (se->on_rq)
+ cfs_rq->runnable_load_avg += contrib_delta;
}
static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
{
__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
}
+
+/* Add the load generated by se into cfs_rq's child load-average */
+static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
+ struct sched_entity *se)
+{
+ update_entity_load_avg(se);
+ cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+}
+
+/* Remove se's load from this cfs_rq child load-average */
+static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
+ struct sched_entity *se)
+{
+ update_entity_load_avg(se);
+ cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+}
+
#else
static inline void update_entity_load_avg(struct sched_entity *se) {}
static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
+static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
+ struct sched_entity *se) {}
+static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
+ struct sched_entity *se) {}
#endif
static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -1220,7 +1264,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
*/
update_curr(cfs_rq);
update_cfs_load(cfs_rq, 0);
- update_entity_load_avg(se);
+ enqueue_entity_load_avg(cfs_rq, se);
account_entity_enqueue(cfs_rq, se);
update_cfs_shares(cfs_rq);
@@ -1295,7 +1339,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
- update_entity_load_avg(se);
+ dequeue_entity_load_avg(cfs_rq, se);
update_stats_dequeue(cfs_rq, se);
if (flags & DEQUEUE_SLEEP) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 42b6df6..77a3427 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -270,6 +270,8 @@ struct cfs_rq {
u64 load_stamp, load_last, load_unacc_exec_time;
unsigned long load_contribution;
+
+ u64 runnable_load_avg;
#endif /* CONFIG_SMP */
#ifdef CONFIG_CFS_BANDWIDTH
int runtime_enabled;
On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> Mailer:
> StGit/0.15
>
It's broken: every email has the exact same timestamp! This means emails
are sorted by receive order, which doesn't match the actual patch order.
Maybe the new and improved stgit-1.6 has this fixed.
/me goes read more
On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> +/* Precomputed fixed inverse multiplies for multiplication by y^n */
> +const static u32 runnable_avg_yN_inv[] = {
> + 0xffffffff,0xfa83b2db,0xf5257d15,0xefe4b99b,0xeac0c6e7,0xe5b906e7,
> + 0xe0ccdeec,0xdbfbb797,0xd744fcca,0xd2a81d91,0xce248c15,0xc9b9bd86,
> + 0xc5672a11,0xc12c4cca,0xbd08a39f,0xb8fbaf47,0xb504f333,0xb123f581,
> + 0xad583eea,0xa9a15ab4,0xa5fed6a9,0xa2704303,0x9ef53260,0x9b8d39b9,
> + 0x9837f051,0x94f4efa8,0x91c3d373,0x8ea4398b,0x8b95c1e3,0x88980e80,
> + 0x85aac367,0x82cd8698,
> +};
I wonder if modern Intel isn't at the point where computing this thing
is cheaper than the cacheline miss. You can compute y^n in O(log n) time
and with n < 32 that's 5 multiplications (see fixed_power_int). Add to
that the division.
Of course there are platforms, ARM?, where the reverse is likely true. Bugger
that.
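For illustration (not part of the posted series), a standalone sketch of that
O(log n) approach -- exponentiation by squaring over the same 32-bit
fixed-point encoding the table uses; fixed_mul()/fixed_pow() are made-up names:

#include <stdint.h>
#include <stdio.h>

/* multiply two fractions scaled by 2^32 */
static uint32_t fixed_mul(uint32_t a, uint32_t b)
{
	return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* y^n by repeated squaring: at most ~2*log2(n) fixed-point multiplies */
static uint32_t fixed_pow(uint32_t y, unsigned int n)
{
	uint32_t result = 0xffffffff;	/* ~1.0 */

	while (n) {
		if (n & 1)
			result = fixed_mul(result, y);
		y = fixed_mul(y, y);
		n >>= 1;
	}
	return result;
}

int main(void)
{
	/* y = 2^(-1/32) ~= 0xfa83b2db; y^16 should land near 0xb504f333 */
	printf("y^16 ~= %#x\n", fixed_pow(0xfa83b2db, 16));
	return 0;
}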
> +/* Precomputed \Sum y^k { 1<=k<=n } */
> +const static u32 runnable_avg_yN_sum[] = {
> + 0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
> + 9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
> + 17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
> +};
Right, can't see a fast way to compute this..
The asymmetry in the tables annoys me though: 32 vs 33 entries, hex vs
dec :-)
On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> Entities of equal weight should receive equitable distribution of cpu time.
> This is challenging in the case of a task_group's shares as execution may be
> occurring on multiple cpus simultaneously.
>
> To handle this we divide up the shares into weights proportionate with the load
> on each cfs_rq. This does not however, account for the fact that the sum of
> the parts may be less than one cpu and so we need to normalize:
> load(tg) = min(runnable_avg(tg), 1) * tg->shares
> Where runnable_avg is the aggregate time in which the task_group had runnable
> children.
> static inline void __update_group_entity_contrib(struct sched_entity *se)
> {
> struct cfs_rq *cfs_rq = group_cfs_rq(se);
> struct task_group *tg = cfs_rq->tg;
> + int runnable_avg;
>
> se->avg.load_avg_contrib = (cfs_rq->tg_load_contrib * tg->shares);
> se->avg.load_avg_contrib /= atomic64_read(&tg->load_avg) + 1;
> +
> + /*
> + * Unlike a task-entity, a group entity may be using >=1 cpu globally.
> + * However, in the case that it's using <1 cpu we need to form a
> + * correction term so that we contribute the same load as a task of
> + * equal weight. (Global runnable time is taken as a fraction over 2^12.)
> + */
> + runnable_avg = atomic_read(&tg->runnable_avg);
> + if (runnable_avg < (1<<12)) {
> + se->avg.load_avg_contrib *= runnable_avg;
> + se->avg.load_avg_contrib /= (1<<12);
> + }
> }
This seems weird, and the comments don't explain anything.
Ah,.. you can count runnable multiple times (on each cpu), this also
means that the number you're using (when below 1) can still be utter
crap.
Neither the comment nor the changelog mentions this; it should, and it should
also mention why it doesn't matter (does it?).
On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> + /*
> + * Use 1024us as the unit of measurement since it's a reasonable
> + * approximation of 1ms and fast to compute.
> + */
> + delta >>= 10;
ns -> us ?, text talks about ms, slightly confusing
> + if (!delta)
> + return 0;
> + sa->last_runnable_update = now;
> +
> + delta_w = sa->runnable_avg_period % 1024;
so delta_w is the chunk of this p we already accounted.
> + if (delta + delta_w >= 1024) {
if delta pushes us over 1024*1024 ns (~1ms) we roll a window.
> + /* period roll-over */
> + decayed = 1;
> +
> + delta_w = 1024 - delta_w;
The distance we need to reach the next window.
> + BUG_ON(delta_w > delta);
somehow reading this code took forever, this suggests clarification,
either through better variable names or more comments. Could also mean
I'm a moron and should get more sleep or so :-)
On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> Signed-off-by: Paul Turner <[email protected]>
> Signed-off-by: Ben Segall <[email protected]>.
Ben being the last sob would suggest Ben being the one passing the patch
on, and yet I'm getting it from pjt. Also note the trailing dot on Ben's
line.
On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> +static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
> + long load_contrib)
> +{
> + if (likely(load_contrib < cfs_rq->blocked_load_avg))
> + cfs_rq->blocked_load_avg -= load_contrib;
> + else
> + cfs_rq->blocked_load_avg = 0;
> +}
> +
> /* Update a sched_entity's runnable average */
> -static inline void update_entity_load_avg(struct sched_entity *se)
> +static inline void update_entity_load_avg(struct sched_entity *se,
> + int update_cfs_rq)
> {
> struct cfs_rq *cfs_rq = cfs_rq_of(se);
> long contrib_delta;
> @@ -1106,8 +1130,34 @@ static inline void update_entity_load_avg(struct sched_entity *se)
> return;
>
> contrib_delta = __update_entity_load_avg_contrib(se);
> +
> + if (!update_cfs_rq)
> + return;
> +
> if (se->on_rq)
> cfs_rq->runnable_load_avg += contrib_delta;
> + else
> + subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
> +}
So that last bit is add_blocked_load_contrib(), right?
On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> +static __always_inline u64 decay_load(u64 val, int n)
> +{
> + for (;n && val;n--) {
> + val *= 4008;
> + val >>= 12;
> + }
> +
> + return val;
> +}
> + sa->runnable_avg_sum =
> + decay_load(sa->runnable_avg_sum, 1);
> + sa->runnable_avg_period =
> + decay_load(sa->runnable_avg_period, 1);
Since all you ever seem to do is:
x = decay(x, n);
and frequently run over the line limits it might make sense to either
introduce a CPP helper or make the first argument a pointer and ditch
the return value so we end up with something like:
decay(&x, n);
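A runnable toy version of that suggested shape (pointer argument, no return
value); decay_load() here mirrors the quoted loop, with u64 typedef'd for
userspace, so it is illustrative only, not a patch against the series:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t u64;

/* decay *val in place by n ~1ms periods: multiply by y^n, where y^32 = 1/2 */
static inline void decay_load(u64 *val, int n)
{
	for (; n && *val; n--) {
		*val *= 4008;
		*val >>= 12;
	}
}

int main(void)
{
	u64 sum = 47000;

	decay_load(&sum, 32);		/* 32 periods ~= one sched_period */
	printf("sum = %llu\n", (unsigned long long)sum);	/* ~23.4k, roughly halved */
	return 0;
}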
On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
> {
> - __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
> + __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable,
> + runnable);
It's not immediately obvious why we use @runnable for @running.
On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> + if (se->avg.decay_count) {
> + /*
> + * In a wake-up migration we have to approximate the
> + * time sleeping.
> + */
> + se->avg.last_runnable_update -= (-se->avg.decay_count)
> + << 20;
> + update_entity_load_avg(se, 0);
> + }
That comment wants more why and how. I think I remember pjt telling me
about this last week, but those are already vague memories, I'll be sure
to be completely clueless in another week.
On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> u64 last_runnable_update, decay_count;
> + /* we track migrations using entity decay_count == 0 */
> + if (unlikely(se->avg.decay_count <= 0)) {
!sleep dequeue isn't migrate only, also, <= 0 on an unsigned is weird.
On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> struct sched_avg {
> u64 runnable_avg_sum, runnable_avg_period;
> u64 last_runnable_update, decay_count;
> + u32 usage_avg_sum;
4 byte hole
> unsigned long load_avg_contrib;
>
> int contributes_blocked_load;
>};
A) Running/Runnable
- uses last_runnable_update to track
runnable_avg_sum, usage_avg_sum and runnable_avg_period
B) Blocked
- uses decay_count to keep track of sleeptime
C) 'Migrate'
- uses contributes_blocked_load to check if we actually did migrate?
So there's a number of things I don't get, why have
remove_task_load_avg_async() in set_task_rq()? enqueue_entity_load_avg()
can do the __synchronize_entity_decay() thing (and actually does) just
fine, right?
Similarly, why not use last_runnable_update (as last_update) to track
both A and B since a task cannot be in both states at the same time
anyway. If you always update the timestamp you can use it to either
track runnable/usage or decay for sleep.
That would get rid of decay_count and contributes_blocked_load, no?
On Wed, 01 Feb 2012 17:38:26 -0800, Paul Turner <[email protected]> wrote:
> +static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> + int force_update)
> +{
> + struct task_group *tg = cfs_rq->tg;
> + s64 tg_contrib;
> +
> + tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
> + tg_contrib -= cfs_rq->tg_load_contrib;
> +
> + if (force_update || abs64(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
>
Not obvious to me where this 8 is coming from?
Regards
Nikunj
On Fri, Feb 17, 2012 at 1:07 AM, Nikunj A Dadhania
<[email protected]> wrote:
> On Wed, 01 Feb 2012 17:38:26 -0800, Paul Turner <[email protected]> wrote:
>> Hi all,
>>
>> The attached series is an RFC on implementing load tracking at the entity
>> instead of cfs_rq level. This results in a bottom-up load-computation in which
>> entities contribute to their parents load, as opposed to the current top-down
>> where the parent averages its children. ?In particular this allows us to
>> correctly migrate load with their accompanying entities and provides the
>> necessary inputs for intelligent load-balancing and power-management.
>>
>> It was previously well tested and stable, but that was on v3.1-; there's been
>> some fairly extensive changes in the wake-up path since so apologies if anything
>> was broken in the rebase.Note also, since this is also an RFC on the approach I
>> have not yet de-linted the various CONFIG combinations for introduced compiler
>> errors.
>>
>
> I gave a quick run to this series, and it seems the fairness across
> taskgroups is broken with this.
>
> Test setup:
> Machine : IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU with 8
> core, 64GB RAM, 16 cpu.
>
> Create 3 taskgroups: fair16, fair32 and fair48 having 16, 32 and 48
> cpu-hog tasks respectively. They have equal shares(default 1024), so
> they should consume roughly the same time.
>
> 120secs run 1:
> Time consumed by fair16 cgroup: 712912 Tasks: 16
> Time consumed by fair32 cgroup: 650977 Tasks: 32
> Time consumed by fair48 cgroup: 575681 Tasks: 48
>
> 120secs run 2:
> Time consumed by fair16 cgroup: 686295 Tasks: 16
> Time consumed by fair32 cgroup: 643474 Tasks: 32
> Time consumed by fair48 cgroup: 611415 Tasks: 48
>
> 600secs run 1:
> Time consumed by fair16 cgroup: 4109678 Tasks: 16
> Time consumed by fair32 cgroup: 1743983 Tasks: 32
> Time consumed by fair48 cgroup: 3759826 Tasks: 48
>
> 600secs run 2:
> Time consumed by fair16 cgroup: 3893365 Tasks: 16
> Time consumed by fair32 cgroup: 3028280 Tasks: 32
> Time consumed by fair48 cgroup: 2692001 Tasks: 48
>
This is almost certainly a result of me twiddling with the weight in
calc_cfs_shares (using average instead of instantaneous weight) in
this version -- see patch 11/14. While this had some nice stability
properties it was not hot for fairness so I've since reverted it
(snippet attached below).
While it's hard to guarantee it was exactly this since I'm carrying a
few other minor edits in preparation for the next version, the current
results for the next version of this series look like:
8-core:
Starting task group fair16...done
Starting task group fair32...done
Starting task group fair48...done
Waiting for the task to run for 120 secs
Interpreting the results. Please wait....
Time consumed by fair16 cgroup: 316985 Tasks: 16
Time consumed by fair32 cgroup: 320274 Tasks: 32
Time consumed by fair48 cgroup: 320811 Tasks: 48
24-core:
Starting task group fair16...done
Starting task group fair32...done
Starting task group fair48...done
Waiting for the task to run for 120 secs
Interpreting the results. Please wait....
Time consumed by fair16 cgroup: 12628615 Tasks: 96
Time consumed by fair32 cgroup: 12562859 Tasks: 192
Time consumed by fair48 cgroup: 12600364 Tasks: 288
These results are stable and consistent.
As soon as I finish working through Peter's comments I'll upload a
pre-posting so you can re-test if you'd like. Expect a formal
(non-RFC) posting with other nits such as detangling tracking from
FAIR_GROUP_SCHED so that we may use it more comprehensively following
that within the next week or so.
Thanks,
- Paul
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -823,7 +823,7 @@ static inline long calc_tg_weight(struct
task_group *tg, struct cfs_rq *cfs_rq)
*/
tg_weight = atomic64_read(&tg->load_avg);
tg_weight -= cfs_rq->tg_load_contrib;
- tg_weight += cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
+ tg_weight += cfs_rq->load.weight;
return tg_weight;
}
@@ -833,7 +833,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq,
struct task_group *tg)
long tg_weight, load, shares;
tg_weight = calc_tg_weight(tg, cfs_rq);
- load = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
+ load = cfs_rq->load.weight;
> As you can see there is a lot of variance in the above results.
>
> wo patches
> 120secs run 1:
> Time consumed by fair16 cgroup: 667644 Tasks: 16
> Time consumed by fair32 cgroup: 653771 Tasks: 32
> Time consumed by fair48 cgroup: 624915 Tasks: 48
>
> 600secs run 1:
> Time consumed by fair16 cgroup: 3278425 Tasks: 16
> Time consumed by fair32 cgroup: 3140335 Tasks: 32
> Time consumed by fair48 cgroup: 3198817 Tasks: 48
>
> Regards
> Nikunj
>
On Thu, Feb 16, 2012 at 8:41 PM, Nikunj A Dadhania
<[email protected]> wrote:
> On Wed, 01 Feb 2012 17:38:26 -0800, Paul Turner <[email protected]> wrote:
>> +static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
>> +                                              int force_update)
>> +{
>> +       struct task_group *tg = cfs_rq->tg;
>> +       s64 tg_contrib;
>> +
>> +       tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
>> +       tg_contrib -= cfs_rq->tg_load_contrib;
>> +
>> +       if (force_update || abs64(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
>>
> Not obvious to me where this 8 is coming from?
>
It's arbitrary, it requires a change in load contrib by more than
1/8th -- or 12.5% -- of the contrib we're advertising globally before
we pay the cost of an update in the non-forced case.
We used the same trick in the previous shares tracking code since we
did not have a natural rate limit on the update rate. While this is
not as much of an issue in the new code, it does not seem to be
hurting the accuracy and squashing essentially spurious updates does
not hurt.
- Paul
> Regards
> Nikunj
>
On Thu, Feb 16, 2012 at 5:37 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>>  static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
>>  {
>> -       __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
>> +       __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable,
>> +                                    runnable);
>
> Its not immediately obvious why we use @runnable for @running,
Isn't it? An rq is the root of all scheduling -- if there are any
runnable tasks than one of them better be running when this is called.
I can add a comment to this effect.
On Thu, Feb 16, 2012 at 8:58 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>>  struct sched_avg {
>>         u64 runnable_avg_sum, runnable_avg_period;
>>         u64 last_runnable_update, decay_count;
>> +       u32 usage_avg_sum;
>
> 4 byte hole
>
>>         unsigned long load_avg_contrib;
>>
>>         int contributes_blocked_load;
>>};
>
>
> A) Running/Runnable
>    - uses last_runnable_update to track
>        runnable_avg_sum, usage_avg_sum and runnable_avg_period
Yes, with the runnable average being runnable_avg_sum / runnable_avg_period
>
> B) Blocked
>    - uses decay_count to keep track of sleeptime
Not quite (we could track sleep-time explicitly against clock_task after all).
We need decay_count to synchronize with the decays that entity_tick /
update_blocked_shares charge against cfs_rq->blocked_load_sum.
Since we do not know where we slept relative to the tick, unless we
explicitly track where this occurred we don't know for a "n.y" jiffy
sleep whether the "0.y" fraction included an extra decay. It's
important to get this right to reduce instability, since when the
entity wakes back up we need to charge an equivalent decay against
it's contribution before we can remove it from the blocked_load_sum.
For the common case of a short sleep whether we charge an extra period
(if y included a jiffy edge) or not is significant to stability since
getting it wrong means we leave a little extra load in blocked_load.
Further complicating this we need to be able to do said
synchronization in the annoying cases of wake-up migration (where we
don't hold the lock on previous rq) and 32-bit machines (where
everything sucks).
>
> C) 'Migrate'
>    - uses contributes_blocked_load to check if we actually did migrate?
>
We actually use se->avg.decay_count to check whether we were migrated
(setting it appropriately at migration); contributes_blocked_load is
just a convenience variable to track whether we're a part of
cfs_rq->blocked_load_avg or not.
I have to double-check everything but I think the sign of
se->decay_count can be used equivalently so I'll likely just remove it
in favor of that. Although in order to keep things readable I'll
replace it with a similarly named function that does this sign check.
> So there's a number of things I don't get, why have
> remove_task_load_avg_async() in set_task_rq()? enqueue_entity_load_avg()
> can do the __synchronize_entity_decay() thing (and actually does) just
> fine, right?
Wake-up migration is the nuisance here.
In this case we:
A) Don't know where we came from in enqueue_entity_load_avg to
synchronize the decay against
B) Need to be able to remove the blocked load possessed by said waking
entity against its former cfs_rq->blocked_load_sum
There's a one or two other paths (than ttwu) that this matters for,
using set_task_rq() gets them all.
>
> Similarly, why not use last_runnable_update (as last_update) to track
> both A and B since a task cannot be in both states at the same time
> anyway. If you always update the timestamp you can use it to either
> track runnable/usage or decay for sleep.
This is pretty much exactly what we do (use it to track both A and B);
modulo the complications surrounding decay for sleep explained above
and below.
There are two things at play here:
1) An entity's runnable_avg := runnable_avg_sum / runnable_avg_period
2) An entity's load contribution := runnable_avg_sum /
runnable_avg_period * entity_weight
The load contribution is what we added to blocked_load_sum. When we
wake back up we synchronize decays against this and remove it from the
sum. However, decay into this is really just an imperfect
approximation (alignment of when we actually slept vs the tick).
As soon as it's actually running again we:
- Adjust our load contrib by decay so that we can remove the right
amount from blocked_load_sum
- Charge the time since last_runnable_update as not running
- Compute a new, *accurate* load_contrib from (1) above, throwing away
the decayed previous one
- Add the new accurate calculation into the running load sum
The nice thing is that when we start this cycle again when the entity
next blocks we don't compound the losses of precision due to aligning
decay vs ticks since we get to throw away that number and do an
accurate accounting based on runnable_avg again.
>
> That would get rid of decay_count and contributes_blocked_load, no?
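As a concrete illustration of (1) and (2) above, a toy userspace calculation
(not kernel code, and the numbers are made up):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* an entity that has been runnable ~80% of its tracked history */
	uint64_t runnable_avg_sum = 800 * 1024;
	uint64_t runnable_avg_period = 1000 * 1024;
	uint64_t weight = 1024;			/* nice-0 task */

	/* (1) the runnable average as a fraction of tracked time */
	double runnable_avg = (double)runnable_avg_sum / runnable_avg_period;

	/* (2) the load it contributes to its parenting cfs_rq */
	uint64_t load_contrib = runnable_avg_sum * weight /
				(runnable_avg_period + 1);

	/* prints: runnable_avg = 0.80, load_avg_contrib = 819 */
	printf("runnable_avg = %.2f, load_avg_contrib = %llu\n",
	       runnable_avg, (unsigned long long)load_contrib);
	return 0;
}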
On Thu, Feb 16, 2012 at 8:02 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>
>>         u64 last_runnable_update, decay_count;
>
>> +       /* we track migrations using entity decay_count == 0 */
>> +       if (unlikely(se->avg.decay_count <= 0)) {
>
> !sleep dequeue isn't migrate only,
Well, they mostly are these days; but in the == 0 case we're either a
load-balancer migration or someone doing a dequeue/enqueue pair on an
entity around some sort of update.
The key is that when it is an load-balancer move we'll resync
appropriately on enqueue (which we need to do). We will essentially
sync with ourselves in the other cases, but they have no bearing on
why we do this.
> also, <= 0 on an unsigned is weird.
Yeah that should be a s64 not u64 (fixed).
>
On Wed, Feb 15, 2012 at 3:37 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>> +       /*
>> +        * Use 1024us as the unit of measurement since it's a reasonable
>> +        * approximation of 1ms and fast to compute.
>> +        */
>> +       delta >>= 10;
>
> ns -> us ?, text talks about ms, slightly confusing
Yes, comment was incorrect; I'd noticed this also on a re-read -- it's fixed.
>
>> +       if (!delta)
>> +               return 0;
>> +       sa->last_runnable_update = now;
>> +
>> +       delta_w = sa->runnable_avg_period % 1024;
>
> so delta_w is the chunk of this p we already accounted.
>
>> +       if (delta + delta_w >= 1024) {
>
> if delta pushes us over 1024*1024 ns (~1ms) we roll a window.
>
>> +               /* period roll-over */
>> +               decayed = 1;
>> +
>> +               delta_w = 1024 - delta_w;
>
> The distance we need to reach the next window.
>
>> +               BUG_ON(delta_w > delta);
>
> somehow reading this code took forever, this suggests clarification,
> either through better variable names or more comments. Could also mean
> I'm a moron and should get more sleep or so :-)
No... this bit is definitely fiddly, I definitely got it wrong more
than once writing it down. And then got it again wrong later when
"optimizing" :-).
It deserves more comments, I will add.
On Thu, Feb 16, 2012 at 5:27 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>> +static __always_inline u64 decay_load(u64 val, int n)
>> +{
>> +       for (;n && val;n--) {
>> +               val *= 4008;
>> +               val >>= 12;
>> +       }
>> +
>> +       return val;
>> +}
>
>> +                       sa->runnable_avg_sum =
>> +                               decay_load(sa->runnable_avg_sum, 1);
>> +                       sa->runnable_avg_period =
>> +                               decay_load(sa->runnable_avg_period, 1);
>
> Since all you ever seem to do is:
>
>  x = decay(x, n);
>
> and frequently run over the line limits it might make sense to either
> introduce a CPP helper or make the first argument a pointer and ditch
> the return value so we end up with something like:
>
>  decay(&x, n);
Ah-ha, good idea. I had done some other fiddles to try and make
things fit better, but this will work.
Thanks!
On Thu, Feb 16, 2012 at 4:25 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>> +static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
>> +                                                  long load_contrib)
>> +{
>> +       if (likely(load_contrib < cfs_rq->blocked_load_avg))
>> +               cfs_rq->blocked_load_avg -= load_contrib;
>> +       else
>> +               cfs_rq->blocked_load_avg = 0;
>> +}
>> +
>>  /* Update a sched_entity's runnable average */
>> -static inline void update_entity_load_avg(struct sched_entity *se)
>> +static inline void update_entity_load_avg(struct sched_entity *se,
>> +                                          int update_cfs_rq)
>>  {
>>         struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>         long contrib_delta;
>> @@ -1106,8 +1130,34 @@ static inline void update_entity_load_avg(struct sched_entity *se)
>>                 return;
>>
>>         contrib_delta = __update_entity_load_avg_contrib(se);
>> +
>> +       if (!update_cfs_rq)
>> +               return;
>> +
>>         if (se->on_rq)
>>                 cfs_rq->runnable_load_avg += contrib_delta;
>> +       else
>> +               subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
>> +}
>
> So that last bit is add_blocked_load_contrib(), right?
Yes, although contrib_delta is signed.
I suppose this looks a little funny but there's a good explanation:
When adding or removing a contribution to blocked_load_sum we have to
be careful since rounding errors may mean the delta to our
contribution and the delta to our portion of the blocked_load_sum may
be off by a few bits. Being low is fine since it's constantly
decaying to zero so any error term is not long for this world -- but
we do want to make sure we don't underflow in the other direction.
This means any time we remove we have to do the whole
"if (likely(load_contrib < cfs_rq->blocked_load_avg))
cfs_rq->blocked_load_avg -= load_contrib;
else
cfs_rq->blocked_load_avg = 0"
thing.
Typically we only care about doing this on removing load (e.g. a local
wake-up), since we have no error going in the opposite direction, so
we end up with subtract_blocked_load_contrib to handle the first case.
Coming back to the use of it here:
Since contrib_delta is on a freshly computed load_contribution we have
to take the same care as when we are removing; but luckily we already
have subtract_blocked_load_contrib from dealing with that everywhere
else. So we just re-use it and flip the sign.
We could call it add everywhere else (and again flip the sign) but
that's less intuitive since we really are only looking to (safely) remove in
those cases.
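In other words, a toy standalone version of the clamped subtract with the sign
flipped to "add" safely (simplified signature and types, not the kernel code):

#include <stdio.h>

typedef long long s64;

/* clamp at zero so rounding error can never underflow the blocked sum */
static void subtract_blocked_load_contrib(s64 *blocked_load_avg, s64 load_contrib)
{
	if (load_contrib < *blocked_load_avg)
		*blocked_load_avg -= load_contrib;
	else
		*blocked_load_avg = 0;
}

int main(void)
{
	s64 blocked = 1000;
	s64 contrib_delta = 40;		/* a blocked entity's contrib grew by 40 */

	/* "adding" to the blocked sum by subtracting the negated delta */
	subtract_blocked_load_contrib(&blocked, -contrib_delta);
	printf("blocked = %lld\n", blocked);	/* -> 1040 */
	return 0;
}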
On Wed, Feb 15, 2012 at 3:36 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>> Entities of equal weight should receive equitable distribution of cpu time.
>> This is challenging in the case of a task_group's shares as execution may be
>> occurring on multiple cpus simultaneously.
>>
>> To handle this we divide up the shares into weights proportionate with the load
>> on each cfs_rq.  This does not however, account for the fact that the sum of
>> the parts may be less than one cpu and so we need to normalize:
>>   load(tg) = min(runnable_avg(tg), 1) * tg->shares
>> Where runnable_avg is the aggregate time in which the task_group had runnable
>> children.
>
>
>>  static inline void __update_group_entity_contrib(struct sched_entity *se)
>>  {
>>         struct cfs_rq *cfs_rq = group_cfs_rq(se);
>>         struct task_group *tg = cfs_rq->tg;
>> +       int runnable_avg;
>>
>>         se->avg.load_avg_contrib = (cfs_rq->tg_load_contrib * tg->shares);
>>         se->avg.load_avg_contrib /= atomic64_read(&tg->load_avg) + 1;
>> +
>> +       /*
>> +        * Unlike a task-entity, a group entity may be using >=1 cpu globally.
>> +        * However, in the case that it's using <1 cpu we need to form a
>> +        * correction term so that we contribute the same load as a task of
>> +        * equal weight. (Global runnable time is taken as a fraction over 2^12.)
>> +        */
>> +       runnable_avg = atomic_read(&tg->runnable_avg);
>> +       if (runnable_avg < (1<<12)) {
>> +               se->avg.load_avg_contrib *= runnable_avg;
>> +               se->avg.load_avg_contrib /= (1<<12);
>> +       }
>> ?}
>
> This seems weird, and the comments don't explain anything.
>
So consider (on cpu_i):
      R_i
     /   \
   t1     A_i
            \
            t2
Where weight(t1) = shares(A) = 1024
Now, if t1 is runnable 80% of the time we'd charge 80% of its weight
to R in the load-avg, or about 820
If t2, in A, is runnable the exact same proportion of time, then it
should make the same contribution to R.
*But* A is a group entity, so its contribution is then its load weight
taken as a fraction of the task_group's. The catch is that if t2 is the only
task running then this is 1024/1024 * 1024 = 1024 != 820, we've lost
the fact that t2 in A was only actually runnable 80% of the time. We
need to perform the same normalization against how much it actually
runs.. BUT
We can't just use A_i's runnable, since group entities are per-cpu and
thus completely wrong as t2 moves around. We also can't easily carry
t2's contribution to A_i's runnability since there's no way to pull
the sum apart or account it into the new parenting tree efficiently.
But what we can do is aggregate an estimation of whether the group as
a whole would contribute its full shares value if placed only on one
cpu and adjust appropriately.
I'll try to paraphrase the above into a more useful comment explaining this.
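Putting numbers on that example, a toy userspace sketch of the correction
using the 2^12 scaling from the quoted hunk (all values here are made up):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t tg_shares = 1024;
	uint64_t tg_load_contrib = 1024;	/* this cpu holds all of the group's load */
	uint64_t tg_load_avg = 1024;
	/* t2 runnable 80% of the time => group runnable_avg ~= 0.8 * 2^12 */
	unsigned int runnable_avg = (1 << 12) * 8 / 10;

	/* uncorrected: ~1023, i.e. full shares despite only 80% runnability */
	uint64_t contrib = tg_load_contrib * tg_shares / (tg_load_avg + 1);

	if (runnable_avg < (1 << 12))
		contrib = contrib * runnable_avg / (1 << 12);

	/* prints 818 here, in the same ballpark as the ~820 a task would charge */
	printf("group entity contrib = %llu\n", (unsigned long long)contrib);
	return 0;
}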
> Ah,.. you can count runnable multiple times (on each cpu), this also
> means that the number you're using (when below 1) can still be utter
> crap.
So, as an estimate it has the nice property that it's always a lower
bound on the true value.
Consider the two cases for runnable contributions across a pair of cpus:
Either they are disjoint by wall time, in which case they would have
accounted for the same amount were they actually on the same cpu.
Or they overlap, in which case that over-lap period would be an
additional run-delay incurred on one of them.
Then, the maximum amount we understate for a given overlap is n(n+1)/2
* k where k is the width of the overlap and n is the number of
cpus we over-lapped on.
But then, n is bounded by num_cpus -- so on a small machine our error
is bounded (e.g. we can be no more than 1/3 less than true on 2
cpus). On a larger machine this term can be larger, however, if k is
of consequential size we know we accounted at least n*k so we're still
going to approach the ceiling of 1 very quickly at which point it
doesn't matter that we under-estimated. The flip side on a larger
machine is that if k is of inconsequential size then n*2 * k is still
tiny relative to the size of the machine and the accuracy of the
approximation matters less.
>
> Neither the comment nor the changelog mention this, it should, it should
> also mention why it doesn't matter (does it?).
It doesn't and it should. Although I'll take the liberty of shortening
it a little to something like "unfortunately we cannot compute
runnable_avg(tg) directly, however, XXX is a reasonable
approximation."
On Thu, 2012-02-16 at 00:36 +0100, Peter Zijlstra wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> > Entities of equal weight should receive equitable distribution of cpu time.
> > This is challenging in the case of a task_group's shares as execution may be
> > occurring on multiple cpus simultaneously.
> >
> > To handle this we divide up the shares into weights proportionate with the load
> > on each cfs_rq. This does not however, account for the fact that the sum of
> > the parts may be less than one cpu and so we need to normalize:
> > load(tg) = min(runnable_avg(tg), 1) * tg->shares
> > Where runnable_avg is the aggregate time in which the task_group had runnable
> > children.
>
>
> > static inline void __update_group_entity_contrib(struct sched_entity *se)
> > {
> > struct cfs_rq *cfs_rq = group_cfs_rq(se);
> > struct task_group *tg = cfs_rq->tg;
> > + int runnable_avg;
> >
> > se->avg.load_avg_contrib = (cfs_rq->tg_load_contrib * tg->shares);
> > se->avg.load_avg_contrib /= atomic64_read(&tg->load_avg) + 1;
> > +
> > + /*
> > + * Unlike a task-entity, a group entity may be using >=1 cpu globally.
> > + * However, in the case that it's using <1 cpu we need to form a
> > + * correction term so that we contribute the same load as a task of
> > + * equal weight. (Global runnable time is taken as a fraction over 2^12.)
> > + */
> > + runnable_avg = atomic_read(&tg->runnable_avg);
> > + if (runnable_avg < (1<<12)) {
> > + se->avg.load_avg_contrib *= runnable_avg;
> > + se->avg.load_avg_contrib /= (1<<12);
> > + }
> > }
>
> This seems weird, and the comments don't explain anything.
>
> Ah,.. you can count runnable multiple times (on each cpu), this also
> means that the number you're using (when below 1) can still be utter
> crap.
>
> Neither the comment nor the changelog mention this, it should, it should
> also mention why it doesn't matter (does it?).
Since we don't know when we were runnable in the window, we can take our
runnable fraction as a flat probability distribution over the entire
window.
The combined answer we're looking for is what fraction of time was any
of our cpus running.
Take p_i to be the runnable probability of cpu i, then the probability
that both cpu0 and cpu1 were runnable is pc_0,1 = p_0 * p_1, so the
probability that either was running is p_01 = p_0 + p_1 - pc_0,1.
The 3 cpu case becomes when was either cpu01 or cpu2 running, yielding
the iteration: p_012 = p_01 + p_2 - pc_01,2.
p_012 = p_0 + p_1 + p_2 - (p_0 * p_1 + (p_0 + p_1 - p_0 * p_1) * p_2)
Now for small values of p our combined/corrective term is small, since
its a product of small, which is smaller, however it becomes more
dominant the nearer we get to 1.
Since its more likely to get near to 1 the more CPUs we have, I'm not
entirely convinced we can ignore it.
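Spelling that iteration out, a toy calculation assuming the per-cpu runnable
fractions are independent (the values are arbitrary):

#include <stdio.h>

int main(void)
{
	/* per-cpu probabilities that the group was runnable on that cpu */
	double p[] = { 0.30, 0.25, 0.40 };
	int n = sizeof(p) / sizeof(p[0]);
	double sum = 0.0, combined = 0.0;

	for (int i = 0; i < n; i++) {
		sum += p[i];
		/* p_0..i = p_0..i-1 + p_i - p_0..i-1 * p_i */
		combined = combined + p[i] - combined * p[i];
	}

	/* prints: sum = 0.950, combined = 0.685 */
	printf("sum = %.3f, combined = %.3f\n", sum, combined);
	return 0;
}

The plain per-cpu sum over-counts the overlap; the corrective product term is
what grows as the p_i approach 1.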
On Mon, Feb 6, 2012 at 12:48 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>> +/* Precomputed fixed inverse multiplies for multiplication by y^n */
>> +const static u32 runnable_avg_yN_inv[] = {
>> +       0xffffffff,0xfa83b2db,0xf5257d15,0xefe4b99b,0xeac0c6e7,0xe5b906e7,
>> +       0xe0ccdeec,0xdbfbb797,0xd744fcca,0xd2a81d91,0xce248c15,0xc9b9bd86,
>> +       0xc5672a11,0xc12c4cca,0xbd08a39f,0xb8fbaf47,0xb504f333,0xb123f581,
>> +       0xad583eea,0xa9a15ab4,0xa5fed6a9,0xa2704303,0x9ef53260,0x9b8d39b9,
>> +       0x9837f051,0x94f4efa8,0x91c3d373,0x8ea4398b,0x8b95c1e3,0x88980e80,
>> +       0x85aac367,0x82cd8698,
>> +};
>
> I wonder if modern Intel isn't at the point where computing this thing
> is cheaper than the cacheline miss. You can compute y^n in O(log n) time
> and with n < 32 that's 5 multiplications (see fixed_power_int). Add to
> that the division.
>
> Of course there's platforms, ARM?, where reverse is likely true. Bugger
> that.
>
>> +/* Precomputed \Sum y^k { 1<=k<=n } */
>> +const static u32 runnable_avg_yN_sum[] = {
>> +           0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
>> +        9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
>> +       17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
>> +};
>
> Right, can't see a fast way to compute this..
>
> The asymmetry in the tables annoys me though 32 vs 33 entries,
We could shrink yN sum to 32 entries but then
__compute_runnable_contrib() ends up a lot uglier with a bunch of
conditionals about the n=0 case. (and if we did that the same
argument could then be made for making runnable_avg_yN_inv 31
entries.. at which point we're asymmetric again.. ahhhhh!! must go
sort my pencils..)
>hex vs dec :-)
>
So I personally think hex versus dec is somewhat reasonable in that:
\Sum y^n converges to a/(1-r) -- about ~47k -- which is a very human
readable number. It's easy to do the mental math on what a given
entry in that table is going to do to your runnable_avg / runnable_sum
from just looking at it.
Conversely, for the inverse multiplies they're the completely
unintelligible ~0*y^k. We could write them in decimal, but the table
would just be enormous.
Since the second case would be enormously ugly, and we already have
precedent for using hex for inverses (wmult), the only unified option is
to use hex for both. But then I personally certainly can't eyeball the
numbers for \Sum y^n in the same fashion. Perhaps that's something we
can afford to lose, though.
On Thu, Feb 16, 2012 at 7:57 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>> +		if (se->avg.decay_count) {
>> +			/*
>> +			 * In a wake-up migration we have to approximate the
>> +			 * time sleeping.
>> +			 */
>> +			se->avg.last_runnable_update -= (-se->avg.decay_count)
>> +							<< 20;
>> +			update_entity_load_avg(se, 0);
>> +		}
>
> That comment wants more why and how. I think I remember pjt telling me
> about this last week, but those are already vague memories; I'll be sure
> to be completely clueless in another week.
The full reason is short enough to practically supplant the existing
comment: we can't synchronize clock_task between our old cpu and our new
one [we don't have the locks, and 32-bit says you can't have nice
things], so we use our carried decays (which we can grab atomically for
exactly this reason) to approximate the time slept.
I'll update appropriately.
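Something along these lines, presumably (the code is the quoted hunk
unchanged; only the comment is expanded per the explanation above):

		if (se->avg.decay_count) {
			/*
			 * In a wake-up migration we have to approximate the
			 * time we slept: we cannot synchronize clock_task
			 * against the old rq (we don't hold its lock, and on
			 * 32-bit reading its clock isn't atomic), but the
			 * decay periods we carried over can be read
			 * atomically, so charge those as the elapsed time
			 * (each period is 1 << 20 ns, i.e. 1024us).
			 */
			se->avg.last_runnable_update -= (-se->avg.decay_count)
								<< 20;
			update_entity_load_avg(se, 0);
		}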
On Fri, 17 Feb 2012 02:48:06 -0800, Paul Turner <[email protected]> wrote:
>
> This is almost certainly a result of me twiddling with the weight in
> calc_cfs_shares (using average instead of instantaneous weight) in
> this version -- see patch 11/14. While this had some nice stability
> properties it was not hot for fairness so I've since reverted it
> (snippet attached below).
>
For my understanding, what do you mean by stability here?
>
> 24-core:
> Starting task group fair16...done
> Starting task group fair32...done
> Starting task group fair48...done
> Waiting for the task to run for 120 secs
> Interpreting the results. Please wait....
> Time consumed by fair16 cgroup: 12628615 Tasks: 96
> Time consumed by fair32 cgroup: 12562859 Tasks: 192
> Time consumed by fair48 cgroup: 12600364 Tasks: 288
>
"Tasks:" should be 16,32,48?
Regards,
Nikunj
On Fri, 2012-02-17 at 04:32 -0800, Paul Turner wrote:
> > Neither the comment nor the changelog mention this; they should, and they
> > should also mention why it doesn't matter (does it?).
>
> It doesn't and it should. Although I'll take the liberty of shortening
> it a little to something
For the in-code comment that's fine; it would be good for the changelog
to have the entire story though.
> like "unfortunately we cannot compute
> runnable_avg(tg) directly, however, XXX is a reasonable
> approximation."
Yeah, not easily done indeed; you could compute a corrective term if
you had something like the avg and variance of runnable over the
various CPUs, but those too suck to have to compute.
On Fri, 2012-02-17 at 02:54 -0800, Paul Turner wrote:
> On Thu, Feb 16, 2012 at 5:37 AM, Peter Zijlstra <[email protected]> wrote:
> > On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> >> static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
> >> {
> >> - __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
> >> + __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable,
> >> + runnable);
> >
> > It's not immediately obvious why we use @runnable for @running,
>
> Isn't it? An rq is the root of all scheduling -- if there are any
> runnable tasks then one of them had better be running when this is called.
>
> I can add a comment to this effect.
Ah, indeed. Yes a comment would be good ;-)
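(Presumably something like the below; the comment wording is a guess at
what Paul offered to add:)

static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
{
	/*
	 * The rq is the root of the scheduling hierarchy: if anything on it
	 * is runnable then, by definition, something is also running whenever
	 * this is called, so @runnable doubles as @running.
	 */
	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable,
				     runnable);
}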
On Fri, 2012-02-17 at 03:32 -0800, Paul Turner wrote:
> Since we do not know where we slept relative to the tick, unless we
> explicitly track where that occurred we don't know, for an "n.y" jiffy
> sleep, whether the "0.y" fraction included an extra decay.
but isn't that the same problem you have in
__update_entity_runnable_avg() where you use ->runnable_avg_period %
1024 ?
Also, shouldn't __synchronize_entity_decay() update
->runnable_avg_period? Surely when you wake up your window fraction
could be different than when you went to sleep?
> It's
> important to get this right to reduce instability, since when the
> entity wakes back up we need to charge an equivalent decay against
> its contribution before we can remove it from the blocked_load_sum.
>
> For the common case of a short sleep, whether or not we charge an extra
> period (if the 0.y fraction included a jiffy edge) is significant to
> stability, since getting it wrong means we leave a little extra load in
> blocked_load.
Sure..
> Further complicating this we need to be able to do said
> synchronization in the annoying cases of wake-up migration (where we
> don't hold the lock on previous rq) and 32-bit machines (where
> everything sucks).
Ah, wouldn't task_waking_fair() be a better place though?
> >
> > C) 'Migrate'
> > - uses contributes_blocked_load to check if we actually did migrate?
> >
>
> We actually use se->avg.decay_count to check whether we were migrated
> (setting it appropriately at migration); contributes_blocked_load is
> just a convenience variable to track whether we're a part of
> cfs_rq->blocked_load_avg or not.
If you use task_waking_fair() you can also use ENQUEUE_WAKING.
> There are one or two other paths (besides ttwu) that this matters for;
> using set_task_rq() gets them all.
What paths are those? Having a cfs hook in the generic code there isn't
pretty and suggests funny things could happen when PI gets done.
One more thing though: check what it all does when you run with the
'default' jiffies-based sched_clock(). Nobody sane should use that, but
then there are bound to be people who disagree with that :-)
I suspect it will all more or less work out, but someone should verify
that it actually does. It would be such a shame to have to revert this
because we broke something daft like m68k or so ;-)
On Mon, Feb 20, 2012 at 1:41 AM, Nikunj A Dadhania
<[email protected]> wrote:
> On Fri, 17 Feb 2012 02:48:06 -0800, Paul Turner <[email protected]> wrote:
>>
>> This is almost certainly a result of me twiddling with the weight in
>> calc_cfs_shares (using average instead of instantaneous weight) in
>> this version -- see patch 11/14.  While this had some nice stability
>> properties it was not hot for fairness so I've since reverted it
>> (snippet attached below).
>>
> For my understanding, what do you mean by stability here?
The result is stable, meaning that repeating the experiment returns the
same number.
>
>>
>> 24-core:
>> Starting task group fair16...done
>> Starting task group fair32...done
>> Starting task group fair48...done
>> Waiting for the task to run for 120 secs
>> Interpreting the results. Please wait....
>> Time consumed by fair16 cgroup:  12628615 Tasks: 96
>> Time consumed by fair32 cgroup:  12562859 Tasks: 192
>> Time consumed by fair48 cgroup:  12600364 Tasks: 288
>>
> "Tasks:" should be 16,32,48?
>
Ah, I ran your script multiple times (for stability) above; it must
not have been killing itself properly (notice each of those numbers is
the respective per-run task count times 6).
A correct first run on a 24-core looks like:
Starting task group fair16...done
Starting task group fair32...done
Starting task group fair48...done
Waiting for the task to run for 120 secs
Interpreting the results. Please wait....
Time consumed by fair16 cgroup: 1332211 Tasks: 16
Time consumed by fair32 cgroup: 1227356 Tasks: 32
Time consumed by fair48 cgroup: 1217174 Tasks: 48
The small boost to the tasks=16 case is almost certainly tied to our
current handling of sleeper credit and entity placement -- since there
are fewer tasks than cores, whenever a task moves to a core it has not
previously been executing on, it gets a vruntime boost.
Thanks,
- Paul
> Regards,
> Nikunj
>
u64/u64 divisions caused an undefined reference to `__aeabi_uldivmod' on ARM.
Convert them to div_u64() calls.
---
kernel/sched/fair.c | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)
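(The pattern being applied is roughly the following -- illustrative only,
not an actual hunk from the patch, and the function name is made up. Note
that div_u64() takes a u32 divisor, while div64_u64() exists for a
genuinely 64-bit one.)

#include <linux/math64.h>

static u64 window_avg(u64 sum, u32 period)
{
	/*
	 * An open-coded 'sum / period' with 64-bit operands pulls in
	 * __aeabi_uldivmod on 32-bit ARM; the math64.h helpers avoid that.
	 */
	return div_u64(sum, period);
}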
Hi Paul,
The patch series was failing to build for ARM; with these small fixes it
builds, and it appears to work, at least on panda where I tested it.
It also didn't build when CGROUPs were not configured, so there's a fix
for that as well.
Regards
-- Pantelis
Fix compilation when CGROUPs are not configured.
---
kernel/sched/fair.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)
On Wed, Feb 29, 2012 at 10:37:38AM +0000, Pantelis Antoniou wrote:
> @@ -1110,9 +1110,9 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
> struct cfs_rq *cfs_rq = group_cfs_rq(se);
> struct task_group *tg = cfs_rq->tg;
> int runnable_avg;
> + u64 contrib;
>
> - se->avg.load_avg_contrib = (cfs_rq->tg_load_contrib * tg->shares);
> - se->avg.load_avg_contrib /= atomic64_read(&tg->load_avg) + 1;
It seems that contrib is never assigned?
Fix:
+ contrib = (cfs_rq->tg_load_contrib * tg->shares);
> + se->avg.load_avg_contrib = div_u64(contrib, atomic64_read(&tg->load_avg) + 1);
Regards,
Morten
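(Putting the two lines together, the corrected hunk presumably ends up as
below -- keeping div_u64() as in the patch; div64_u64() would be needed if
tg->load_avg can exceed 32 bits:)

	struct cfs_rq *cfs_rq = group_cfs_rq(se);
	struct task_group *tg = cfs_rq->tg;
	u64 contrib;

	contrib = cfs_rq->tg_load_contrib * tg->shares;
	se->avg.load_avg_contrib = div_u64(contrib,
					   atomic64_read(&tg->load_avg) + 1);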
On Feb 28, 2012, at 7:45 PM, Morten Rasmussen wrote:
> On Wed, Feb 29, 2012 at 10:37:38AM +0000, Pantelis Antoniou wrote:
>> @@ -1110,9 +1110,9 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
>> struct cfs_rq *cfs_rq = group_cfs_rq(se);
>> struct task_group *tg = cfs_rq->tg;
>> int runnable_avg;
>> + u64 contrib;
>>
>> - se->avg.load_avg_contrib = (cfs_rq->tg_load_contrib * tg->shares);
>> - se->avg.load_avg_contrib /= atomic64_read(&tg->load_avg) + 1;
>
> It seems that contrib is never assigned?
> Fix:
> + contrib = (cfs_rq->tg_load_contrib * tg->shares);
>
>> + se->avg.load_avg_contrib = div_u64(contrib, atomic64_read(&tg->load_avg) + 1);
>
> Regards,
> Morten
>
Hmm, yeah,
2 line patch and brown paper bag time.
Thanks
-- Pantelis