2016-04-28 10:49:03

by Yuyang Du

Subject: [PATCH 0/6] Optimize sched averages computation

I started to optimize __update_load_avg() for the flat util hierarchy
implementation. Since the work has already been started, let me finish it.

The flat util hierarchy is not in this patchset. I am still pondering whether
to add a sched_avg to the rq to do it, or to simply update the cfs_rq util
only when we update the top cfs_rq (Dietmar and Vincent took this approach).
I think this needs some experimentation.

Yuyang Du (6):
sched/fair: Optimize sum computation with a lookup table
sched/fair: Rename variable names for sched averages
sched/fair: Change the variable to hold the number of periods to
32bit integer
sched/fair: Add __always_inline compiler attribute to
__accumulate_sum()
sched/fair: Optimize __update_sched_avg()
documentation: Add scheduler/sched-avg.txt

Documentation/scheduler/sched-avg.txt | 160 +++++++++++++++
kernel/sched/fair.c | 352 +++++++++++++++++----------------
2 files changed, 339 insertions(+), 173 deletions(-)
create mode 100644 Documentation/scheduler/sched-avg.txt

--
1.7.9.5


2016-04-28 10:38:31

by Yuyang Du

Subject: [PATCH 4/6] sched/fair: Add __always_inline compiler attribute to __accumulate_sum()

Every other function on this path has it. If code size is not a concern,
__accumulate_sum() should have it too.

Signed-off-by: Yuyang Du <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index abfe17a..486f098 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2649,7 +2649,7 @@ static __always_inline u64 __decay_sum(u64 val, u32 n)
* We can compute this efficiently by combining:
* y^32 = 1/2 with precomputed \Sum 1024*y^n (where n < 32)
*/
-static u32 __accumulate_sum(u32 n)
+static __always_inline u32 __accumulate_sum(u32 n)
{
u32 contrib = 0;

--
1.7.9.5

2016-04-28 10:38:25

by Yuyang Du

Subject: [PATCH 2/6] sched/fair: Rename variable names for sched averages

The names of the sched averages (including load_avg and util_avg) have been
added and changed over the past couple of years, and some of them are a bit
confusing, especially to people reading them for the first time. This patch
attempts to make the names more self-explanatory, and updates some comments
as well.

Signed-off-by: Yuyang Du <[email protected]>
---
kernel/sched/fair.c | 192 ++++++++++++++++++++++++++-------------------------
1 file changed, 98 insertions(+), 94 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e0eec0..8d49276 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -660,13 +660,15 @@ static int select_idle_sibling(struct task_struct *p, int cpu);
static unsigned long task_h_load(struct task_struct *p);

/*
- * We choose a half-life close to 1 scheduling period.
- * Note: The tables runnable_avg_yN_inv and runnable_avg_yN_sum are
- * dependent on this value.
+ * Note: everything in sched average calculation, including
+ * __decay_inv_multiply_N, __accumulated_sum_N, __accumulated_sum_N32,
+ * SCHED_AVG_MAX, and SCHED_AVG_MAX_N are all dependent on and only on
+ * (1) exponential decay, (2) a period of 1024*1024ns (~1ms), and (3)
+ * a half-life of 32 periods.
*/
-#define LOAD_AVG_PERIOD 32
-#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
-#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_AVG_MAX */
+#define SCHED_AVG_HALFLIFE 32 /* number of periods as a half-life */
+#define SCHED_AVG_MAX 47742 /* maximum possible sched avg */
+#define SCHED_AVG_MAX_N 345 /* number of full periods to produce SCHED_AVG_MAX */

/* Give new sched_entity start runnable values to heavy its load in infant time */
void init_entity_runnable_average(struct sched_entity *se)
@@ -681,7 +683,7 @@ void init_entity_runnable_average(struct sched_entity *se)
*/
sa->period_contrib = 1023;
sa->load_avg = scale_load_down(se->load.weight);
- sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
+ sa->load_sum = sa->load_avg * SCHED_AVG_MAX;
/*
* At this point, util_avg won't be used in select_task_rq_fair anyway
*/
@@ -731,7 +733,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
} else {
sa->util_avg = cap;
}
- sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
+ sa->util_sum = sa->util_avg * SCHED_AVG_MAX;
}
}

@@ -1834,7 +1836,7 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
*period = now - p->last_task_numa_placement;
} else {
delta = p->se.avg.load_sum / p->se.load.weight;
- *period = LOAD_AVG_MAX;
+ *period = SCHED_AVG_MAX;
}

p->last_sum_exec_runtime = runtime;
@@ -2583,7 +2585,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)

#ifdef CONFIG_SMP
/* Precomputed fixed inverse multiplies for multiplication by y^n */
-static const u32 runnable_avg_yN_inv[] = {
+static const u32 __decay_inv_multiply_N[] = {
0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6,
0xe0ccdeeb, 0xdbfbb796, 0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46, 0xb504f333, 0xb123f581,
@@ -2596,7 +2598,7 @@ static const u32 runnable_avg_yN_inv[] = {
* Precomputed \Sum y^k { 1<=k<=n }. These are floor(true_value) to prevent
* over-estimates when re-combining.
*/
-static const u32 runnable_avg_yN_sum[] = {
+static const u32 __accumulated_sum_N[] = {
0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
@@ -2612,16 +2614,18 @@ static const u32 __accumulated_sum_N32[] = {
};

/*
- * Approximate:
- * val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
+ * val * y^n, where y^m ~= 0.5
+ *
+ * n is the number of periods past; a period is ~1ms
+ * m is called half-life in exponential decay; here it is SCHED_AVG_HALFLIFE=32.
*/
-static __always_inline u64 decay_load(u64 val, u64 n)
+static __always_inline u64 __decay_sum(u64 val, u64 n)
{
unsigned int local_n;

if (!n)
return val;
- else if (unlikely(n > LOAD_AVG_PERIOD * 63))
+ else if (unlikely(n > SCHED_AVG_HALFLIFE * 63))
return 0;

/* after bounds checking we can collapse to 32-bit */
@@ -2634,36 +2638,36 @@ static __always_inline u64 decay_load(u64 val, u64 n)
*
* To achieve constant time decay_load.
*/
- if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
- val >>= local_n / LOAD_AVG_PERIOD;
- local_n %= LOAD_AVG_PERIOD;
+ if (unlikely(local_n >= SCHED_AVG_HALFLIFE)) {
+ val >>= local_n / SCHED_AVG_HALFLIFE;
+ local_n %= SCHED_AVG_HALFLIFE;
}

- val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
+ val = mul_u64_u32_shr(val, __decay_inv_multiply_N[local_n], 32);
return val;
}

/*
- * For updates fully spanning n periods, the contribution to runnable
- * average will be: \Sum 1024*y^n
+ * For updates fully spanning n periods, the accumulated contribution
+ * will be: \Sum 1024*y^n.
*
- * We can compute this reasonably efficiently by combining:
- * y^PERIOD = 1/2 with precomputed \Sum 1024*y^n {for n <PERIOD}
+ * We can compute this efficiently by combining:
+ * y^32 = 1/2 with precomputed \Sum 1024*y^n (where n < 32)
*/
-static u32 __compute_runnable_contrib(u64 n)
+static u32 __accumulate_sum(u64 n)
{
u32 contrib = 0;

- if (likely(n <= LOAD_AVG_PERIOD))
- return runnable_avg_yN_sum[n];
- else if (unlikely(n >= LOAD_AVG_MAX_N))
- return LOAD_AVG_MAX;
+ if (likely(n <= SCHED_AVG_HALFLIFE))
+ return __accumulated_sum_N[n];
+ else if (unlikely(n >= SCHED_AVG_MAX_N))
+ return SCHED_AVG_MAX;

- /* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
- contrib = __accumulated_sum_N32[n>>5]; /* =n/LOAD_AVG_PERIOD */
- n %= LOAD_AVG_PERIOD;
- contrib = decay_load(contrib, n);
- return contrib + runnable_avg_yN_sum[n];
+ /* Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11 */
+ contrib = __accumulated_sum_N32[n>>5]; /* =n/SCHED_AVG_HALFLIFE */
+ n %= SCHED_AVG_HALFLIFE;
+ contrib = __decay_sum(contrib, n);
+ return contrib + __accumulated_sum_N[n];
}

#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
@@ -2673,35 +2677,35 @@ static u32 __compute_runnable_contrib(u64 n)
#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)

/*
- * We can represent the historical contribution to runnable average as the
- * coefficients of a geometric series. To do this we sub-divide our runnable
- * history into segments of approximately 1ms (1024us); label the segment that
- * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
+ * We can represent the historical contribution to sched average as the
+ * coefficients of a geometric series. To do this we divide the history
+ * into segments of approximately 1ms (1024*1024ns); label the segment that
+ * occurred N-1024us ago p_N, with p_0 corresponding to the current period, e.g.
*
* [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
* p0 p1 p2
* (now) (~1ms ago) (~2ms ago)
*
- * Let u_i denote the fraction of p_i that the entity was runnable.
+ * Let u_i denote the fraction of p_i whose state (runnable/running) we count.
*
* We then designate the fractions u_i as our co-efficients, yielding the
- * following representation of historical load:
+ * following representation of a sched metric:
* u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
*
- * We choose y based on the with of a reasonably scheduling period, fixing:
- * y^32 = 0.5
+ * We choose y based on a half-life of 32 periods (which is ~32ms):
+ * y^32 = 0.5 => y = (0.5)^(1/32)
*
- * This means that the contribution to load ~32ms ago (u_32) will be weighted
- * approximately half as much as the contribution to load within the last ms
- * (u_0).
+ * where 32 is the number of periods that a past period's contribution is
+ * halved. This means that the impact of a period every ~32ms ago will be
+ * as much as 50% of the previous value.
*
* When a period "rolls over" and we have new u_0`, multiplying the previous
* sum again by y is sufficient to update:
- * load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
- * = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
+ * avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
+ * = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
*/
static __always_inline int
-__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
+__update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
unsigned long weight, int running, struct cfs_rq *cfs_rq)
{
u64 delta, scaled_delta, periods;
@@ -2762,15 +2766,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
periods = delta / 1024;
delta %= 1024;

- sa->load_sum = decay_load(sa->load_sum, periods + 1);
+ sa->load_sum = __decay_sum(sa->load_sum, periods + 1);
if (cfs_rq) {
cfs_rq->runnable_load_sum =
- decay_load(cfs_rq->runnable_load_sum, periods + 1);
+ __decay_sum(cfs_rq->runnable_load_sum, periods + 1);
}
- sa->util_sum = decay_load((u64)(sa->util_sum), periods + 1);
+ sa->util_sum = __decay_sum((u64)(sa->util_sum), periods + 1);

/* Efficiently calculate \sum (1..n_period) 1024*y^i */
- contrib = __compute_runnable_contrib(periods);
+ contrib = __accumulate_sum(periods);
contrib = cap_scale(contrib, scale_freq);
if (weight) {
sa->load_sum += weight * contrib;
@@ -2794,12 +2798,12 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
sa->period_contrib += delta;

if (decayed) {
- sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
+ sa->load_avg = div_u64(sa->load_sum, SCHED_AVG_MAX);
if (cfs_rq) {
cfs_rq->runnable_load_avg =
- div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
+ div_u64(cfs_rq->runnable_load_sum, SCHED_AVG_MAX);
}
- sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
+ sa->util_avg = sa->util_sum / SCHED_AVG_MAX;
}

return decayed;
@@ -2867,7 +2871,7 @@ void set_task_rq_fair(struct sched_entity *se,
p_last_update_time = prev->avg.last_update_time;
n_last_update_time = next->avg.last_update_time;
#endif
- __update_load_avg(p_last_update_time, cpu_of(rq_of(prev)),
+ __update_sched_avg(p_last_update_time, cpu_of(rq_of(prev)),
&se->avg, 0, 0, NULL);
se->avg.last_update_time = n_last_update_time;
}
@@ -2879,7 +2883,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) {}
static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);

/* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
-static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
+static inline int update_cfs_rq_sched_avg(u64 now, struct cfs_rq *cfs_rq)
{
struct sched_avg *sa = &cfs_rq->avg;
int decayed, removed = 0;
@@ -2887,17 +2891,17 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
if (atomic_long_read(&cfs_rq->removed_load_avg)) {
s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
sa->load_avg = max_t(long, sa->load_avg - r, 0);
- sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+ sa->load_sum = max_t(s64, sa->load_sum - r * SCHED_AVG_MAX, 0);
removed = 1;
}

if (atomic_long_read(&cfs_rq->removed_util_avg)) {
long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
sa->util_avg = max_t(long, sa->util_avg - r, 0);
- sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
+ sa->util_sum = max_t(s32, sa->util_sum - r * SCHED_AVG_MAX, 0);
}

- decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
+ decayed = __update_sched_avg(now, cpu_of(rq_of(cfs_rq)), sa,
scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);

#ifndef CONFIG_64BIT
@@ -2909,7 +2913,7 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
}

/* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct sched_entity *se, int update_tg)
+static inline void update_sched_avg(struct sched_entity *se, int update_tg)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
u64 now = cfs_rq_clock_task(cfs_rq);
@@ -2920,11 +2924,11 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
* Track task load average for carrying it to new CPU after migrated, and
* track group sched_entity load average for task_h_load calc in migration
*/
- __update_load_avg(now, cpu, &se->avg,
+ __update_sched_avg(now, cpu, &se->avg,
se->on_rq * scale_load_down(se->load.weight),
cfs_rq->curr == se, NULL);

- if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
+ if (update_cfs_rq_sched_avg(now, cfs_rq) && update_tg)
update_tg_load_avg(cfs_rq, 0);

if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
@@ -2951,7 +2955,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
}
}

-static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void attach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
if (!sched_feat(ATTACH_AGE_LOAD))
goto skip_aging;
@@ -2961,7 +2965,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
* have aged the average right before clearing @last_update_time.
*/
if (se->avg.last_update_time) {
- __update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
+ __update_sched_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
&se->avg, 0, 0, NULL);

/*
@@ -2978,9 +2982,9 @@ skip_aging:
cfs_rq->avg.util_sum += se->avg.util_sum;
}

-static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void detach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- __update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
+ __update_sched_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
&se->avg, se->on_rq * scale_load_down(se->load.weight),
cfs_rq->curr == se, NULL);

@@ -2992,7 +2996,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s

/* Add the load generated by se into cfs_rq's load average */
static inline void
-enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+enqueue_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct sched_avg *sa = &se->avg;
u64 now = cfs_rq_clock_task(cfs_rq);
@@ -3000,18 +3004,18 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)

migrated = !sa->last_update_time;
if (!migrated) {
- __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
+ __update_sched_avg(now, cpu_of(rq_of(cfs_rq)), sa,
se->on_rq * scale_load_down(se->load.weight),
cfs_rq->curr == se, NULL);
}

- decayed = update_cfs_rq_load_avg(now, cfs_rq);
+ decayed = update_cfs_rq_sched_avg(now, cfs_rq);

cfs_rq->runnable_load_avg += sa->load_avg;
cfs_rq->runnable_load_sum += sa->load_sum;

if (migrated)
- attach_entity_load_avg(cfs_rq, se);
+ attach_entity_sched_avg(cfs_rq, se);

if (decayed || migrated)
update_tg_load_avg(cfs_rq, 0);
@@ -3019,9 +3023,9 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)

/* Remove the runnable load generated by se from cfs_rq's runnable load average */
static inline void
-dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+dequeue_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- update_load_avg(se, 1);
+ update_sched_avg(se, 1);

cfs_rq->runnable_load_avg =
max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
@@ -3054,7 +3058,7 @@ static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
* Task first catches up with cfs_rq, and then subtract
* itself from the cfs_rq (task must be off the queue now).
*/
-void remove_entity_load_avg(struct sched_entity *se)
+void remove_entity_sched_avg(struct sched_entity *se)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
u64 last_update_time;
@@ -3068,7 +3072,7 @@ void remove_entity_load_avg(struct sched_entity *se)

last_update_time = cfs_rq_last_update_time(cfs_rq);

- __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
+ __update_sched_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
}
@@ -3087,17 +3091,17 @@ static int idle_balance(struct rq *this_rq);

#else /* CONFIG_SMP */

-static inline void update_load_avg(struct sched_entity *se, int update_tg) {}
+static inline void update_sched_avg(struct sched_entity *se, int update_tg) {}
static inline void
-enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+enqueue_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
static inline void
-dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
-static inline void remove_entity_load_avg(struct sched_entity *se) {}
+dequeue_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+static inline void remove_entity_sched_avg(struct sched_entity *se) {}

static inline void
-attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+attach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
static inline void
-detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+detach_entity_sched_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}

static inline int idle_balance(struct rq *rq)
{
@@ -3257,7 +3261,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (renorm && !curr)
se->vruntime += cfs_rq->min_vruntime;

- enqueue_entity_load_avg(cfs_rq, se);
+ enqueue_entity_sched_avg(cfs_rq, se);
account_entity_enqueue(cfs_rq, se);
update_cfs_shares(cfs_rq);

@@ -3336,7 +3340,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
- dequeue_entity_load_avg(cfs_rq, se);
+ dequeue_entity_sched_avg(cfs_rq, se);

if (schedstat_enabled())
update_stats_dequeue(cfs_rq, se, flags);
@@ -3416,7 +3420,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
if (schedstat_enabled())
update_stats_wait_end(cfs_rq, se);
__dequeue_entity(cfs_rq, se);
- update_load_avg(se, 1);
+ update_sched_avg(se, 1);
}

update_stats_curr_start(cfs_rq, se);
@@ -3520,7 +3524,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
/* Put 'current' back into the tree. */
__enqueue_entity(cfs_rq, prev);
/* in !on_rq case, update occurred at dequeue */
- update_load_avg(prev, 0);
+ update_sched_avg(prev, 0);
}
cfs_rq->curr = NULL;
}
@@ -3536,7 +3540,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
/*
* Ensure that runnable average is periodically updated.
*/
- update_load_avg(curr, 1);
+ update_sched_avg(curr, 1);
update_cfs_shares(cfs_rq);

#ifdef CONFIG_SCHED_HRTICK
@@ -4409,7 +4413,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_throttled(cfs_rq))
break;

- update_load_avg(se, 1);
+ update_sched_avg(se, 1);
update_cfs_shares(cfs_rq);
}

@@ -4469,7 +4473,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_throttled(cfs_rq))
break;

- update_load_avg(se, 1);
+ update_sched_avg(se, 1);
update_cfs_shares(cfs_rq);
}

@@ -5321,7 +5325,7 @@ static void migrate_task_rq_fair(struct task_struct *p)
* will result in the wakee task is less decayed, but giving the wakee more
* load sounds not bad.
*/
- remove_entity_load_avg(&p->se);
+ remove_entity_sched_avg(&p->se);

/* Tell new CPU we are migrated */
p->se.avg.last_update_time = 0;
@@ -5332,7 +5336,7 @@ static void migrate_task_rq_fair(struct task_struct *p)

static void task_dead_fair(struct task_struct *p)
{
- remove_entity_load_avg(&p->se);
+ remove_entity_sched_avg(&p->se);
}
#endif /* CONFIG_SMP */

@@ -6213,7 +6217,7 @@ static void update_blocked_averages(int cpu)
if (throttled_hierarchy(cfs_rq))
continue;

- if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq))
+ if (update_cfs_rq_sched_avg(cfs_rq_clock_task(cfs_rq), cfs_rq))
update_tg_load_avg(cfs_rq, 0);
}
raw_spin_unlock_irqrestore(&rq->lock, flags);
@@ -6274,7 +6278,7 @@ static inline void update_blocked_averages(int cpu)

raw_spin_lock_irqsave(&rq->lock, flags);
update_rq_clock(rq);
- update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
+ update_cfs_rq_sched_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
raw_spin_unlock_irqrestore(&rq->lock, flags);
}

@@ -8302,7 +8306,7 @@ static void detach_task_cfs_rq(struct task_struct *p)
}

/* Catch up with the cfs_rq and remove our load when we leave */
- detach_entity_load_avg(cfs_rq, se);
+ detach_entity_sched_avg(cfs_rq, se);
}

static void attach_task_cfs_rq(struct task_struct *p)
@@ -8319,7 +8323,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
#endif

/* Synchronize task with its cfs_rq */
- attach_entity_load_avg(cfs_rq, se);
+ attach_entity_sched_avg(cfs_rq, se);

if (!vruntime_normalized(p))
se->vruntime += cfs_rq->min_vruntime;
@@ -8458,7 +8462,7 @@ void unregister_fair_sched_group(struct task_group *tg)

for_each_possible_cpu(cpu) {
if (tg->se[cpu])
- remove_entity_load_avg(tg->se[cpu]);
+ remove_entity_sched_avg(tg->se[cpu]);

/*
* Only empty task groups can be destroyed; so we can speculatively
--
1.7.9.5

2016-04-28 10:38:35

by Yuyang Du

Subject: [PATCH 6/6] documentation: Add scheduler/sched-avg.txt

This documentation file contains the programs used to generate the constants
for computing the sched averages.

Signed-off-by: Yuyang Du <[email protected]>
---
Documentation/scheduler/sched-avg.txt | 160 +++++++++++++++++++++++++++++++++
1 file changed, 160 insertions(+)
create mode 100644 Documentation/scheduler/sched-avg.txt

diff --git a/Documentation/scheduler/sched-avg.txt b/Documentation/scheduler/sched-avg.txt
new file mode 100644
index 0000000..55fe900
--- /dev/null
+++ b/Documentation/scheduler/sched-avg.txt
@@ -0,0 +1,160 @@
+==============================================================
+ C program:
+==============================================================
+
+#include <math.h>
+#include <stdio.h>
+
+#define HALFLIFE 32
+#define SHIFT 32
+
+/*
+print " #: yN_inv yN_sum"
+print "-----------------------"
+y = (0.5)**(1/32.0)
+x = 2**32
+xx = 1024
+for i in range(0, 32):
+ if i == 0:
+ x = x-1
+ xx = xx*y
+ else:
+ x = x*y
+ xx = int(xx*y + 1024*y)
+ print "%2d: %#x %8d" % (i, int(x), int(xx))
+
+print " #: sum_N32"
+print "------------"
+xxx = xx
+for i in range(0, 11):
+ if i == 0:
+ xxx = xx
+ else:
+ xxx = xxx/2 + xx
+ print "%2d: %8d" % (i, xxx)
+*/
+
+double y;
+
+void calc_decay_inv_multiply() {
+ int i;
+ unsigned int x;
+
+ printf("static const u32 __decay_inv_multiply_N[] = {");
+ for(i = 0; i < HALFLIFE; i++) {
+ x = ((1UL<<32)-1)*pow(y, i);
+
+ if (i % 6 == 0) printf("\n\t");
+ printf("0x%8x, ", x);
+ }
+ printf("\n};\n\n");
+}
+
+int sum = 1024;
+void calc_accumulated_sum() {
+ int i;
+
+ printf("static const u32 __accumulated_sum_N[] = {\n\t 0,");
+ for(i = 1; i <= HALFLIFE; i++) {
+ if (i == 1)
+ sum *= y;
+ else
+ sum = sum*y + 1024*y;
+
+ if (i % 11 == 0) printf("\n\t");
+ printf("%5d,", sum);
+ }
+ printf("\n};\n\n");
+}
+
+int n = 1;
+/* first period */
+long max = 1024;
+
+void calc_converged_max() {
+ long last = 0, y_inv = ((1UL<<32)-1)*y;
+
+ for (; ; n++) {
+ if (n > 1)
+ max = ((max*y_inv)>>SHIFT) + 1024;
+ /* This is the same as */
+ //max = max*y + 1024;
+
+ if (last == max)
+ break;
+
+ last = max;
+ }
+ n--;
+ printf("#define SCHED_AVG_HALFLIFE %d\n", HALFLIFE);
+ printf("#define SCHED_AVG_MAX %ld\n", max);
+ printf("#define SCHED_AVG_MAX_N %d\n\n", n);
+}
+
+void calc_accumulated_sum_32() {
+ int i, x = sum;
+
+ printf("static const u32 __accumulated_sum_N32[] = {\n\t 0,");
+ for(i = 1; i <= n/HALFLIFE+1; i++) {
+ if (i > 1)
+ x = x/2 + sum;
+
+ if (i % 6 == 0) printf("\n\t");
+ printf("%6d,", x);
+ }
+ printf("\n};\n\n");
+}
+
+void main() {
+ y = pow(0.5, 1/(double)HALFLIFE);
+
+ calc_decay_inv_multiply();
+ calc_accumulated_sum();
+ calc_converged_max();
+ calc_accumulated_sum_32();
+}
+
+==============================================================
+ Python script if you speak snake
+==============================================================
+
+#!/usr/bin/env python
+
+print " #: yN_inv yN_sum"
+print "-----------------------"
+y = (0.5)**(1/32.0)
+x = 2**32
+xx = 1024
+for i in range(0, 32):
+ if i == 0:
+ x = x-1
+ xx = xx*y
+ else:
+ x = x*y
+ xx = int(xx*y + 1024*y)
+ print "%2d: %#x %8d" % (i, int(x), int(xx))
+
+print
+print " #: sum_N32"
+print "------------"
+xxx = xx
+for i in range(0, 11):
+ if i == 0:
+ xxx = xx
+ else:
+ xxx = xxx/2 + xx
+ print "%2d: %8d" % (i, xxx)
+
+print
+print " n: max"
+print "------------"
+xxxx = 1024
+old = 0
+i = 2
+while (1):
+ xxxx = int(xxxx*y + 1024)
+ if old == xxxx:
+ break
+ i = i+1
+ old = xxxx
+print "%3d: %7d" % (i-1, xxxx)
--
1.7.9.5

2016-04-28 10:43:47

by Yuyang Du

Subject: [PATCH 5/6] sched/fair: Optimize __update_sched_avg()

Previously, __update_sched_avg() had these (full) steps:
(1) add what is left of the last incomplete period
(2) decay the old sum
(3) accumulate the new sum since last_update_time
(4) add the current incomplete period

If we compute (1), (3), and (4) separately, each of them is ugly in code and
costly in overhead. But they actually do the same thing, so we combine them.
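
For illustration, here is a minimal userspace sketch (not part of the patch;
the variable names and example numbers are made up) of how the three parts
are accumulated in one pass: the tail of the last incomplete period is
decayed over all crossed periods, the full periods in between contribute a
plain geometric sum, and the current incomplete period is not decayed at all.

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32);	/* y^32 = 0.5 */
	int periods = 5;		/* example: 5 period boundaries crossed */
	int period_contrib = 200;	/* already-used part of the old period */
	int remainder = 300;		/* used part of the current period */
	double d1, d2 = 0, d3;
	int i;

	d1 = (1024 - period_contrib) * pow(y, periods);	/* step (1), decayed */
	for (i = 1; i < periods; i++)			/* step (3), full periods */
		d2 += 1024 * pow(y, i);
	d3 = remainder;					/* step (4), not decayed */

	printf("accumulated contrib ~= %.0f\n", d1 + d2 + d3);
	return 0;
}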

Signed-off-by: Yuyang Du <[email protected]>
---
kernel/sched/fair.c | 185 +++++++++++++++++++++++++--------------------------
1 file changed, 92 insertions(+), 93 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 486f098..7dba259 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -668,7 +668,7 @@ static unsigned long task_h_load(struct task_struct *p);
*/
#define SCHED_AVG_HALFLIFE 32 /* number of periods as a half-life */
#define SCHED_AVG_MAX 47742 /* maximum possible sched avg */
-#define SCHED_AVG_MAX_N 345 /* number of full periods to produce SCHED_AVG_MAX */
+#define SCHED_AVG_MAX_N 347 /* number of full periods to produce SCHED_AVG_MAX */

/* Give new sched_entity start runnable values to heavy its load in infant time */
void init_entity_runnable_average(struct sched_entity *se)
@@ -2606,7 +2606,7 @@ static const u32 __accumulated_sum_N[] = {

/*
* Precomputed \Sum y^k { 1<=k<=n, where n%32=0). Values are rolled down to
- * lower integers.
+ * lower integers. Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11
*/
static const u32 __accumulated_sum_N32[] = {
0, 23371, 35056, 40899, 43820, 45281,
@@ -2649,20 +2649,31 @@ static __always_inline u64 __decay_sum(u64 val, u32 n)
* We can compute this efficiently by combining:
* y^32 = 1/2 with precomputed \Sum 1024*y^n (where n < 32)
*/
-static __always_inline u32 __accumulate_sum(u32 n)
+static __always_inline u32
+__accumulate_sum(u32 periods, u32 period_contrib, u32 remainder)
{
- u32 contrib = 0;
+ u32 contrib;

- if (likely(n <= SCHED_AVG_HALFLIFE))
- return __accumulated_sum_N[n];
- else if (unlikely(n >= SCHED_AVG_MAX_N))
+ if (!periods)
+ return remainder - period_contrib;
+
+ if (unlikely(periods >= SCHED_AVG_MAX_N))
return SCHED_AVG_MAX;

- /* Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11 */
- contrib = __accumulated_sum_N32[n>>5]; /* =n/SCHED_AVG_HALFLIFE */
- n %= SCHED_AVG_HALFLIFE;
- contrib = __decay_sum(contrib, n);
- return contrib + __accumulated_sum_N[n];
+ remainder += __decay_sum((u64)(1024 - period_contrib), periods);
+
+ periods -= 1;
+ if (likely(periods <= SCHED_AVG_HALFLIFE))
+ contrib = __accumulated_sum_N[periods];
+ else {
+ /*(periods>>5) = (periods/SCHED_AVG_HALFLIFE) */
+ contrib = __accumulated_sum_N32[periods>>5];
+ periods %= SCHED_AVG_HALFLIFE;
+ contrib = __decay_sum(contrib, periods);
+ contrib += __accumulated_sum_N[periods];
+ }
+
+ return contrib + remainder;
}

#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
@@ -2671,6 +2682,55 @@ static __always_inline u32 __accumulate_sum(u32 n)

#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)

+static __always_inline u32 accumulate_sum(u64 delta, struct sched_avg *sa,
+ struct cfs_rq *cfs_rq, int cpu, unsigned long weight, int running)
+{
+ u32 contrib, periods;
+ unsigned long scale_freq, scale_cpu;
+
+ scale_freq = arch_scale_freq_capacity(NULL, cpu);
+ scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+
+ delta += sa->period_contrib;
+ periods = delta >> 10; /* A period is 1024us (~1ms) */
+
+ /*
+ * Accumulating *_sum has two steps.
+ *
+ * Step 1: decay old *_sum if we crossed period boundaries.
+ */
+ if (periods) {
+ sa->load_sum = __decay_sum(sa->load_sum, periods);
+ if (cfs_rq) {
+ cfs_rq->runnable_load_sum =
+ __decay_sum(cfs_rq->runnable_load_sum, periods);
+ }
+ sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);
+ }
+
+ /*
+ * Step 2: accumulate new *_sum since last_update_time. This at most has
+ * three parts (at least one part): (1) remainder of incomplete last
+ * period, (2) full periods since (1), and (3) incomplete current period.
+ *
+ * Fortunately, we can (and should) do all these three at once.
+ */
+ delta %= 1024;
+ contrib = __accumulate_sum(periods, sa->period_contrib, delta);
+ sa->period_contrib = delta;
+
+ contrib = cap_scale(contrib, scale_freq);
+ if (weight) {
+ sa->load_sum += weight * contrib;
+ if (cfs_rq)
+ cfs_rq->runnable_load_sum += weight * contrib;
+ }
+ if (running)
+ sa->util_sum += contrib * scale_cpu;
+
+ return periods;
+}
+
/*
* We can represent the historical contribution to sched average as the
* coefficients of a geometric series. To do this we divide the history
@@ -2701,12 +2761,9 @@ static __always_inline u32 __accumulate_sum(u32 n)
*/
static __always_inline int
__update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
- unsigned long weight, int running, struct cfs_rq *cfs_rq)
+ unsigned long weight, int running, struct cfs_rq *cfs_rq)
{
- u64 delta, scaled_delta;
- u32 contrib, periods;
- unsigned int delta_w, scaled_delta_w, decayed = 0;
- unsigned long scale_freq, scale_cpu;
+ u64 delta;

delta = now - sa->last_update_time;
/*
@@ -2727,85 +2784,27 @@ __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
return 0;
sa->last_update_time = now;

- scale_freq = arch_scale_freq_capacity(NULL, cpu);
- scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
-
- /* delta_w is the amount already accumulated against our next period */
- delta_w = sa->period_contrib;
- if (delta + delta_w >= 1024) {
- decayed = 1;
-
- /* how much left for next period will start over, we don't know yet */
- sa->period_contrib = 0;
-
- /*
- * Now that we know we're crossing a period boundary, figure
- * out how much from delta we need to complete the current
- * period and accrue it.
- */
- delta_w = 1024 - delta_w;
- scaled_delta_w = cap_scale(delta_w, scale_freq);
- if (weight) {
- sa->load_sum += weight * scaled_delta_w;
- if (cfs_rq) {
- cfs_rq->runnable_load_sum +=
- weight * scaled_delta_w;
- }
- }
- if (running)
- sa->util_sum += scaled_delta_w * scale_cpu;
-
- delta -= delta_w;
-
- /*
- * Figure out how many additional periods this update spans.
- * A period is 1024*1024ns or ~1ms, so a 32bit integer can hold
- * approximately a maximum of 49 (=2^32/1000/3600/24) days.
- */
- periods = delta / 1024;
- delta %= 1024;
-
- sa->load_sum = __decay_sum(sa->load_sum, periods + 1);
- if (cfs_rq) {
- cfs_rq->runnable_load_sum =
- __decay_sum(cfs_rq->runnable_load_sum, periods + 1);
- }
- sa->util_sum = __decay_sum((u64)(sa->util_sum), periods + 1);
-
- /* Efficiently calculate \sum (1..n_period) 1024*y^i */
- contrib = __accumulate_sum(periods);
- contrib = cap_scale(contrib, scale_freq);
- if (weight) {
- sa->load_sum += weight * contrib;
- if (cfs_rq)
- cfs_rq->runnable_load_sum += weight * contrib;
- }
- if (running)
- sa->util_sum += contrib * scale_cpu;
- }
-
- /* Remainder of delta accrued against u_0` */
- scaled_delta = cap_scale(delta, scale_freq);
- if (weight) {
- sa->load_sum += weight * scaled_delta;
- if (cfs_rq)
- cfs_rq->runnable_load_sum += weight * scaled_delta;
- }
- if (running)
- sa->util_sum += scaled_delta * scale_cpu;
-
- sa->period_contrib += delta;
+ /*
+ * Now we know we crossed measurement unit boundaries. The *_avg
+ * accrues by two steps:
+ *
+ * Step 1: accumulate *_sum since last_update_time. If we haven't
+ * crossed period boundaries, finish.
+ */
+ if (!accumulate_sum(delta, sa, cfs_rq, cpu, weight, running))
+ return 0;

- if (decayed) {
- sa->load_avg = div_u64(sa->load_sum, SCHED_AVG_MAX);
- if (cfs_rq) {
- cfs_rq->runnable_load_avg =
- div_u64(cfs_rq->runnable_load_sum, SCHED_AVG_MAX);
- }
- sa->util_avg = sa->util_sum / SCHED_AVG_MAX;
+ /*
+ * Step 2: update *_avg.
+ */
+ sa->load_avg = div_u64(sa->load_sum, SCHED_AVG_MAX);
+ if (cfs_rq) {
+ cfs_rq->runnable_load_avg =
+ div_u64(cfs_rq->runnable_load_sum, SCHED_AVG_MAX);
}
+ sa->util_avg = sa->util_sum / SCHED_AVG_MAX;

- return decayed;
+ return 1;
}

#ifdef CONFIG_FAIR_GROUP_SCHED
--
1.7.9.5

2016-04-28 10:47:00

by Yuyang Du

Subject: [PATCH 3/6] sched/fair: Change the variable to hold the number of periods to 32bit integer

A period is about 1ms, so a 32-bit unsigned integer can hold a number of
periods spanning approximately 49 (=2^32/1000/3600/24) days, which is big
enough that a 64-bit integer is unnecessary.
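
As a back-of-the-envelope check (a userspace sketch, not part of the patch):
the 49 days above uses the 1ms approximation; with the exact 1024*1024ns
period the u32 count wraps only after roughly 52 days.

#include <stdio.h>

int main(void)
{
	double period_ns = 1024.0 * 1024.0;		/* one period, exactly */
	double ns = (double)(1ULL << 32) * period_ns;	/* 2^32 periods */

	printf("%.1f days (exact period)\n", ns / 1e9 / 3600 / 24);	/* ~52.1 */
	printf("%.1f days (1ms approx.)\n",
	       (double)(1ULL << 32) / 1000 / 3600 / 24);		/* ~49.7 */
	return 0;
}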

Signed-off-by: Yuyang Du <[email protected]>
---
kernel/sched/fair.c | 27 +++++++++++++--------------
1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8d49276..abfe17a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2619,18 +2619,13 @@ static const u32 __accumulated_sum_N32[] = {
* n is the number of periods past; a period is ~1ms
* m is called half-life in exponential decay; here it is SCHED_AVG_HALFLIFE=32.
*/
-static __always_inline u64 __decay_sum(u64 val, u64 n)
+static __always_inline u64 __decay_sum(u64 val, u32 n)
{
- unsigned int local_n;
-
if (!n)
return val;
else if (unlikely(n > SCHED_AVG_HALFLIFE * 63))
return 0;

- /* after bounds checking we can collapse to 32-bit */
- local_n = n;
-
/*
* As y^PERIOD = 1/2, we can combine
* y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
@@ -2638,12 +2633,12 @@ static __always_inline u64 __decay_sum(u64 val, u64 n)
*
* To achieve constant time decay_load.
*/
- if (unlikely(local_n >= SCHED_AVG_HALFLIFE)) {
- val >>= local_n / SCHED_AVG_HALFLIFE;
- local_n %= SCHED_AVG_HALFLIFE;
+ if (unlikely(n >= SCHED_AVG_HALFLIFE)) {
+ val >>= n / SCHED_AVG_HALFLIFE;
+ n %= SCHED_AVG_HALFLIFE;
}

- val = mul_u64_u32_shr(val, __decay_inv_multiply_N[local_n], 32);
+ val = mul_u64_u32_shr(val, __decay_inv_multiply_N[n], 32);
return val;
}

@@ -2654,7 +2649,7 @@ static __always_inline u64 __decay_sum(u64 val, u64 n)
* We can compute this efficiently by combining:
* y^32 = 1/2 with precomputed \Sum 1024*y^n (where n < 32)
*/
-static u32 __accumulate_sum(u64 n)
+static u32 __accumulate_sum(u32 n)
{
u32 contrib = 0;

@@ -2708,8 +2703,8 @@ static __always_inline int
__update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
unsigned long weight, int running, struct cfs_rq *cfs_rq)
{
- u64 delta, scaled_delta, periods;
- u32 contrib;
+ u64 delta, scaled_delta;
+ u32 contrib, periods;
unsigned int delta_w, scaled_delta_w, decayed = 0;
unsigned long scale_freq, scale_cpu;

@@ -2762,7 +2757,11 @@ __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,

delta -= delta_w;

- /* Figure out how many additional periods this update spans */
+ /*
+ * Figure out how many additional periods this update spans.
+ * A period is 1024*1024ns or ~1ms, so a 32bit integer can hold
+ * approximately a maximum of 49 (=2^32/1000/3600/24) days.
+ */
periods = delta / 1024;
delta %= 1024;

--
1.7.9.5

2016-04-28 10:38:20

by Yuyang Du

Subject: [PATCH 1/6] sched/fair: Optimize sum computation with a lookup table

__compute_runnable_contrib() uses a loop to compute the sum, whereas a
table lookup can do it faster, in constant time.

The program to generate the constants is located at:
Documentation/scheduler/sched-avg.txt
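
For reference, a standalone userspace sketch (not part of the patch) that
reproduces the __accumulated_sum_N32[] table from the same recurrence the
removed loop used: each entry is the previous one halved (since y^32 = 1/2)
plus the full 32-period sum \Sum 1024*y^k (1 <= k <= 32), i.e.
runnable_avg_yN_sum[32] = 23371.

#include <stdio.h>

int main(void)
{
	unsigned int sum_n32[12] = { 0 };	/* sum_n32[0] = 0 */
	const unsigned int y_sum_32 = 23371;	/* runnable_avg_yN_sum[32] */
	int i;

	for (i = 1; i < 12; i++)
		sum_n32[i] = sum_n32[i - 1] / 2 + y_sum_32;

	for (i = 0; i < 12; i++)
		printf("%2d: %5u\n", i, sum_n32[i]);	/* 0, 23371, 35056, ... */

	return 0;
}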

Signed-off-by: Yuyang Du <[email protected]>
Reviewed-by: Morten Rasmussen <[email protected]>
Acked-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b8cc1c3..6e0eec0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2603,6 +2603,15 @@ static const u32 runnable_avg_yN_sum[] = {
};

/*
+ * Precomputed \Sum y^k { 1<=k<=n, where n%32=0). Values are rolled down to
+ * lower integers.
+ */
+static const u32 __accumulated_sum_N32[] = {
+ 0, 23371, 35056, 40899, 43820, 45281,
+ 46011, 46376, 46559, 46650, 46696, 46719,
+};
+
+/*
* Approximate:
* val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
*/
@@ -2650,14 +2659,9 @@ static u32 __compute_runnable_contrib(u64 n)
else if (unlikely(n >= LOAD_AVG_MAX_N))
return LOAD_AVG_MAX;

- /* Compute \Sum k^n combining precomputed values for k^i, \Sum k^j */
- do {
- contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
- contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];
-
- n -= LOAD_AVG_PERIOD;
- } while (n > LOAD_AVG_PERIOD);
-
+ /* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
+ contrib = __accumulated_sum_N32[n>>5]; /* =n/LOAD_AVG_PERIOD */
+ n %= LOAD_AVG_PERIOD;
contrib = decay_load(contrib, n);
return contrib + runnable_avg_yN_sum[n];
}
--
1.7.9.5

2016-04-28 17:26:57

by Benjamin Segall

Subject: Re: [PATCH 1/6] sched/fair: Optimize sum computation with a lookup table

Yuyang Du <[email protected]> writes:

> __compute_runnable_contrib() uses a loop to compute the sum, whereas a
> table lookup can do it faster, in constant time.
>
> The program to generate the constants is located at:
> Documentation/scheduler/sched-avg.txt
>
> Signed-off-by: Yuyang Du <[email protected]>
> Reviewed-by: Morten Rasmussen <[email protected]>
> Acked-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 20 ++++++++++++--------
> 1 file changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b8cc1c3..6e0eec0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2603,6 +2603,15 @@ static const u32 runnable_avg_yN_sum[] = {
> };
>
> /*
> + * Precomputed \Sum y^k { 1<=k<=n, where n%32=0). Values are rolled down to
> + * lower integers.
> + */
> +static const u32 __accumulated_sum_N32[] = {
> + 0, 23371, 35056, 40899, 43820, 45281,
> + 46011, 46376, 46559, 46650, 46696, 46719,
> +};
> +
> +/*
> * Approximate:
> * val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
> */
> @@ -2650,14 +2659,9 @@ static u32 __compute_runnable_contrib(u64 n)
> else if (unlikely(n >= LOAD_AVG_MAX_N))
> return LOAD_AVG_MAX;
>
> - /* Compute \Sum k^n combining precomputed values for k^i, \Sum k^j */
> - do {
> - contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
> - contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];
> -
> - n -= LOAD_AVG_PERIOD;
> - } while (n > LOAD_AVG_PERIOD);
> -
> + /* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
> + contrib = __accumulated_sum_N32[n>>5]; /* =n/LOAD_AVG_PERIOD */

Just write / LOAD_AVG_PERIOD

> + n %= LOAD_AVG_PERIOD;
> contrib = decay_load(contrib, n);
> return contrib + runnable_avg_yN_sum[n];
> }

2016-04-28 17:30:00

by Benjamin Segall

Subject: Re: [PATCH 3/6] sched/fair: Change the variable to hold the number of periods to 32bit integer

Yuyang Du <[email protected]> writes:

> A period is about 1ms, so a 32-bit unsigned integer can hold a number of
> periods spanning approximately 49 (=2^32/1000/3600/24) days, which is big
> enough that a 64-bit integer is unnecessary.
>
If a thread sleeps for 49 days and then wakes up this would be wrong...
but it also would just result in it not being decayed to zero, and even
then only if it was in a very small window, so it doesn't seem like a
huge deal if it happens.
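
(A tiny userspace sketch of that corner case, with made-up numbers: if the
true period count overflows the u32, the truncated count merely under-decays
the old sum instead of zeroing it.)

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32);			/* y^32 = 0.5 */
	unsigned long long true_periods = (1ULL << 32) + 5;	/* ~49+ days asleep */
	unsigned int n = (unsigned int)true_periods;		/* truncates to 5 */
	double old_sum = 47742;			/* some fully built-up sum */

	/* Correct behaviour: true_periods > 32*63, so the sum should decay to 0.
	 * With the truncated count it is only multiplied by y^5 (~0.9). */
	printf("should be 0, actually ~%.0f\n", old_sum * pow(y, n));
	return 0;
}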


> Signed-off-by: Yuyang Du <[email protected]>
> ---
> kernel/sched/fair.c | 27 +++++++++++++--------------
> 1 file changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8d49276..abfe17a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2619,18 +2619,13 @@ static const u32 __accumulated_sum_N32[] = {
> * n is the number of periods past; a period is ~1ms
> * m is called half-life in exponential decay; here it is SCHED_AVG_HALFLIFE=32.
> */
> -static __always_inline u64 __decay_sum(u64 val, u64 n)
> +static __always_inline u64 __decay_sum(u64 val, u32 n)
> {
> - unsigned int local_n;
> -
> if (!n)
> return val;
> else if (unlikely(n > SCHED_AVG_HALFLIFE * 63))
> return 0;
>
> - /* after bounds checking we can collapse to 32-bit */
> - local_n = n;
> -
> /*
> * As y^PERIOD = 1/2, we can combine
> * y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
> @@ -2638,12 +2633,12 @@ static __always_inline u64 __decay_sum(u64 val, u64 n)
> *
> * To achieve constant time decay_load.
> */
> - if (unlikely(local_n >= SCHED_AVG_HALFLIFE)) {
> - val >>= local_n / SCHED_AVG_HALFLIFE;
> - local_n %= SCHED_AVG_HALFLIFE;
> + if (unlikely(n >= SCHED_AVG_HALFLIFE)) {
> + val >>= n / SCHED_AVG_HALFLIFE;
> + n %= SCHED_AVG_HALFLIFE;
> }
>
> - val = mul_u64_u32_shr(val, __decay_inv_multiply_N[local_n], 32);
> + val = mul_u64_u32_shr(val, __decay_inv_multiply_N[n], 32);
> return val;
> }
>
> @@ -2654,7 +2649,7 @@ static __always_inline u64 __decay_sum(u64 val, u64 n)
> * We can compute this efficiently by combining:
> * y^32 = 1/2 with precomputed \Sum 1024*y^n (where n < 32)
> */
> -static u32 __accumulate_sum(u64 n)
> +static u32 __accumulate_sum(u32 n)
> {
> u32 contrib = 0;
>
> @@ -2708,8 +2703,8 @@ static __always_inline int
> __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
> unsigned long weight, int running, struct cfs_rq *cfs_rq)
> {
> - u64 delta, scaled_delta, periods;
> - u32 contrib;
> + u64 delta, scaled_delta;
> + u32 contrib, periods;
> unsigned int delta_w, scaled_delta_w, decayed = 0;
> unsigned long scale_freq, scale_cpu;
>
> @@ -2762,7 +2757,11 @@ __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
>
> delta -= delta_w;
>
> - /* Figure out how many additional periods this update spans */
> + /*
> + * Figure out how many additional periods this update spans.
> + * A period is 1024*1024ns or ~1ms, so a 32bit integer can hold
> + * approximately a maximum of 49 (=2^32/1000/3600/24) days.
> + */
> periods = delta / 1024;
> delta %= 1024;

2016-04-29 02:15:43

by Yuyang Du

Subject: Re: [PATCH 3/6] sched/fair: Change the variable to hold the number of periods to 32bit integer

On Thu, Apr 28, 2016 at 10:29:55AM -0700, [email protected] wrote:
> Yuyang Du <[email protected]> writes:
>
> > A period is about 1ms, so a 32-bit unsigned integer can hold a number of
> > periods spanning approximately 49 (=2^32/1000/3600/24) days, which is big
> > enough that a 64-bit integer is unnecessary.
> >
> If a thread sleeps for 49 days and then wakes up this would be wrong...
> but it also would just result in it not being decayed to zero, and even
> then only if it was in a very small window, so it doesn't seem like a
> huge deal if it happens.

Oh, yeah, and we wouldn't know the task is as sleepy as it really is. :)