2016-04-11 06:18:37

by Yuyang Du

Subject: [PATCH 0/4] Optimize sched avg computation and implement flat util hierarchy

This patch series implements a flat util hierarchy, so that a group
task's migration is reflected in the top-level util immediately.

Simply doing so may increase the overhead of computing the sched averages,
or lose accuracy if only the util of the root cfs_rq is updated.

So, to avoid adding overhead, or even to reduce it, we must optimize
__update_load_avg() substantially. Hopefully, these patches achieve that.

---

Yuyang Du (4):
sched/fair: Optimize sum computation with a lookup table
sched/fair: Drop out incomplete current period when sched averages
accrue
sched/fair: Modify accumulated sums for load/util averages
sched/fair: Implement flat hierarchical structure for util_avg

include/linux/sched.h | 2 +-
kernel/sched/debug.c | 11 +-
kernel/sched/fair.c | 313 ++++++++++++++++++++++++--------------------------
kernel/sched/sched.h | 5 +-
4 files changed, 163 insertions(+), 168 deletions(-)

--
2.1.4


2016-04-11 06:18:39

by Yuyang Du

Subject: [PATCH 1/4] sched/fair: Optimize sum computation with a lookup table

__compute_runnable_contrib() uses a loop to compute the sum, whereas a
table lookup can do it faster, in constant time.

The following python script can be used to generate the constants:

print " #: yN_inv yN_sum"
print "-----------------------"
y = (0.5)**(1/32.0)
x = 2**32
xx = 1024
for i in range(0, 32):
    if i == 0:
        x = x-1
        xx = xx*y
    else:
        x = x*y
        xx = int(xx*y + 1024*y)
    print "%2d: %#x %8d" % (i, int(x), int(xx))

print " #: sum_N32"
print "------------"
xxx = xx
for i in range(0, 11):
    if i == 0:
        xxx = xx
    else:
        xxx = xxx/2 + xx
    print "%2d: %8d" % (i, xxx)

Signed-off-by: Yuyang Du <[email protected]>
Reviewed-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b8cc1c3..6e0eec0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2603,6 +2603,15 @@ static const u32 runnable_avg_yN_sum[] = {
};

/*
+ * Precomputed \Sum y^k { 1<=k<=n, where n%32=0). Values are rolled down to
+ * lower integers.
+ */
+static const u32 __accumulated_sum_N32[] = {
+ 0, 23371, 35056, 40899, 43820, 45281,
+ 46011, 46376, 46559, 46650, 46696, 46719,
+};
+
+/*
* Approximate:
* val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
*/
@@ -2650,14 +2659,9 @@ static u32 __compute_runnable_contrib(u64 n)
else if (unlikely(n >= LOAD_AVG_MAX_N))
return LOAD_AVG_MAX;

- /* Compute \Sum k^n combining precomputed values for k^i, \Sum k^j */
- do {
- contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
- contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];
-
- n -= LOAD_AVG_PERIOD;
- } while (n > LOAD_AVG_PERIOD);
-
+ /* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
+ contrib = __accumulated_sum_N32[n>>5]; /* =n/LOAD_AVG_PERIOD */
+ n %= LOAD_AVG_PERIOD;
contrib = decay_load(contrib, n);
return contrib + runnable_avg_yN_sum[n];
}
--
2.1.4

2016-04-11 06:18:43

by Yuyang Du

Subject: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue

In __update_load_avg(), the current period is never complete. This
basically leads to a slightly over-decayed average: say on average we
have 50% of the current period, then we lose 1.08% (= 1 - 0.5^(1/64))
of the past avg. More importantly, the incomplete current period
significantly complicates the avg computation, even though a full
period is only about 1ms.

So we attempt to drop it. The outcome is that for any x.y periods to
update, we will either lose the .y period or unduly gain the (1 - .y)
period. How big is the impact? For a large x (say 20ms), you barely
notice the difference, which is plus/minus 1% (= (before - after)/before).
Moreover, the aggregated losses and gains should statistically even out
in the long run.
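
For reference, a tiny Python check of the 1.08% figure quoted above
(an editor's illustration, not part of the patch):

y = 0.5 ** (1 / 32.0)        # per-period decay factor, y^32 == 1/2

# Losing half of the current period on average over-decays the past
# average by an extra factor of y^(1/2) == 0.5^(1/64):
print(1 - y ** 0.5)          # ~0.0108, i.e. the ~1.08% quoted above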

Signed-off-by: Yuyang Du <[email protected]>
---
include/linux/sched.h | 2 +-
kernel/sched/fair.c | 170 +++++++++++++++++++-------------------------------
2 files changed, 66 insertions(+), 106 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 45e848c..c5948e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1214,7 +1214,7 @@ struct load_weight {
*/
struct sched_avg {
u64 last_update_time, load_sum;
- u32 util_sum, period_contrib;
+ u32 util_sum;
unsigned long load_avg, util_avg;
};

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e0eec0..68273e8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -661,7 +661,7 @@ static unsigned long task_h_load(struct task_struct *p);

/*
* We choose a half-life close to 1 scheduling period.
- * Note: The tables runnable_avg_yN_inv and runnable_avg_yN_sum are
+ * Note: The tables __decay_inv_multiply_N and __accumulated_sum_N are
* dependent on this value.
*/
#define LOAD_AVG_PERIOD 32
@@ -674,12 +674,6 @@ void init_entity_runnable_average(struct sched_entity *se)
struct sched_avg *sa = &se->avg;

sa->last_update_time = 0;
- /*
- * sched_avg's period_contrib should be strictly less then 1024, so
- * we give it 1023 to make sure it is almost a period (1024us), and
- * will definitely be update (after enqueue).
- */
- sa->period_contrib = 1023;
sa->load_avg = scale_load_down(se->load.weight);
sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
/*
@@ -2582,8 +2576,8 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
#endif /* CONFIG_FAIR_GROUP_SCHED */

#ifdef CONFIG_SMP
-/* Precomputed fixed inverse multiplies for multiplication by y^n */
-static const u32 runnable_avg_yN_inv[] = {
+/* Precomputed fixed inverse multipliers for multiplication by y^n */
+static const u32 __decay_inv_multiply_N[] = {
0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6,
0xe0ccdeeb, 0xdbfbb796, 0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46, 0xb504f333, 0xb123f581,
@@ -2593,10 +2587,10 @@ static const u32 runnable_avg_yN_inv[] = {
};

/*
- * Precomputed \Sum y^k { 1<=k<=n }. These are floor(true_value) to prevent
+ * Precomputed \Sum y^k { 1<=k<=n }. These are floor(true_value) to prevent
* over-estimates when re-combining.
*/
-static const u32 runnable_avg_yN_sum[] = {
+static const u32 __accumulated_sum_N[] = {
0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
@@ -2612,58 +2606,54 @@ static const u32 __accumulated_sum_N32[] = {
};

/*
- * Approximate:
- * val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
+ * val * y^n, where y^32 ~= 0.5 and n's unit is ms (slightly bigger than ms),
+ * so 32-bit n is approximately a maximum of 49 (2^32/1000/3600/24) days
*/
-static __always_inline u64 decay_load(u64 val, u64 n)
+static __always_inline u64 __decay_sum(u64 val, u32 n)
{
- unsigned int local_n;
-
if (!n)
return val;
else if (unlikely(n > LOAD_AVG_PERIOD * 63))
return 0;

- /* after bounds checking we can collapse to 32-bit */
- local_n = n;
-
/*
- * As y^PERIOD = 1/2, we can combine
- * y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
- * With a look-up table which covers y^n (n<PERIOD)
- *
- * To achieve constant time decay_load.
+ * As y^32 = 1/2, we can accelerate the computation as:
+ * y^n = 1/2^(n/32) * y^(n%32) with a look-up table which
+ * covers y^x (x<32). This achieves a constant time __decay_sum.
*/
- if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
- val >>= local_n / LOAD_AVG_PERIOD;
- local_n %= LOAD_AVG_PERIOD;
+ if (unlikely(n >= LOAD_AVG_PERIOD)) {
+ val >>= n / LOAD_AVG_PERIOD;
+ n %= LOAD_AVG_PERIOD;
}

- val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
+ val = mul_u64_u32_shr(val, __decay_inv_multiply_N[n], 32);
return val;
}

/*
- * For updates fully spanning n periods, the contribution to runnable
- * average will be: \Sum 1024*y^n
+ * For updates fully spanning n periods, the accumulated contribution
+ * will be: \Sum 1024*y^n, in the meantime the contribution should
+ * be decayed.
+ *
+ * We can compute this efficiently by combining:
+ * y^32 = 1/2 with precomputed \Sum 1024*y^n {for n < 32}
*
- * We can compute this reasonably efficiently by combining:
- * y^PERIOD = 1/2 with precomputed \Sum 1024*y^n {for n <PERIOD}
+ * Here, n is also large enough to hold the number of past periods
*/
-static u32 __compute_runnable_contrib(u64 n)
+static u32 __accumulate_sum(u32 n)
{
u32 contrib = 0;

if (likely(n <= LOAD_AVG_PERIOD))
- return runnable_avg_yN_sum[n];
+ return __accumulated_sum_N[n];
else if (unlikely(n >= LOAD_AVG_MAX_N))
return LOAD_AVG_MAX;

/* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
contrib = __accumulated_sum_N32[n>>5]; /* =n/LOAD_AVG_PERIOD */
n %= LOAD_AVG_PERIOD;
- contrib = decay_load(contrib, n);
- return contrib + runnable_avg_yN_sum[n];
+ contrib = __decay_sum(contrib, n);
+ return contrib + __accumulated_sum_N[n];
}

#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
@@ -2704,11 +2694,14 @@ static __always_inline int
__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
unsigned long weight, int running, struct cfs_rq *cfs_rq)
{
- u64 delta, scaled_delta, periods;
- u32 contrib;
- unsigned int delta_w, scaled_delta_w, decayed = 0;
+ u64 delta;
+ u32 contrib, periods;
unsigned long scale_freq, scale_cpu;

+ /*
+ * now rolls down to a period boundary
+ */
+ now = now && (u64)(~0xFFFFF);
delta = now - sa->last_update_time;
/*
* This should only happen when time goes backwards, which it
@@ -2720,89 +2713,56 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
}

/*
- * Use 1024ns as the unit of measurement since it's a reasonable
- * approximation of 1us and fast to compute.
+ * Use 1024*1024ns as an approximation of 1ms period, pretty close.
*/
- delta >>= 10;
- if (!delta)
+ periods = delta >> 20;
+ if (!periods)
return 0;
sa->last_update_time = now;

scale_freq = arch_scale_freq_capacity(NULL, cpu);
scale_cpu = arch_scale_cpu_capacity(NULL, cpu);

- /* delta_w is the amount already accumulated against our next period */
- delta_w = sa->period_contrib;
- if (delta + delta_w >= 1024) {
- decayed = 1;
-
- /* how much left for next period will start over, we don't know yet */
- sa->period_contrib = 0;
-
- /*
- * Now that we know we're crossing a period boundary, figure
- * out how much from delta we need to complete the current
- * period and accrue it.
- */
- delta_w = 1024 - delta_w;
- scaled_delta_w = cap_scale(delta_w, scale_freq);
- if (weight) {
- sa->load_sum += weight * scaled_delta_w;
- if (cfs_rq) {
- cfs_rq->runnable_load_sum +=
- weight * scaled_delta_w;
- }
- }
- if (running)
- sa->util_sum += scaled_delta_w * scale_cpu;
-
- delta -= delta_w;
-
- /* Figure out how many additional periods this update spans */
- periods = delta / 1024;
- delta %= 1024;
+ /*
+ * Now we know we're crossing period boundaries, and the *_sum accrues by
+ * two steps.
+ */

- sa->load_sum = decay_load(sa->load_sum, periods + 1);
- if (cfs_rq) {
- cfs_rq->runnable_load_sum =
- decay_load(cfs_rq->runnable_load_sum, periods + 1);
- }
- sa->util_sum = decay_load((u64)(sa->util_sum), periods + 1);
-
- /* Efficiently calculate \sum (1..n_period) 1024*y^i */
- contrib = __compute_runnable_contrib(periods);
- contrib = cap_scale(contrib, scale_freq);
- if (weight) {
- sa->load_sum += weight * contrib;
- if (cfs_rq)
- cfs_rq->runnable_load_sum += weight * contrib;
- }
- if (running)
- sa->util_sum += contrib * scale_cpu;
+ /*
+ * Step 1: decay old *_sum
+ */
+ sa->load_sum = __decay_sum(sa->load_sum, periods);
+ if (cfs_rq) {
+ cfs_rq->runnable_load_sum =
+ __decay_sum(cfs_rq->runnable_load_sum, periods);
}
+ sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);

- /* Remainder of delta accrued against u_0` */
- scaled_delta = cap_scale(delta, scale_freq);
+ /*
+ * Step 2: accumulate and meanwhile decay new *_sum by periods since
+ * last_update_time
+ */
+ contrib = __accumulate_sum(periods);
+ contrib = cap_scale(contrib, scale_freq);
if (weight) {
- sa->load_sum += weight * scaled_delta;
+ sa->load_sum += weight * contrib;
if (cfs_rq)
- cfs_rq->runnable_load_sum += weight * scaled_delta;
+ cfs_rq->runnable_load_sum += weight * contrib;
}
if (running)
- sa->util_sum += scaled_delta * scale_cpu;
+ sa->util_sum += contrib * scale_cpu;

- sa->period_contrib += delta;
-
- if (decayed) {
- sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
- if (cfs_rq) {
- cfs_rq->runnable_load_avg =
- div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
- }
- sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
+ /*
+ * *_sum must have been evolved, we update *_avg
+ */
+ sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
+ if (cfs_rq) {
+ cfs_rq->runnable_load_avg =
+ div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
}
+ sa->util_avg = sa->util_sum / LOAD_AVG_MAX;

- return decayed;
+ return 1;
}

#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.1.4

2016-04-11 06:18:48

by Yuyang Du

Subject: [PATCH 4/4] sched/fair: Implement flat hierarchical structure for util_avg

The utilization part of the sched averages follows the task group
hierarchy, but the utils of all group entities and of all cfs_rqs
except the top cfs_rq need not be updated and are never used. More
importantly, the top cfs_rq's util will not reflect the migration of
a group task.

So this patch proposes a flat task hierarchy for util - all tasks
affiliated to an rq are treated as flat, ignoring their task group
levels, and the rq's util is the sum of the utils of all those tasks.

Overhead wise, the rq's util update may be more costly, but we no
longer update any group entity's util, so the net overhead should not
be significant. In addition, the rq's util updates can be very frequent,
but updates within the same period are skipped; this is mostly
effective in update_blocked_averages().
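
As an aside, a minimal Python sketch (an editor's illustration with
made-up task names and numbers, not kernel code) of why a flat per-rq
sum reflects a group task's migration immediately:

# Task utils are plain numbers; "groups" exist only in the names to
# show that the flat sum ignores the task group level entirely.
cpu0_tasks = {"taskA": 300, "groupG/taskB": 200}
cpu1_tasks = {"taskC": 100}

def rq_util_flat(tasks):
    # flat hierarchy: the rq's util is simply the sum over its tasks
    return sum(tasks.values())

print(rq_util_flat(cpu0_tasks), rq_util_flat(cpu1_tasks))  # 500 100

# Migrate the group task B from cpu0 to cpu1: the flat sums change
# immediately, which is what the hierarchical top-level util misses
# until the group entities are re-aged.
cpu1_tasks["groupG/taskB"] = cpu0_tasks.pop("groupG/taskB")
print(rq_util_flat(cpu0_tasks), rq_util_flat(cpu1_tasks))  # 300 300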

Signed-off-by: Yuyang Du <[email protected]>
---
kernel/sched/debug.c | 11 ++---
kernel/sched/fair.c | 123 +++++++++++++++++++++++++++++++--------------------
kernel/sched/sched.h | 5 ++-
3 files changed, 85 insertions(+), 54 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4fbc3bd..ba08d1c 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -395,7 +395,6 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
P(se->load.weight);
#ifdef CONFIG_SMP
P(se->avg.load_avg);
- P(se->avg.util_avg);
#endif
#undef PN
#undef P
@@ -469,6 +468,12 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
print_task(m, rq, p);
}
rcu_read_unlock();
+
+ SEQ_printf(m, "\nutilization: \n");
+ SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
+ rq->avg.util_avg);
+ SEQ_printf(m, " .%-30s: %ld\n", "removed_util_avg",
+ atomic_long_read(&rq->removed_util_avg));
}

void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
@@ -517,12 +522,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->avg.load_avg);
SEQ_printf(m, " .%-30s: %lu\n", "runnable_load_avg",
cfs_rq->runnable_load_avg);
- SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
- cfs_rq->avg.util_avg);
SEQ_printf(m, " .%-30s: %ld\n", "removed_load_avg",
atomic_long_read(&cfs_rq->removed_load_avg));
- SEQ_printf(m, " .%-30s: %ld\n", "removed_util_avg",
- atomic_long_read(&cfs_rq->removed_util_avg));
#ifdef CONFIG_FAIR_GROUP_SCHED
SEQ_printf(m, " .%-30s: %lu\n", "tg_load_avg_contrib",
cfs_rq->tg_load_avg_contrib);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 49e9f1a..67c2730 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -686,45 +686,27 @@ void init_entity_runnable_average(struct sched_entity *se)

/*
* With new tasks being created, their initial util_avgs are extrapolated
- * based on the cfs_rq's current util_avg:
+ * based on the rq's current util_avg. To make the util_avg converge, we
+ * cap the util_avg of successive tasks to only 1/2 of the left utilization
+ * budget:
*
- * util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight
- *
- * However, in many cases, the above util_avg does not give a desired
- * value. Moreover, the sum of the util_avgs may be divergent, such
- * as when the series is a harmonic series.
- *
- * To solve this problem, we also cap the util_avg of successive tasks to
- * only 1/2 of the left utilization budget:
- *
- * util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n
+ * util_avg = (1024 - rq->avg.util_avg) / 2^n
*
* where n denotes the nth task.
*
* For example, a simplest series from the beginning would be like:
*
- * task util_avg: 512, 256, 128, 64, 32, 16, 8, ...
- * cfs_rq util_avg: 512, 768, 896, 960, 992, 1008, 1016, ...
- *
- * Finally, that extrapolated util_avg is clamped to the cap (util_avg_cap)
- * if util_avg > util_avg_cap.
+ * task util_avg: 512, 256, 128, 64, 32, 16, 8, ...
+ * rq util_avg: 512, 768, 896, 960, 992, 1008, 1016, ...
*/
void post_init_entity_util_avg(struct sched_entity *se)
{
- struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ struct rq *rq = rq_of(cfs_rq_of(se));
struct sched_avg *sa = &se->avg;
- long cap = (long)(scale_load_down(SCHED_LOAD_SCALE) - cfs_rq->avg.util_avg) / 2;
+ long cap = (long)(scale_load_down(SCHED_LOAD_SCALE) - rq->avg.util_avg) / 2;

if (cap > 0) {
- if (cfs_rq->avg.util_avg != 0) {
- sa->util_avg = cfs_rq->avg.util_avg * se->load.weight;
- sa->util_avg /= (cfs_rq->avg.load_avg + 1);
-
- if (sa->util_avg > cap)
- sa->util_avg = cap;
- } else {
- sa->util_avg = cap;
- }
+ sa->util_avg = cap;
sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
}
}
@@ -2697,6 +2679,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
u64 delta;
u32 contrib, periods;
unsigned long scale_freq, scale_cpu;
+ int update_util = 0;

/*
* now rolls down to a period boundary
@@ -2720,15 +2703,19 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
return 0;
sa->last_update_time = now;

+ /*
+ * We update util_sum together with load_sum only if it is a task
+ */
+ if (!cfs_rq && entity_is_task(container_of(sa, struct sched_entity, avg)))
+ update_util = 1;
+
scale_freq = arch_scale_freq_capacity(NULL, cpu);
scale_cpu = arch_scale_cpu_capacity(NULL, cpu);

/*
* Now we know we're crossing period boundaries, and the *_sum accrues by
* two steps.
- */
-
- /*
+ *
* Step 1: decay old *_sum
*/
sa->load_sum = __decay_sum(sa->load_sum, periods);
@@ -2736,7 +2723,8 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
cfs_rq->runnable_load_sum =
__decay_sum(cfs_rq->runnable_load_sum, periods);
}
- sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);
+ if (update_util)
+ sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);

/*
* Step 2: accumulate and meanwhile decay new *_sum by periods since
@@ -2749,7 +2737,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
if (cfs_rq)
cfs_rq->runnable_load_sum += weight * contrib;
}
- if (running)
+ if (update_util && running)
sa->util_sum += contrib * scale_cpu;

/*
@@ -2760,6 +2748,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
cfs_rq->runnable_load_avg =
div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
}
+ if (update_util)
+ sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
+
+ if (!cfs_rq)
+ return 1;
+
+ /*
+ * Update rq's util_sum and util_avg
+ */
+ sa = &rq_of(cfs_rq)->avg;
+
+ /*
+ * Of course, we have new delta and periods.
+ */
+ delta = now - sa->last_update_time;
+ periods = delta >> 20;
+
+ /*
+ * The rq's util should be updated much more frequently
+ * than any cfs_rq. If we are here, child cfs_rq should have
+ * already updated load_avg, so we return 1 anyway.
+ */
+ if (!periods)
+ return 1;
+ sa->last_update_time = now;
+
+ /* Step 1 */
+ sa->load_sum = __decay_sum(sa->load_sum, periods);
+
+ /* Step 2 */
+ contrib = __accumulate_sum(periods);
+ contrib = cap_scale(contrib, scale_freq);
+ if (cfs_rq->curr != NULL) /* new running */
+ sa->util_sum += contrib * scale_cpu;
+
sa->util_avg = sa->util_sum / LOAD_AVG_MAX;

return 1;
@@ -2843,6 +2866,7 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
{
struct sched_avg *sa = &cfs_rq->avg;
int decayed, removed = 0;
+ struct rq *rq = rq_of(cfs_rq);

if (atomic_long_read(&cfs_rq->removed_load_avg)) {
s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
@@ -2851,10 +2875,10 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
removed = 1;
}

- if (atomic_long_read(&cfs_rq->removed_util_avg)) {
- long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
- sa->util_avg = max_t(long, sa->util_avg - r, 0);
- sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
+ if (atomic_long_read(&rq->removed_util_avg)) {
+ long r = atomic_long_xchg(&rq->removed_util_avg, 0);
+ rq->avg.util_avg = max_t(long, rq->avg.util_avg - r, 0);
+ rq->avg.util_sum = max_t(s32, rq->avg.util_sum - r * LOAD_AVG_MAX, 0);
}

decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
@@ -2907,12 +2931,14 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
* See cpu_util().
*/
cpufreq_update_util(rq_clock(rq),
- min(cfs_rq->avg.util_avg, max), max);
+ min(rq->avg.util_avg, max), max);
}
}

static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ struct rq *rq = rq_of(cfs_rq);
+
if (!sched_feat(ATTACH_AGE_LOAD))
goto skip_aging;

@@ -2934,20 +2960,22 @@ skip_aging:
se->avg.last_update_time = cfs_rq->avg.last_update_time;
cfs_rq->avg.load_avg += se->avg.load_avg;
cfs_rq->avg.load_sum += se->avg.load_sum;
- cfs_rq->avg.util_avg += se->avg.util_avg;
- cfs_rq->avg.util_sum += se->avg.util_sum;
+ rq->avg.util_avg += se->avg.util_avg;
+ rq->avg.util_sum += se->avg.util_sum;
}

static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ struct rq *rq = rq_of(cfs_rq);
+
__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
&se->avg, se->on_rq * scale_load_down(se->load.weight),
cfs_rq->curr == se, NULL);

cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
cfs_rq->avg.load_sum = max_t(s64, cfs_rq->avg.load_sum - se->avg.load_sum, 0);
- cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
- cfs_rq->avg.util_sum = max_t(s32, cfs_rq->avg.util_sum - se->avg.util_sum, 0);
+ rq->avg.util_avg = max_t(long, rq->avg.util_avg - se->avg.util_avg, 0);
+ rq->avg.util_sum = max_t(s32, rq->avg.util_sum - se->avg.util_sum, 0);
}

/* Add the load generated by se into cfs_rq's load average */
@@ -3017,6 +3045,7 @@ static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
void remove_entity_load_avg(struct sched_entity *se)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ struct rq *rq = rq_of(cfs_rq);
u64 last_update_time;

/*
@@ -3030,7 +3059,7 @@ void remove_entity_load_avg(struct sched_entity *se)

__update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
- atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
+ atomic_long_add(se->avg.util_avg, &rq->removed_util_avg);
}

static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
@@ -5145,7 +5174,7 @@ done:
* compare the utilization with the capacity of the CPU that is available for
* CFS task (ie cpu_capacity).
*
- * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
+ * rq->avg.util_avg is the sum of running time of runnable tasks plus the
* recent utilization of currently non-runnable tasks on a CPU. It represents
* the amount of utilization of a CPU in the range [0..capacity_orig] where
* capacity_orig is the cpu_capacity available at the highest frequency
@@ -5154,9 +5183,9 @@ done:
* current capacity (capacity_curr <= capacity_orig) of the CPU because it is
* the running time on this CPU scaled by capacity_curr.
*
- * Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even
+ * Nevertheless, rq->avg.util_avg can be higher than capacity_curr or even
* higher than capacity_orig because of unfortunate rounding in
- * cfs.avg.util_avg or just after migrating tasks and new task wakeups until
+ * rq->avg.util_avg or just after migrating tasks and new task wakeups until
* the average stabilizes with the new running time. We need to check that the
* utilization stays within the range of [0..capacity_orig] and cap it if
* necessary. Without utilization capping, a group could be seen as overloaded
@@ -5167,7 +5196,7 @@ done:
*/
static int cpu_util(int cpu)
{
- unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
+ unsigned long util = cpu_rq(cpu)->avg.util_avg;
unsigned long capacity = capacity_orig_of(cpu);

return (util >= capacity) ? capacity : util;
@@ -8334,7 +8363,6 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#endif
#ifdef CONFIG_SMP
atomic_long_set(&cfs_rq->removed_load_avg, 0);
- atomic_long_set(&cfs_rq->removed_util_avg, 0);
#endif
}

@@ -8399,7 +8427,6 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
init_cfs_rq(cfs_rq);
init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
init_entity_runnable_average(se);
- post_init_entity_util_avg(se);
}

return 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a7cbad7..c64f8b8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -390,7 +390,7 @@ struct cfs_rq {
#ifdef CONFIG_FAIR_GROUP_SCHED
unsigned long tg_load_avg_contrib;
#endif
- atomic_long_t removed_load_avg, removed_util_avg;
+ atomic_long_t removed_load_avg;
#ifndef CONFIG_64BIT
u64 load_last_update_time_copy;
#endif
@@ -652,6 +652,9 @@ struct rq {

/* This is used to determine avg_idle's max value */
u64 max_idle_balance_cost;
+
+ struct sched_avg avg;
+ atomic_long_t removed_util_avg;
#endif

#ifdef CONFIG_IRQ_TIME_ACCOUNTING
--
2.1.4

2016-04-11 06:19:00

by Yuyang Du

Subject: [PATCH 3/4] sched/fair: Modify accumulated sums for load/util averages

Now that we have dropped the incomplete period, the current period is
always a complete "past" period, since all period boundaries, in the
past or in the future, are predetermined.

With incomplete current period:

| | | | |
-------------------*
^ ^
| |
current

With this patch:

| | | | |
-------------------*
^ ^
| |
current

So, the precomputed sums in __accumulated_sum_N[] and
__accumulated_sum_N32[] should be updated accordingly.

Update the script to generate the constants:

print " #: inv_N sum_N"
print "-----------------------"
y = (0.5)**(1/32.0)
x = 2**32
xx = 1024
for i in range(0, 32):
    if i == 0:
        x = x-1
    else:
        x = x*y
        xx = int(xx*y + 1024)
    print "%2d: %#x %8d" % (i, int(x), int(xx))

print " #: sum_N32"
print "------------"
xxx = xx
for i in range(0, 11):
    if i == 0:
        xxx = xx
    else:
        xxx = xxx/2 + xx
    print "%2d: %8d" % (i, xxx)

Signed-off-by: Yuyang Du <[email protected]>
---
kernel/sched/fair.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 68273e8..49e9f1a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -666,7 +666,7 @@ static unsigned long task_h_load(struct task_struct *p);
*/
#define LOAD_AVG_PERIOD 32
#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
-#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_AVG_MAX */
+#define LOAD_AVG_MAX_N 347 /* number of full periods to produce LOAD_AVG_MAX */

/* Give new sched_entity start runnable values to heavy its load in infant time */
void init_entity_runnable_average(struct sched_entity *se)
@@ -2591,9 +2591,9 @@ static const u32 __decay_inv_multiply_N[] = {
* over-estimates when re-combining.
*/
static const u32 __accumulated_sum_N[] = {
- 0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
- 9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
- 17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
+ 0, 1024, 2026, 3006, 3965, 4904, 5822, 6721, 7600, 8461, 9303,
+ 10127,10933,11722,12494,13250,13990,14714,15422,16115,16793,17457,
+ 18106,18742,19364,19973,20569,21152,21722,22280,22826,23360,23883,
};

/*
@@ -2601,8 +2601,8 @@ static const u32 __accumulated_sum_N[] = {
* lower integers.
*/
static const u32 __accumulated_sum_N32[] = {
- 0, 23371, 35056, 40899, 43820, 45281,
- 46011, 46376, 46559, 46650, 46696, 46719,
+ 0, 23883, 35824, 41795, 44780, 46273,
+ 47019, 47392, 47579, 47672, 47719, 47742,
};

/*
--
2.1.4

2016-04-11 09:08:28

by Vincent Guittot

Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue

Hi Yuyang

On 11 April 2016 at 00:36, Yuyang Du <[email protected]> wrote:
> In __update_load_avg(), the current period is never complete. This
> basically leads to a slightly over-decayed average, say on average we

yes, that also explains why we almost never reach the max value

> have 50% current period, then we will lose 1.08%(=(1-0.5^(1/64)) of
> past avg. More importantly, the incomplete current period significantly
> complicates the avg computation, even a full period is only about 1ms.
>
> So we attempt to drop it. The outcome is that for any x.y periods to
> update, we will either lose the .y period or unduely gain (1-.y) period.
> How big is the impact? For a large x (say 20ms), you barely notice the
> difference, which is plus/minus 1% (=(before-after)/before). Moreover,
> the aggregated losses and gains in the long run should statistically
> even out.
>
> Signed-off-by: Yuyang Du <[email protected]>
> ---
> include/linux/sched.h | 2 +-
> kernel/sched/fair.c | 170 +++++++++++++++++++-------------------------------
> 2 files changed, 66 insertions(+), 106 deletions(-)
>

[snip]

>
> #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
> @@ -2704,11 +2694,14 @@ static __always_inline int
> __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> unsigned long weight, int running, struct cfs_rq *cfs_rq)
> {
> - u64 delta, scaled_delta, periods;
> - u32 contrib;
> - unsigned int delta_w, scaled_delta_w, decayed = 0;
> + u64 delta;
> + u32 contrib, periods;
> unsigned long scale_freq, scale_cpu;
>
> + /*
> + * now rolls down to a period boundary
> + */
> + now = now && (u64)(~0xFFFFF);
> delta = now - sa->last_update_time;
> /*
> * This should only happen when time goes backwards, which it
> @@ -2720,89 +2713,56 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> }
>
> /*
> - * Use 1024ns as the unit of measurement since it's a reasonable
> - * approximation of 1us and fast to compute.
> + * Use 1024*1024ns as an approximation of 1ms period, pretty close.
> */
> - delta >>= 10;
> - if (!delta)
> + periods = delta >> 20;
> + if (!periods)
> return 0;
> sa->last_update_time = now;

The optimization looks quite interesting, but I see one potential issue
with migration: we will lose the part of the ongoing period that is
no longer saved. This lost part can be quite significant for a short
task that ping-pongs between CPUs.

>
> scale_freq = arch_scale_freq_capacity(NULL, cpu);
> scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
>
> - /* delta_w is the amount already accumulated against our next period */
> - delta_w = sa->period_contrib;
> - if (delta + delta_w >= 1024) {
> - decayed = 1;
> -
> - /* how much left for next period will start over, we don't know yet */
> - sa->period_contrib = 0;
> -
> - /*
> - * Now that we know we're crossing a period boundary, figure
> - * out how much from delta we need to complete the current
> - * period and accrue it.
> - */
> - delta_w = 1024 - delta_w;
> - scaled_delta_w = cap_scale(delta_w, scale_freq);
> - if (weight) {
> - sa->load_sum += weight * scaled_delta_w;
> - if (cfs_rq) {
> - cfs_rq->runnable_load_sum +=
> - weight * scaled_delta_w;
> - }
> - }
> - if (running)
> - sa->util_sum += scaled_delta_w * scale_cpu;
> -
> - delta -= delta_w;
> -
> - /* Figure out how many additional periods this update spans */
> - periods = delta / 1024;
> - delta %= 1024;
> + /*
> + * Now we know we're crossing period boundaries, and the *_sum accrues by
> + * two steps.
> + */
>
> - sa->load_sum = decay_load(sa->load_sum, periods + 1);
> - if (cfs_rq) {
> - cfs_rq->runnable_load_sum =
> - decay_load(cfs_rq->runnable_load_sum, periods + 1);
> - }
> - sa->util_sum = decay_load((u64)(sa->util_sum), periods + 1);
> -
> - /* Efficiently calculate \sum (1..n_period) 1024*y^i */
> - contrib = __compute_runnable_contrib(periods);
> - contrib = cap_scale(contrib, scale_freq);
> - if (weight) {
> - sa->load_sum += weight * contrib;
> - if (cfs_rq)
> - cfs_rq->runnable_load_sum += weight * contrib;
> - }
> - if (running)
> - sa->util_sum += contrib * scale_cpu;
> + /*
> + * Step 1: decay old *_sum
> + */
> + sa->load_sum = __decay_sum(sa->load_sum, periods);
> + if (cfs_rq) {
> + cfs_rq->runnable_load_sum =
> + __decay_sum(cfs_rq->runnable_load_sum, periods);
> }
> + sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);
>
> - /* Remainder of delta accrued against u_0` */
> - scaled_delta = cap_scale(delta, scale_freq);
> + /*
> + * Step 2: accumulate and meanwhile decay new *_sum by periods since
> + * last_update_time
> + */
> + contrib = __accumulate_sum(periods);
> + contrib = cap_scale(contrib, scale_freq);
> if (weight) {
> - sa->load_sum += weight * scaled_delta;
> + sa->load_sum += weight * contrib;
> if (cfs_rq)
> - cfs_rq->runnable_load_sum += weight * scaled_delta;
> + cfs_rq->runnable_load_sum += weight * contrib;
> }
> if (running)
> - sa->util_sum += scaled_delta * scale_cpu;
> + sa->util_sum += contrib * scale_cpu;
>
> - sa->period_contrib += delta;
> -
> - if (decayed) {
> - sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
> - if (cfs_rq) {
> - cfs_rq->runnable_load_avg =
> - div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
> - }
> - sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
> + /*
> + * *_sum must have been evolved, we update *_avg
> + */
> + sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
> + if (cfs_rq) {
> + cfs_rq->runnable_load_avg =
> + div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
> }
> + sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
>
> - return decayed;
> + return 1;
> }
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> --
> 2.1.4
>

2016-04-11 09:08:39

by Vincent Guittot

Subject: Re: [PATCH 1/4] sched/fair: Optimize sum computation with a lookup table

On 11 April 2016 at 00:36, Yuyang Du <[email protected]> wrote:
> __compute_runnable_contrib() uses a loop to compute sum, whereas a
> table loopup can do it faster in a constant time.
>
> The following python script can be used to generate the constants:
>
> print " #: yN_inv yN_sum"
> print "-----------------------"
> y = (0.5)**(1/32.0)
> x = 2**32
> xx = 1024
> for i in range(0, 32):
> if i == 0:
> x = x-1
> xx = xx*y
> else:
> x = x*y
> xx = int(xx*y + 1024*y)
> print "%2d: %#x %8d" % (i, int(x), int(xx))
>
> print " #: sum_N32"
> print "------------"
> xxx = xx
> for i in range(0, 11):
> if i == 0:
> xxx = xx
> else:
> xxx = xxx/2 + xx
> print "%2d: %8d" % (i, xxx)
>
> Signed-off-by: Yuyang Du <[email protected]>
> Reviewed-by: Morten Rasmussen <[email protected]>
> ---
> kernel/sched/fair.c | 20 ++++++++++++--------
> 1 file changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b8cc1c3..6e0eec0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2603,6 +2603,15 @@ static const u32 runnable_avg_yN_sum[] = {
> };
>
> /*
> + * Precomputed \Sum y^k { 1<=k<=n, where n%32=0). Values are rolled down to
> + * lower integers.
> + */
> +static const u32 __accumulated_sum_N32[] = {
> + 0, 23371, 35056, 40899, 43820, 45281,
> + 46011, 46376, 46559, 46650, 46696, 46719,
> +};
> +
> +/*
> * Approximate:
> * val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
> */
> @@ -2650,14 +2659,9 @@ static u32 __compute_runnable_contrib(u64 n)
> else if (unlikely(n >= LOAD_AVG_MAX_N))
> return LOAD_AVG_MAX;
>
> - /* Compute \Sum k^n combining precomputed values for k^i, \Sum k^j */
> - do {
> - contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
> - contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];
> -
> - n -= LOAD_AVG_PERIOD;
> - } while (n > LOAD_AVG_PERIOD);
> -
> + /* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
> + contrib = __accumulated_sum_N32[n>>5]; /* =n/LOAD_AVG_PERIOD */
> + n %= LOAD_AVG_PERIOD;
> contrib = decay_load(contrib, n);
> return contrib + runnable_avg_yN_sum[n];
> }

FWIW, you can add my Acked-by.

> --
> 2.1.4
>

2016-04-11 10:39:54

by Juri Lelli

Subject: Re: [PATCH 1/4] sched/fair: Optimize sum computation with a lookup table

Hi,

On 11/04/16 06:36, Yuyang Du wrote:
> __compute_runnable_contrib() uses a loop to compute sum, whereas a
> table loopup can do it faster in a constant time.
>
> The following python script can be used to generate the constants:
>
> print " #: yN_inv yN_sum"
> print "-----------------------"
> y = (0.5)**(1/32.0)
> x = 2**32
> xx = 1024
> for i in range(0, 32):
> if i == 0:
> x = x-1
> xx = xx*y
> else:
> x = x*y
> xx = int(xx*y + 1024*y)
> print "%2d: %#x %8d" % (i, int(x), int(xx))
>
> print " #: sum_N32"
> print "------------"
> xxx = xx
> for i in range(0, 11):
> if i == 0:
> xxx = xx
> else:
> xxx = xxx/2 + xx
> print "%2d: %8d" % (i, xxx)
>

Thanks for the script, really useful. Do you think there is value in
making it general? Like if we want to play with, or need to change,
LOAD_AVG_PERIOD in the future to something different than 32.

Also, does the following assume LOAD_AVG_PERIOD == 32? And if yes, do
you think there is any value in removing that assumption?

Best,

- Juri

> Signed-off-by: Yuyang Du <[email protected]>
> Reviewed-by: Morten Rasmussen <[email protected]>
> ---
> kernel/sched/fair.c | 20 ++++++++++++--------
> 1 file changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b8cc1c3..6e0eec0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2603,6 +2603,15 @@ static const u32 runnable_avg_yN_sum[] = {
> };
>
> /*
> + * Precomputed \Sum y^k { 1<=k<=n, where n%32=0). Values are rolled down to
> + * lower integers.
> + */
> +static const u32 __accumulated_sum_N32[] = {
> + 0, 23371, 35056, 40899, 43820, 45281,
> + 46011, 46376, 46559, 46650, 46696, 46719,
> +};
> +
> +/*
> * Approximate:
> * val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
> */
> @@ -2650,14 +2659,9 @@ static u32 __compute_runnable_contrib(u64 n)
> else if (unlikely(n >= LOAD_AVG_MAX_N))
> return LOAD_AVG_MAX;
>
> - /* Compute \Sum k^n combining precomputed values for k^i, \Sum k^j */
> - do {
> - contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
> - contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];
> -
> - n -= LOAD_AVG_PERIOD;
> - } while (n > LOAD_AVG_PERIOD);
> -
> + /* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
> + contrib = __accumulated_sum_N32[n>>5]; /* =n/LOAD_AVG_PERIOD */
> + n %= LOAD_AVG_PERIOD;
> contrib = decay_load(contrib, n);
> return contrib + runnable_avg_yN_sum[n];
> }
> --
> 2.1.4
>

2016-04-11 12:29:48

by Vincent Guittot

Subject: Re: [PATCH 4/4] sched/fair: Implement flat hierarchical structure for util_avg

Hi Yuyang,

On 11 April 2016 at 00:36, Yuyang Du <[email protected]> wrote:
> The utilization part of the sched averages follows task group's
> hierarchy, but the utils of all group entities and all cfs_rqs except
> the top cfs_rq are needless to update and are never used. And more
> importantly, the top cfs_rq's util will not reflect migration of a
> group task.
>
> So this patch propose a flat task hierarchy for util - all tasks
> affiliated to a rq are flat, ignoring their task group levels, and
> the rq's util is the sum of all the tasks.

In fact, it's only about cfs tasks and not all tasks of the rq, so
your explanation is misleading, and so is the creation of a new
sched_avg struct at rq level.

>
> Overhead wise, the rg's util update may be more costly, but we don't

s/rg/rq/

> update any group entity's util, so the net overhead should not be
> formidable. In addition, rq's util updates can be very frequent,
> but the updates in a period will be skipped, this is mostly effective
> in update_blocked_averages().

Not sure I catch your point here, but the rq->util of an idle CPU will
never be updated before an (idle) load balance, as we call
update_cfs_rq_load_avg(), which no longer updates cfs_rq->avg.util_avg
nor rq->avg.util_avg.

>
> Signed-off-by: Yuyang Du <[email protected]>
> ---
> kernel/sched/debug.c | 11 ++---
> kernel/sched/fair.c | 123 +++++++++++++++++++++++++++++++--------------------
> kernel/sched/sched.h | 5 ++-
> 3 files changed, 85 insertions(+), 54 deletions(-)
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 4fbc3bd..ba08d1c 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -395,7 +395,6 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
> P(se->load.weight);
> #ifdef CONFIG_SMP
> P(se->avg.load_avg);
> - P(se->avg.util_avg);
> #endif
> #undef PN
> #undef P
> @@ -469,6 +468,12 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
> print_task(m, rq, p);
> }
> rcu_read_unlock();
> +
> + SEQ_printf(m, "\nutilization: \n");
> + SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
> + rq->avg.util_avg);
> + SEQ_printf(m, " .%-30s: %ld\n", "removed_util_avg",
> + atomic_long_read(&rq->removed_util_avg));
> }
>
> void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
> @@ -517,12 +522,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
> cfs_rq->avg.load_avg);
> SEQ_printf(m, " .%-30s: %lu\n", "runnable_load_avg",
> cfs_rq->runnable_load_avg);
> - SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
> - cfs_rq->avg.util_avg);
> SEQ_printf(m, " .%-30s: %ld\n", "removed_load_avg",
> atomic_long_read(&cfs_rq->removed_load_avg));
> - SEQ_printf(m, " .%-30s: %ld\n", "removed_util_avg",
> - atomic_long_read(&cfs_rq->removed_util_avg));
> #ifdef CONFIG_FAIR_GROUP_SCHED
> SEQ_printf(m, " .%-30s: %lu\n", "tg_load_avg_contrib",
> cfs_rq->tg_load_avg_contrib);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 49e9f1a..67c2730 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -686,45 +686,27 @@ void init_entity_runnable_average(struct sched_entity *se)
>
> /*
> * With new tasks being created, their initial util_avgs are extrapolated
> - * based on the cfs_rq's current util_avg:
> + * based on the rq's current util_avg. To make the util_avg converge, we
> + * cap the util_avg of successive tasks to only 1/2 of the left utilization
> + * budget:
> *
> - * util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight
> - *
> - * However, in many cases, the above util_avg does not give a desired
> - * value. Moreover, the sum of the util_avgs may be divergent, such
> - * as when the series is a harmonic series.
> - *
> - * To solve this problem, we also cap the util_avg of successive tasks to
> - * only 1/2 of the left utilization budget:
> - *
> - * util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n
> + * util_avg = (1024 - rq->avg.util_avg) / 2^n
> *
> * where n denotes the nth task.
> *
> * For example, a simplest series from the beginning would be like:
> *
> - * task util_avg: 512, 256, 128, 64, 32, 16, 8, ...
> - * cfs_rq util_avg: 512, 768, 896, 960, 992, 1008, 1016, ...
> - *
> - * Finally, that extrapolated util_avg is clamped to the cap (util_avg_cap)
> - * if util_avg > util_avg_cap.
> + * task util_avg: 512, 256, 128, 64, 32, 16, 8, ...
> + * rq util_avg: 512, 768, 896, 960, 992, 1008, 1016, ...
> */
> void post_init_entity_util_avg(struct sched_entity *se)
> {
> - struct cfs_rq *cfs_rq = cfs_rq_of(se);
> + struct rq *rq = rq_of(cfs_rq_of(se));
> struct sched_avg *sa = &se->avg;
> - long cap = (long)(scale_load_down(SCHED_LOAD_SCALE) - cfs_rq->avg.util_avg) / 2;
> + long cap = (long)(scale_load_down(SCHED_LOAD_SCALE) - rq->avg.util_avg) / 2;
>
> if (cap > 0) {
> - if (cfs_rq->avg.util_avg != 0) {
> - sa->util_avg = cfs_rq->avg.util_avg * se->load.weight;
> - sa->util_avg /= (cfs_rq->avg.load_avg + 1);
> -
> - if (sa->util_avg > cap)
> - sa->util_avg = cap;
> - } else {
> - sa->util_avg = cap;
> - }
> + sa->util_avg = cap;
> sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
> }
> }
> @@ -2697,6 +2679,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> u64 delta;
> u32 contrib, periods;
> unsigned long scale_freq, scale_cpu;
> + int update_util = 0;
>
> /*
> * now rolls down to a period boundary
> @@ -2720,15 +2703,19 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> return 0;
> sa->last_update_time = now;
>
> + /*
> + * We update util_sum together with load_sum only if it is a task
> + */
> + if (!cfs_rq && entity_is_task(container_of(sa, struct sched_entity, avg)))
> + update_util = 1;
> +
> scale_freq = arch_scale_freq_capacity(NULL, cpu);
> scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
>
> /*
> * Now we know we're crossing period boundaries, and the *_sum accrues by
> * two steps.
> - */
> -
> - /*
> + *
> * Step 1: decay old *_sum
> */
> sa->load_sum = __decay_sum(sa->load_sum, periods);
> @@ -2736,7 +2723,8 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> cfs_rq->runnable_load_sum =
> __decay_sum(cfs_rq->runnable_load_sum, periods);
> }
> - sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);
> + if (update_util)
> + sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);
>
> /*
> * Step 2: accumulate and meanwhile decay new *_sum by periods since
> @@ -2749,7 +2737,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> if (cfs_rq)
> cfs_rq->runnable_load_sum += weight * contrib;
> }
> - if (running)
> + if (update_util && running)
> sa->util_sum += contrib * scale_cpu;
>
> /*
> @@ -2760,6 +2748,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> cfs_rq->runnable_load_avg =
> div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
> }
> + if (update_util)
> + sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
> +
> + if (!cfs_rq)
> + return 1;
> +
> + /*
> + * Update rq's util_sum and util_avg
> + */

Do we really have to update the utilization of the rq each time we
update a sched_entity? IMHO, we don't need to do this update so often,
even more so if the se is a task_group. But even if it's a task, we
don't especially need to update it right now; we can wait for the
update of the rq->cfs like previously, or we explicitly update it when
we have to do a direct sub/add on the util_avg value.
See also my comment on removed_util_avg below.

> + sa = &rq_of(cfs_rq)->avg;


Why not use the sched_avg of the rq->cfs in order to track the
utilization of the root cfs_rq, instead of adding a new sched_avg into
the rq? Then you call update_cfs_rq_load_avg(rq->cfs) when you want
to update/sync the utilization of rq->cfs, and with one call you
will update both the load_avg and the util_avg of the root cfs instead
of duplicating the sequence in __update_load_avg().

> +
> + /*
> + * Of course, we have new delta and periods.
> + */
> + delta = now - sa->last_update_time;
> + periods = delta >> 20;
> +
> + /*
> + * The rq's util should be updated much more frequently
> + * than any cfs_rq. If we are here, child cfs_rq should have
> + * already updated load_avg, so we return 1 anyway.
> + */
> + if (!periods)
> + return 1;
> + sa->last_update_time = now;
> +
> + /* Step 1 */
> + sa->load_sum = __decay_sum(sa->load_sum, periods);
> +
> + /* Step 2 */
> + contrib = __accumulate_sum(periods);
> + contrib = cap_scale(contrib, scale_freq);
> + if (cfs_rq->curr != NULL) /* new running */
> + sa->util_sum += contrib * scale_cpu;
> +
> sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
>
> return 1;
> @@ -2843,6 +2866,7 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> {
> struct sched_avg *sa = &cfs_rq->avg;
> int decayed, removed = 0;
> + struct rq *rq = rq_of(cfs_rq);
>
> if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> @@ -2851,10 +2875,10 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> removed = 1;
> }
>
> - if (atomic_long_read(&cfs_rq->removed_util_avg)) {
> - long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
> - sa->util_avg = max_t(long, sa->util_avg - r, 0);
> - sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
> + if (atomic_long_read(&rq->removed_util_avg)) {
> + long r = atomic_long_xchg(&rq->removed_util_avg, 0);
> + rq->avg.util_avg = max_t(long, rq->avg.util_avg - r, 0);
> + rq->avg.util_sum = max_t(s32, rq->avg.util_sum - r * LOAD_AVG_MAX, 0);

I see one potential issue here, because the rq->util_avg may (surely)
have already been updated and decayed during the update of a
sched_entity, but before we subtract the removed_util_avg.

> }
>
> decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
> @@ -2907,12 +2931,14 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
> * See cpu_util().
> */
> cpufreq_update_util(rq_clock(rq),
> - min(cfs_rq->avg.util_avg, max), max);
> + min(rq->avg.util_avg, max), max);
> }
> }
>
> static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> + struct rq *rq = rq_of(cfs_rq);
> +
> if (!sched_feat(ATTACH_AGE_LOAD))
> goto skip_aging;
>
> @@ -2934,20 +2960,22 @@ skip_aging:
> se->avg.last_update_time = cfs_rq->avg.last_update_time;
> cfs_rq->avg.load_avg += se->avg.load_avg;
> cfs_rq->avg.load_sum += se->avg.load_sum;
> - cfs_rq->avg.util_avg += se->avg.util_avg;
> - cfs_rq->avg.util_sum += se->avg.util_sum;
> + rq->avg.util_avg += se->avg.util_avg;
> + rq->avg.util_sum += se->avg.util_sum;
> }
>
> static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> + struct rq *rq = rq_of(cfs_rq);
> +
> __update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
> &se->avg, se->on_rq * scale_load_down(se->load.weight),
> cfs_rq->curr == se, NULL);
>
> cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
> cfs_rq->avg.load_sum = max_t(s64, cfs_rq->avg.load_sum - se->avg.load_sum, 0);
> - cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
> - cfs_rq->avg.util_sum = max_t(s32, cfs_rq->avg.util_sum - se->avg.util_sum, 0);
> + rq->avg.util_avg = max_t(long, rq->avg.util_avg - se->avg.util_avg, 0);
> + rq->avg.util_sum = max_t(s32, rq->avg.util_sum - se->avg.util_sum, 0);
> }
>
> /* Add the load generated by se into cfs_rq's load average */
> @@ -3017,6 +3045,7 @@ static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
> void remove_entity_load_avg(struct sched_entity *se)
> {
> struct cfs_rq *cfs_rq = cfs_rq_of(se);
> + struct rq *rq = rq_of(cfs_rq);
> u64 last_update_time;
>
> /*
> @@ -3030,7 +3059,7 @@ void remove_entity_load_avg(struct sched_entity *se)
>
> __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
> atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
> - atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
> + atomic_long_add(se->avg.util_avg, &rq->removed_util_avg);
> }
>
> static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
> @@ -5145,7 +5174,7 @@ done:
> * compare the utilization with the capacity of the CPU that is available for
> * CFS task (ie cpu_capacity).
> *
> - * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
> + * rq->avg.util_avg is the sum of running time of runnable tasks plus the
> * recent utilization of currently non-runnable tasks on a CPU. It represents
> * the amount of utilization of a CPU in the range [0..capacity_orig] where
> * capacity_orig is the cpu_capacity available at the highest frequency
> @@ -5154,9 +5183,9 @@ done:
> * current capacity (capacity_curr <= capacity_orig) of the CPU because it is
> * the running time on this CPU scaled by capacity_curr.
> *
> - * Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even
> + * Nevertheless, rq->avg.util_avg can be higher than capacity_curr or even
> * higher than capacity_orig because of unfortunate rounding in
> - * cfs.avg.util_avg or just after migrating tasks and new task wakeups until
> + * rq->avg.util_avg or just after migrating tasks and new task wakeups until
> * the average stabilizes with the new running time. We need to check that the
> * utilization stays within the range of [0..capacity_orig] and cap it if
> * necessary. Without utilization capping, a group could be seen as overloaded
> @@ -5167,7 +5196,7 @@ done:
> */
> static int cpu_util(int cpu)
> {
> - unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
> + unsigned long util = cpu_rq(cpu)->avg.util_avg;
> unsigned long capacity = capacity_orig_of(cpu);
>
> return (util >= capacity) ? capacity : util;
> @@ -8334,7 +8363,6 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
> #endif
> #ifdef CONFIG_SMP
> atomic_long_set(&cfs_rq->removed_load_avg, 0);
> - atomic_long_set(&cfs_rq->removed_util_avg, 0);
> #endif
> }
>
> @@ -8399,7 +8427,6 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> init_cfs_rq(cfs_rq);
> init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
> init_entity_runnable_average(se);
> - post_init_entity_util_avg(se);
> }
>
> return 1;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index a7cbad7..c64f8b8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -390,7 +390,7 @@ struct cfs_rq {
> #ifdef CONFIG_FAIR_GROUP_SCHED
> unsigned long tg_load_avg_contrib;
> #endif
> - atomic_long_t removed_load_avg, removed_util_avg;
> + atomic_long_t removed_load_avg;
> #ifndef CONFIG_64BIT
> u64 load_last_update_time_copy;
> #endif
> @@ -652,6 +652,9 @@ struct rq {
>
> /* This is used to determine avg_idle's max value */
> u64 max_idle_balance_cost;
> +
> + struct sched_avg avg;
> + atomic_long_t removed_util_avg;
> #endif
>
> #ifdef CONFIG_IRQ_TIME_ACCOUNTING
> --
> 2.1.4
>

2016-04-11 16:59:16

by Dietmar Eggemann

Subject: Re: [PATCH 1/4] sched/fair: Optimize sum computation with a lookup table

On 10/04/16 23:36, Yuyang Du wrote:
> __compute_runnable_contrib() uses a loop to compute sum, whereas a
> table loopup can do it faster in a constant time.
>
> The following python script can be used to generate the constants:
>
> print " #: yN_inv yN_sum"
> print "-----------------------"
> y = (0.5)**(1/32.0)
> x = 2**32
> xx = 1024
> for i in range(0, 32):
> if i == 0:
> x = x-1
> xx = xx*y
> else:
> x = x*y
> xx = int(xx*y + 1024*y)
> print "%2d: %#x %8d" % (i, int(x), int(xx))
>
> print " #: sum_N32"
> print "------------"
> xxx = xx
> for i in range(0, 11):
> if i == 0:
> xxx = xx
> else:
> xxx = xxx/2 + xx
> print "%2d: %8d" % (i, xxx)
>

IMHO, it would be nice to add this to the existing tool from the patch
header of commit 5b51f2f80b3b
("sched: Make __update_entity_runnable_avg() fast") simply because people
already use this one to tweak their pelt tables. Maybe something like

diff --git a/pelt.c b/pelt.c
index 63e32d1d18b0..b36194e8bb9c 100644
--- a/pelt.c
+++ b/pelt.c
@@ -6,6 +6,8 @@

const long WMULT_CONST = ((1UL << N) - 1);
double y;
+int ld_avg_max_n;
+double sum_fl_n;

long runnable_avg_yN_inv[N];
void calc_mult_inv() {
@@ -42,6 +44,7 @@ void calc_yn_sum(int n)
printf("%2d: %8.0f %8.0f %8.0f\n", i, sum, sum_fl,
sum_fl - sum);
}
+ sum_fl_n = sum_fl;
printf("\n");
}

@@ -55,14 +58,29 @@ void calc_conv(long n) {
n = mult_inv(n, 1) + 1024;
i++;
} while (n != old_n);
+ ld_avg_max_n = i - 1;
printf("%d> %ld\n", i - 1, n);
printf("\n");
}

+void calc_acc_sum() {
+ int i = 1;
+ double sum = sum_fl_n;
+ int periods = ld_avg_max_n/N + 1;
+
+ printf("sum acc\n");
+
+ do {
+ printf("%2d: %8.0f\n", i, sum);
+ sum = floor(sum/2 + sum_fl_n);
+ } while (++i <= periods);
+}
+
void main() {
y = pow(0.5, 1/(double)N);
calc_mult_inv();
calc_conv(1024);
calc_yn_sum(N);
+ calc_acc_sum();
}

[...]

2016-04-11 17:14:10

by Dietmar Eggemann

Subject: Re: [PATCH 3/4] sched/fair: Modify accumulated sums for load/util averages

On 10/04/16 23:36, Yuyang Du wrote:
> After we dropped the incomplete period, the current period should be
> a complete "past" period, since all period boundaries, in the past
> or in the future, are predetermined.
>
> With incomplete current period:
>
> | | | | |
> -------------------*
> ^ ^
> | |
> current
>
> With this patch:
>
> | | | | |
> -------------------*
> ^ ^
> | |
> current
>
> So, the precomputed sums in __accumulated_sum_N[] and
> __accumulated_sum_N32[] should be updated accordingly.
>
> Update the script to generate the constants:
>
> print " #: inv_N sum_N"
> print "-----------------------"
> y = (0.5)**(1/32.0)
> x = 2**32
> xx = 1024
> for i in range(0, 32):
> if i == 0:
> x = x-1
> else:
> x = x*y
> xx = int(xx*y + 1024)
> print "%2d: %#x %8d" % (i, int(x), int(xx))
>

So, since with this patch series we always operate on period boundaries,
you want to add 1024 instead of 1024*y (1024 * 0.5^(1/32) = 1002) for
the most recent period in the past?

Paul Turner's pelt program could be adapted like this:

diff --git a/pelt.c b/pelt.c
index b36194e8bb9c..3c6b42e88c2d 100644
--- a/pelt.c
+++ b/pelt.c
@@ -39,8 +39,8 @@ void calc_yn_sum(int n)
printf("sum y^n\n");
printf(" %8s %8s %8s\n", "exact", "floor", "error");
for (i = 1; i <= n; i++) {
- sum = (y * sum + y * 1024);
- sum_fl = floor(y * sum_fl+ y * 1024);
+ sum = (y * sum + 1024);
+ sum_fl = floor(y * sum_fl + 1024);
printf("%2d: %8.0f %8.0f %8.0f\n", i, sum, sum_fl,
sum_fl - sum);
}

to achieve this.

$ ./pelt

sum y^n
exact *floor* error
1: 1024 1024 0
2: 2026 2026 -0
...
31: 23371 23360 -11
32: 23894 23883 -11

> print " #: sum_N32"
> print "------------"
> xxx = xx
> for i in range(0, 11):
> if i == 0:
> xxx = xx
> else:
> xxx = xxx/2 + xx
> print "%2d: %8d" % (i, xxx)
>
> Signed-off-by: Yuyang Du <[email protected]>
> ---
> kernel/sched/fair.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 68273e8..49e9f1a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -666,7 +666,7 @@ static unsigned long task_h_load(struct task_struct *p);
> */
> #define LOAD_AVG_PERIOD 32
> #define LOAD_AVG_MAX 47742 /* maximum possible load avg */
> -#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_AVG_MAX */
> +#define LOAD_AVG_MAX_N 347 /* number of full periods to produce LOAD_AVG_MAX */
>
> /* Give new sched_entity start runnable values to heavy its load in infant time */
> void init_entity_runnable_average(struct sched_entity *se)
> @@ -2591,9 +2591,9 @@ static const u32 __decay_inv_multiply_N[] = {
> * over-estimates when re-combining.
> */
> static const u32 __accumulated_sum_N[] = {
> - 0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
> - 9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
> - 17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
> + 0, 1024, 2026, 3006, 3965, 4904, 5822, 6721, 7600, 8461, 9303,
> + 10127,10933,11722,12494,13250,13990,14714,15422,16115,16793,17457,
> + 18106,18742,19364,19973,20569,21152,21722,22280,22826,23360,23883,
> };
>
> /*
> @@ -2601,8 +2601,8 @@ static const u32 __accumulated_sum_N[] = {
> * lower integers.
> */
> static const u32 __accumulated_sum_N32[] = {
> - 0, 23371, 35056, 40899, 43820, 45281,
> - 46011, 46376, 46559, 46650, 46696, 46719,
> + 0, 23883, 35824, 41795, 44780, 46273,
> + 47019, 47392, 47579, 47672, 47719, 47742,
> };
>
> /*
>
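
The effect of the changed per-period contribution can also be seen in isolation
with a small standalone sketch (double precision, flooring as in the quoted
pelt.c output; this is an illustration, not the actual tool):

#include <stdio.h>
#include <math.h>

#define LOAD_AVG_PERIOD	32

int main(void)
{
	double y = pow(0.5, 1.0 / LOAD_AVG_PERIOD);
	double old_sum = 0, new_sum = 0;
	int i;

	/*
	 * Old convention: the most recent period contributes 1024*y
	 * (incomplete current period). New convention: it contributes
	 * a full 1024, since we now always sit on a period boundary.
	 */
	for (i = 1; i <= LOAD_AVG_PERIOD; i++) {
		old_sum = floor(y * old_sum + y * 1024);
		new_sum = floor(y * new_sum + 1024);
	}
	printf("full period window: old %.0f, new %.0f\n", old_sum, new_sum);
	return 0;
}

This should print 23371 and 23883, i.e. the last entries of the old and new
__accumulated_sum_N[] in the hunk above.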

2016-04-11 23:07:31

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Optimize sum computation with a lookup table

A few comments:

On Mon, 2016-04-11 at 06:36 +0800, Yuyang Du wrote:
> __compute_runnable_contrib() uses a loop to compute sum, whereas a
> table loopup can do it faster in a constant time.

lookup typo

> The following python script can be used to generate the constants:

Thanks for including the script.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
[]
> @@ -2603,6 +2603,15 @@ static const u32 runnable_avg_yN_sum[] = {
> };
> 
> /*
> + * Precomputed \Sum y^k { 1<=k<=n, where n%32=0). Values are rolled down to
> + * lower integers.
> + */
> +static const u32 __accumulated_sum_N32[] = {
> +	0, 23371, 35056, 40899, 43820, 45281,
> +	46011, 46376, 46559, 46650, 46696, 46719,
> +};
> +
> +/*
>  * Approximate:
>  *   val * y^n,	where y^32 ~= 0.5 (~1 scheduling period)
>  */
> @@ -2650,14 +2659,9 @@ static u32 __compute_runnable_contrib(u64 n)
> 	else if (unlikely(n >= LOAD_AVG_MAX_N))
> 		return LOAD_AVG_MAX;
> 
> - /* Compute \Sum k^n combining precomputed values for k^i, \Sum k^j */
> - do {
> - contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
> - contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];
> -
> - n -= LOAD_AVG_PERIOD;
> - } while (n > LOAD_AVG_PERIOD);
> -
> + /* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
> + contrib = __accumulated_sum_N32[n>>5]; /* =n/LOAD_AVG_PERIOD */

Perhaps the variable name could be improved and
the comment block around the runnable_avg_yN
declarations should include this new declaration.

2016-04-11 23:21:15

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Optimize sum computation with a lookup table

On Mon, 2016-04-11 at 17:59 +0100, Dietmar Eggemann wrote:
[]
> IMHO, it would be nice to add this to the existing tool from the patch
> header of commit 5b51f2f80b3b
> ("sched: Make __update_entity_runnable_avg() fast") simply because people
> already use this one to tweak their pelt tables.

Maybe create a file and put it in tools somewhere.
Maybe tools/sched ?

2016-04-12 02:54:44

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Optimize sum computation with a lookup table

On Mon, Apr 11, 2016 at 11:41:28AM +0100, Juri Lelli wrote:
> Hi,
>
> On 11/04/16 06:36, Yuyang Du wrote:
> > __compute_runnable_contrib() uses a loop to compute sum, whereas a
> > table loopup can do it faster in a constant time.
> >
> > The following python script can be used to generate the constants:
> >
> > print " #: yN_inv yN_sum"
> > print "-----------------------"
> > y = (0.5)**(1/32.0)
> > x = 2**32
> > xx = 1024
> > for i in range(0, 32):
> > if i == 0:
> > x = x-1
> > xx = xx*y
> > else:
> > x = x*y
> > xx = int(xx*y + 1024*y)
> > print "%2d: %#x %8d" % (i, int(x), int(xx))
> >
> > print " #: sum_N32"
> > print "------------"
> > xxx = xx
> > for i in range(0, 11):
> > if i == 0:
> > xxx = xx
> > else:
> > xxx = xxx/2 + xx
> > print "%2d: %8d" % (i, xxx)
> >
>
> Thanks for the script, really useful. Do you think there is value in
> making it general? Like if we want to play with/need changing LOAD_AVG_
> PERIOD in the future to something different than 32.

i think a s/32/xx/ should work.

> Also, does the following assume LOAD_AVG_PERIOD == 32? And if yes, do
> you think there is any value in removing that assumption?

Like Peter said, we are heavily dependent on it already. Whether a half-life
of 32 periods (or ~32ms) is the best, maybe we can try 16, but definitely not
64. Or whether exponential decay is the best to compute the impact of old
runnable/running times as a prediction, it is just that I can't think of a better
approach yet, and credits to Paul, Ben, et al.

2016-04-12 02:59:47

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Optimize sum computation with a lookup table

On Mon, Apr 11, 2016 at 05:59:11PM +0100, Dietmar Eggemann wrote:
> On 10/04/16 23:36, Yuyang Du wrote:
> > __compute_runnable_contrib() uses a loop to compute sum, whereas a
> > table loopup can do it faster in a constant time.
> >
> > The following python script can be used to generate the constants:
> >
> > print " #: yN_inv yN_sum"
> > print "-----------------------"
> > y = (0.5)**(1/32.0)
> > x = 2**32
> > xx = 1024
> > for i in range(0, 32):
> > if i == 0:
> > x = x-1
> > xx = xx*y
> > else:
> > x = x*y
> > xx = int(xx*y + 1024*y)
> > print "%2d: %#x %8d" % (i, int(x), int(xx))
> >
> > print " #: sum_N32"
> > print "------------"
> > xxx = xx
> > for i in range(0, 11):
> > if i == 0:
> > xxx = xx
> > else:
> > xxx = xxx/2 + xx
> > print "%2d: %8d" % (i, xxx)
> >
>
> IMHO, it would be nice to add this to the existing tool from the patch
> header of commit 5b51f2f80b3b
> ("sched: Make __update_entity_runnable_avg() fast") simply because people
> already use this one to tweak their pelt tables. Maybe something like

I'd prefer not, and recommend switching from the C program for this
kind of job. :)

2016-04-12 03:24:11

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue

Hi Vincent,

On Mon, Apr 11, 2016 at 11:08:04AM +0200, Vincent Guittot wrote:
> > @@ -2704,11 +2694,14 @@ static __always_inline int
> > __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > {
> > - u64 delta, scaled_delta, periods;
> > - u32 contrib;
> > - unsigned int delta_w, scaled_delta_w, decayed = 0;
> > + u64 delta;
> > + u32 contrib, periods;
> > unsigned long scale_freq, scale_cpu;
> >
> > + /*
> > + * now rolls down to a period boundary
> > + */
> > + now = now && (u64)(~0xFFFFF);
> > delta = now - sa->last_update_time;
> > /*
> > * This should only happen when time goes backwards, which it
> > @@ -2720,89 +2713,56 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > }
> >
> > /*
> > - * Use 1024ns as the unit of measurement since it's a reasonable
> > - * approximation of 1us and fast to compute.
> > + * Use 1024*1024ns as an approximation of 1ms period, pretty close.
> > */
> > - delta >>= 10;
> > - if (!delta)
> > + periods = delta >> 20;
> > + if (!periods)
> > return 0;
> > sa->last_update_time = now;
>
> The optimization looks quite interesting but I see one potential issue
> with migration as we will lose the part of the ongoing period that is
> now not saved anymore. This lost part can be quite significant for a
> short task that ping pongs between CPUs.

Yes, basically, it is that we lose precision (~1ms scale in contrast with ~1us scale).
But as I wrote, we may either lose a sub-1ms, or gain a sub-1ms, statistically,
they should even out, given the load/util updates are quite a large number of
samples, and we do want a lot of samples for the metrics, this is the point of
the entire average thing. Plus, as you also said, the incomplete current period
also plays a (somewhat) negative role here.

In addition, excluding the flat hierarchical util patch, we gain quite some
efficiency.
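
For reference, the rounding arithmetic being discussed (with the &&/& fix that
comes up later in the thread) boils down to something like the sketch below;
the helper name is made up for illustration:

#include <stdio.h>

#define PERIOD_SHIFT	20				/* 1024*1024ns ~= 1ms */
#define PERIOD_MASK	((1ULL << PERIOD_SHIFT) - 1)

/* Hypothetical helper: whole periods elapsed since the last update */
static unsigned int elapsed_periods(unsigned long long *last, unsigned long long now)
{
	unsigned long long delta;
	unsigned int periods;

	now &= ~PERIOD_MASK;		/* roll now down to a period boundary */
	delta = now - *last;
	periods = delta >> PERIOD_SHIFT;
	if (periods)
		*last = now;		/* only advance on a full period */
	return periods;
}

int main(void)
{
	unsigned long long last = 0;

	printf("%u\n", elapsed_periods(&last, 2500000ULL));	/* ~2.5ms -> 2 */
	printf("%u\n", elapsed_periods(&last, 3000000ULL));	/* +0.5ms -> 0 */
	return 0;
}

The second call shows where the precision concern comes from: the sub-period
remainder is not accounted until a later update crosses the next boundary,
which matters for tasks that migrate often.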

2016-04-12 04:20:04

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH 4/4] sched/fair: Implement flat hierarchical structure for util_avg

Hi Vincent,

On Mon, Apr 11, 2016 at 02:29:25PM +0200, Vincent Guittot wrote:
>
> > update any group entity's util, so the net overhead should not be
> > formidable. In addition, rq's util updates can be very frequent,
> > but the updates in a period will be skipped, this is mostly effective
> > in update_blocked_averages().
>
> Not sure to catch your point here but the rq->util of an idle CPU will
> never be updated before a (idle) load balance as we call
> update_cfs_rq_load_avg which doesn't update the cfs_rq->avg.util_avg
> anymore nor rq->avg.util_avg.

I meant that in update_blocked_averages(), the rq's util won't be updated
every time a cfs_rq is updated if they are in the same period (~1ms), which is
probably the case.

The idle CPU thing is no different, so it is anyway the responsibility of
any other active CPU to take the idle CPU's idle time into account for any
purpose.

>
> > + if (update_util)
> > + sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
> > +
> > + if (!cfs_rq)
> > + return 1;
> > +
> > + /*
> > + * Update rq's util_sum and util_avg
> > + */
>
> Do we really have to update the utilization of rq each time we update
> a sched_entity ? IMHO, we don't need to do this update so often even
> more if the se is a task_group. But even if it's a task, we don't
> especially need to update it right now but we can wait for the update
> of the rq->cfs like previously or we explicilty update it when we have
> to do a direct sub/add on the util_avg value
> See also my comment on removed_util_avg below
>

No, we only update a rq's util if the update is for a cfs_rq, not a
sched_entity, which is the if (!cfs_rq) condition's job.

> Why not using the sched_avg of the rq->cfs in order to track the
> utilization of the root cfs_rq instead of adding a new sched_avg into
> the rq ? Then you call update_cfs_rq_load_avg(rq->cfs) when you want
> to update/sync the utilization of the rq->cfs and for one call you
> will update both the load_avg and the util_avg of the root cfs instead
> of duplicating the sequence in _update_load_avg

This is the approach taken by Dietmar in his patch, and it is fairly easy.
The problem is that, if we do so, the root cfs_rq is updated only when it is
itself the cfs_rq being updated. A simple contrived case can make it never
get updated except in update_blocked_averages(). My impression is that this
might be too much precision loss.

And thus we take this alternative approach, and thus I revisited
__update_load_avg() to optimize it.

[snip]

> > - if (atomic_long_read(&cfs_rq->removed_util_avg)) {
> > - long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
> > - sa->util_avg = max_t(long, sa->util_avg - r, 0);
> > - sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
> > + if (atomic_long_read(&rq->removed_util_avg)) {
> > + long r = atomic_long_xchg(&rq->removed_util_avg, 0);
> > + rq->avg.util_avg = max_t(long, rq->avg.util_avg - r, 0);
> > + rq->avg.util_sum = max_t(s32, rq->avg.util_sum - r * LOAD_AVG_MAX, 0);
>
> I see one potential issue here because the rq->util_avg may (surely)
> have been already updated and decayed during the update of a
sched_entity but before we subtract the removed_util_avg

This is the same now, because cfs_rq will be regularly updated in
update_blocked_averages(), which basically means cfs_rq will be newer
than task for sure, although task tries to catch up when removed.
Well, yes, in this patch with the rq, the situation is even more so. But,
hopefully, this is not a formidable concern.
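
For readers following the removed_util_avg part of this discussion, the pattern
in the quoted hunks is roughly the following. This is a simplified,
single-threaded sketch with plain longs standing in for the kernel's
atomic_long_t, and the two function names are made up for illustration:

#include <stdio.h>

#define LOAD_AVG_MAX	47742

static long rq_util_avg;
static long rq_util_sum;
static long rq_removed_util_avg;	/* atomic_long_t in the kernel */

/* The source CPU queues a migrating task's util here ... */
static void remove_task_util(long task_util_avg)
{
	rq_removed_util_avg += task_util_avg;	/* atomic_long_add() */
}

/* ... and it is subtracted the next time the rq's util is updated */
static void sync_removed_util(void)
{
	long r = rq_removed_util_avg;		/* atomic_long_xchg(.., 0) */

	rq_removed_util_avg = 0;
	rq_util_avg = rq_util_avg > r ? rq_util_avg - r : 0;
	rq_util_sum = rq_util_sum > r * LOAD_AVG_MAX ?
		      rq_util_sum - r * LOAD_AVG_MAX : 0;
}

int main(void)
{
	rq_util_avg = 600;
	rq_util_sum = rq_util_avg * LOAD_AVG_MAX;

	remove_task_util(200);		/* a task with util_avg 200 migrates away */
	sync_removed_util();		/* next update of the rq's util */
	printf("rq util_avg after removal: %ld\n", rq_util_avg);	/* 400 */
	return 0;
}

Vincent's concern above is about when the subtraction happens relative to the
decay of the rq's average between those two steps, which this sketch
deliberately does not model.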

2016-04-12 10:12:41

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Optimize sum computation with a lookup table

On 12/04/16 03:12, Yuyang Du wrote:
> On Mon, Apr 11, 2016 at 11:41:28AM +0100, Juri Lelli wrote:
> > Hi,
> >
> > On 11/04/16 06:36, Yuyang Du wrote:
> > > __compute_runnable_contrib() uses a loop to compute sum, whereas a
> > > table loopup can do it faster in a constant time.
> > >
> > > The following python script can be used to generate the constants:
> > >
> > > print " #: yN_inv yN_sum"
> > > print "-----------------------"
> > > y = (0.5)**(1/32.0)
> > > x = 2**32
> > > xx = 1024
> > > for i in range(0, 32):
> > > if i == 0:
> > > x = x-1
> > > xx = xx*y
> > > else:
> > > x = x*y
> > > xx = int(xx*y + 1024*y)
> > > print "%2d: %#x %8d" % (i, int(x), int(xx))
> > >
> > > print " #: sum_N32"
> > > print "------------"
> > > xxx = xx
> > > for i in range(0, 11):
> > > if i == 0:
> > > xxx = xx
> > > else:
> > > xxx = xxx/2 + xx
> > > print "%2d: %8d" % (i, xxx)
> > >
> >
> > Thanks for the script, really useful. Do you think there is value in
> > making it general? Like if we want to play with/need changing LOAD_AVG_
> > PERIOD in the future to something different than 32.
>
> i think a s/32/xx/ should work.
>
> > Also, does the following assume LOAD_AVG_PERIOD == 32? And if yes, do
> > you think there is any value in removing that assumption?
>
> Like Peter said, we are heavily dependent on it already.

But I think the current code should still work if we define LOAD_AVG_
PERIOD as, say, 16 and we use Paul's program to recompute the tables.

My point was about trying to keep everything related to LOAD_AVG_PERIOD
and not start assuming it is 32. I'm not saying your changes assume
that, I was asking if they do.

> Whether a half-life
> of 32 periods (or ~32ms) is the best, maybe we can try 16, but definitely not
> 64. Or whether exponential decay is the best to compute the impact of old
> runnable/running times as a pridiction, it is just I can't think of a better
> approach yet, and credits to Paul, Ben, et al.
>

That is fine, I think. Another thing is crafting the code around a
particular half-life, IMHO.

Best,

- Juri

2016-04-12 11:56:52

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue

Le Tuesday 12 Apr 2016 à 03:41:41 (+0800), Yuyang Du a écrit :
> Hi Vincent,
>
> On Mon, Apr 11, 2016 at 11:08:04AM +0200, Vincent Guittot wrote:
> > > @@ -2704,11 +2694,14 @@ static __always_inline int
> > > __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > > unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > > {
> > > - u64 delta, scaled_delta, periods;
> > > - u32 contrib;
> > > - unsigned int delta_w, scaled_delta_w, decayed = 0;
> > > + u64 delta;
> > > + u32 contrib, periods;
> > > unsigned long scale_freq, scale_cpu;
> > >
> > > + /*
> > > + * now rolls down to a period boundary
> > > + */
> > > + now = now && (u64)(~0xFFFFF);
> > > delta = now - sa->last_update_time;
> > > /*
> > > * This should only happen when time goes backwards, which it
> > > @@ -2720,89 +2713,56 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > > }
> > >
> > > /*
> > > - * Use 1024ns as the unit of measurement since it's a reasonable
> > > - * approximation of 1us and fast to compute.
> > > + * Use 1024*1024ns as an approximation of 1ms period, pretty close.
> > > */
> > > - delta >>= 10;
> > > - if (!delta)
> > > + periods = delta >> 20;
> > > + if (!periods)
> > > return 0;
> > > sa->last_update_time = now;
> >
> > The optimization looks quite interesting but I see one potential issue
> > with migration as we will lose the part of the ongoing period that is
> > now not saved anymore. This lost part can be quite significant for a
> > short task that ping pongs between CPUs.
>
> Yes, basically, it is we lose precision (~1ms scale in contrast with ~1us scale).

But with a HZ set to 1000 and a sched-slice in the same range, having a precision
of 1ms instead of 1us makes the precision of load tracking of short tasks quite
random IMHO.

you can keep recording this partial period without using it in the load tracking.
Something like below keep precision without sacrifying the optimization.

---
kernel/sched/fair.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 68273e8..b234169 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -674,6 +674,12 @@ void init_entity_runnable_average(struct sched_entity *se)
struct sched_avg *sa = &se->avg;

sa->last_update_time = 0;
+ /*
> + * sched_avg's period_contrib should be strictly less than 1024 * 1024, so
+ * we give it 1023 * 1024 to make sure it is almost a period (1024us), and
+ * will definitely be updated (after enqueue).
+ */
+ sa->period_contrib = 1023*1024;
sa->load_avg = scale_load_down(se->load.weight);
sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
/*
@@ -2698,10 +2704,6 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
u32 contrib, periods;
unsigned long scale_freq, scale_cpu;

- /*
- * now rolls down to a period boundary
- */
- now = now && (u64)(~0xFFFFF);
delta = now - sa->last_update_time;
/*
* This should only happen when time goes backwards, which it
@@ -2712,6 +2714,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
return 0;
}

+ /* Add how much left for the current period */
+ delta += sa->period_contrib;
+
/*
* Use 1024*1024ns as an approximation of 1ms period, pretty close.
*/
@@ -2720,6 +2725,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
return 0;
sa->last_update_time = now;

+ /* Get how much left for the next period */
+ sa->period_contrib = delta & (u64)(0xFFFFF);
+
scale_freq = arch_scale_freq_capacity(NULL, cpu);
scale_cpu = arch_scale_cpu_capacity(NULL, cpu);

> But as I wrote, we may either lose a sub-1ms, or gain a sub-1ms, statistically,
> they should even out, given the load/util updates are quite a large number of
> samples, and we do want a lot of samples for the metrics, this is the point of
> the entire average thing. Plus, as you also said, the incomplete current period
> also plays a (somewhat) negative role here.
>
> In addition, excluding the flat hierarchical util patch, we gain quite some
> efficiency.
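
The net effect of carrying period_contrib can be seen with a toy accumulator
modelled on the hunk above (the function name is illustrative, and unlike the
real code it does not track last_update_time):

#include <stdio.h>

#define PERIOD_SHIFT	20
#define PERIOD_MASK	((1ULL << PERIOD_SHIFT) - 1)

static unsigned long long contrib;	/* sub-period residue carried over */

static unsigned int account_delta(unsigned long long delta_ns)
{
	unsigned int periods;

	delta_ns += contrib;			/* add what was left over */
	periods = delta_ns >> PERIOD_SHIFT;
	contrib = delta_ns & PERIOD_MASK;	/* keep the remainder */
	return periods;
}

int main(void)
{
	/* two ~0.6ms deltas: neither is a full period on its own */
	printf("%u\n", account_delta(600000ULL));	/* 0 periods, residue kept */
	printf("%u\n", account_delta(600000ULL));	/* 1 period, ~0.15ms left */
	return 0;
}

The point is only that sub-period time is no longer thrown away; whether the
residue should count as runnable/running is the open question raised in the
follow-ups.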

2016-04-12 12:01:19

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Optimize sum computation with a lookup table

On 11/04/16 16:21, Joe Perches wrote:
> On Mon, 2016-04-11 at 17:59 +0100, Dietmar Eggemann wrote:
> []
> > IMHO, it would be nice to add this to the existing tool from the patch
> > header of commit 5b51f2f80b3b
> > ("sched: Make __update_entity_runnable_avg() fast") simply because people
> > already use this one to tweak their pelt tables.
>
> Maybe create a file and put it in tools somewhere.
> Maybe tools/sched ?
>

+1. And I'd stick with the C program.

Best,

- Juri

2016-04-12 12:03:05

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue

On 10/04/16 23:36, Yuyang Du wrote:

[...]

> @@ -2704,11 +2694,14 @@ static __always_inline int
> __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> unsigned long weight, int running, struct cfs_rq *cfs_rq)
> {
> - u64 delta, scaled_delta, periods;
> - u32 contrib;
> - unsigned int delta_w, scaled_delta_w, decayed = 0;
> + u64 delta;
> + u32 contrib, periods;
> unsigned long scale_freq, scale_cpu;
>
> + /*
> + * now rolls down to a period boundary
> + */
> + now = now && (u64)(~0xFFFFF);

This forces now to be 1.

s/&&/&

2016-04-12 14:20:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Optimize sum computation with a lookup table

On Tue, Apr 12, 2016 at 03:17:12AM +0800, Yuyang Du wrote:
> On Mon, Apr 11, 2016 at 05:59:11PM +0100, Dietmar Eggemann wrote:
> > On 10/04/16 23:36, Yuyang Du wrote:
> > > __compute_runnable_contrib() uses a loop to compute sum, whereas a
> > > table loopup can do it faster in a constant time.
> > >
> > > The following python script can be used to generate the constants:
> > >
> > > print " #: yN_inv yN_sum"
> > > print "-----------------------"
> > > y = (0.5)**(1/32.0)
> > > x = 2**32
> > > xx = 1024
> > > for i in range(0, 32):
> > > if i == 0:
> > > x = x-1
> > > xx = xx*y
> > > else:
> > > x = x*y
> > > xx = int(xx*y + 1024*y)
> > > print "%2d: %#x %8d" % (i, int(x), int(xx))
> > >
> > > print " #: sum_N32"
> > > print "------------"
> > > xxx = xx
> > > for i in range(0, 11):
> > > if i == 0:
> > > xxx = xx
> > > else:
> > > xxx = xxx/2 + xx
> > > print "%2d: %8d" % (i, xxx)
> > >
> >
> > IMHO, it would be nice to add this to the existing tool from the patch
> > header of commit 5b51f2f80b3b
> > ("sched: Make __update_entity_runnable_avg() fast") simply because people
> > already use this one to tweak their pelt tables. Maybe something like
>
> I'd prefer not, and recommend switching from the C program for this
> kind of job. :)

I much prefer C because I don't speak snake or any of the other popular
languages -- mostly because I simply don't use them enough to remember
how they work.

Also, if we're going to edit that program, maybe change it such that at
the end it prints the numbers in a copy/paste-able C form, just for the
lazy amongst us :-)

2016-04-13 01:50:22

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Optimize sum computation with a lookup table

On Tue, Apr 12, 2016 at 11:14:13AM +0100, Juri Lelli wrote:
> On 12/04/16 03:12, Yuyang Du wrote:
> > On Mon, Apr 11, 2016 at 11:41:28AM +0100, Juri Lelli wrote:
> > > Hi,
> > >
> > > On 11/04/16 06:36, Yuyang Du wrote:
> > > > __compute_runnable_contrib() uses a loop to compute sum, whereas a
> > > > table loopup can do it faster in a constant time.
> > > >
> > > > The following python script can be used to generate the constants:
> > > >
> > > > print " #: yN_inv yN_sum"
> > > > print "-----------------------"
> > > > y = (0.5)**(1/32.0)
> > > > x = 2**32
> > > > xx = 1024
> > > > for i in range(0, 32):
> > > > if i == 0:
> > > > x = x-1
> > > > xx = xx*y
> > > > else:
> > > > x = x*y
> > > > xx = int(xx*y + 1024*y)
> > > > print "%2d: %#x %8d" % (i, int(x), int(xx))
> > > >
> > > > print " #: sum_N32"
> > > > print "------------"
> > > > xxx = xx
> > > > for i in range(0, 11):
> > > > if i == 0:
> > > > xxx = xx
> > > > else:
> > > > xxx = xxx/2 + xx
> > > > print "%2d: %8d" % (i, xxx)
> > > >
> > >
> > > Thanks for the script, really useful. Do you think there is value in
> > > making it general? Like if we want to play with/need changing LOAD_AVG_
> > > PERIOD in the future to something different than 32.
> >
> > i think a s/32/xx/ should work.
> >
> > > Also, does the following assume LOAD_AVG_PERIOD == 32? And if yes, do
> > > you think there is any value in removing that assumption?
> >
> > Like Peter said, we are heavily dependent on it already.
>
> But I think the current code should still work if we define LOAD_AVG_
> PERIOD as, say, 16 and we use Paul's program to recompute the tables.
>
> My point was about trying to keep everything related to LOAD_AVG_PERIOD
> and not start assuming it is 32. I'm not saying your changes assume
> that, I was asking if they do.

Oh, then my changes do not make more or less dependency. The entire avg thing
should only have two seeds (and all others depend on them):

(1) a period is 1024*1024ns
(2) a half-life is 32 periods

I'll check if there is anything hard-coded other than the two.

2016-04-13 01:54:28

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Optimize sum computation with a lookup table

On Tue, Apr 12, 2016 at 04:19:52PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 12, 2016 at 03:17:12AM +0800, Yuyang Du wrote:
> > On Mon, Apr 11, 2016 at 05:59:11PM +0100, Dietmar Eggemann wrote:
> > > On 10/04/16 23:36, Yuyang Du wrote:
> > > > __compute_runnable_contrib() uses a loop to compute sum, whereas a
> > > > table loopup can do it faster in a constant time.
> > > >
> > > > The following python script can be used to generate the constants:
> > > >
> > > > print " #: yN_inv yN_sum"
> > > > print "-----------------------"
> > > > y = (0.5)**(1/32.0)
> > > > x = 2**32
> > > > xx = 1024
> > > > for i in range(0, 32):
> > > > if i == 0:
> > > > x = x-1
> > > > xx = xx*y
> > > > else:
> > > > x = x*y
> > > > xx = int(xx*y + 1024*y)
> > > > print "%2d: %#x %8d" % (i, int(x), int(xx))
> > > >
> > > > print " #: sum_N32"
> > > > print "------------"
> > > > xxx = xx
> > > > for i in range(0, 11):
> > > > if i == 0:
> > > > xxx = xx
> > > > else:
> > > > xxx = xxx/2 + xx
> > > > print "%2d: %8d" % (i, xxx)
> > > >
> > >
> > > IMHO, it would be nice to add this to the existing tool from the patch
> > > header of commit 5b51f2f80b3b
> > > ("sched: Make __update_entity_runnable_avg() fast") simply because people
> > > already use this one to tweak their pelt tables. Maybe something like
> >
> > I'd prefer not, and recommend switching from the C program for this
> > kind of job. :)
>
> I much prefer C because I don't speak snake or any of the other popular
> languages -- mostly because I simply don't use them enough to remember
> how they work.
>
> Also, if we're going to edit that program, maybe change it such that at
> the end it prints the numbers in a copy/paste-able C form, just for the
> lazy amongst us :-)

Sure thing, :)

2016-04-13 03:57:17

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue

On Tue, Apr 12, 2016 at 01:02:58PM +0100, Dietmar Eggemann wrote:
> On 10/04/16 23:36, Yuyang Du wrote:
>
> [...]
>
> > @@ -2704,11 +2694,14 @@ static __always_inline int
> > __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > {
> > - u64 delta, scaled_delta, periods;
> > - u32 contrib;
> > - unsigned int delta_w, scaled_delta_w, decayed = 0;
> > + u64 delta;
> > + u32 contrib, periods;
> > unsigned long scale_freq, scale_cpu;
> >
> > + /*
> > + * now rolls down to a period boundary
> > + */
> > + now = now && (u64)(~0xFFFFF);
>
> This forces now to be 1.
>
> s/&&/&

Duh, :)

Thanks.

2016-04-13 04:05:05

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue

On Wed, 2016-04-13 at 04:14 +0800, Yuyang Du wrote:
> On Tue, Apr 12, 2016 at 01:02:58PM +0100, Dietmar Eggemann wrote:
> > On 10/04/16 23:36, Yuyang Du wrote:
> > [...]
> > > @@ -2704,11 +2694,14 @@ static __always_inline int
> > >  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > > 		   unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > >  {
> > > - u64 delta, scaled_delta, periods;
> > > - u32 contrib;
> > > - unsigned int delta_w, scaled_delta_w, decayed = 0;
> > > + u64 delta;
> > > + u32 contrib, periods;
> > > 	unsigned long scale_freq, scale_cpu;
> > > 
> > > +	/*
> > > +	 * now rolls down to a period boundary
> > > +	 */
> > > + now = now && (u64)(~0xFFFFF);
> > This forces now to be 1.
> >
> > s/&&/&
> Duh, :)

You could also avoid the cast:

now &= ~0xFFFFFull;

2016-04-13 04:51:28

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue

Hi Vincent,

On Tue, Apr 12, 2016 at 01:56:45PM +0200, Vincent Guittot wrote:
> Le Tuesday 12 Apr 2016 à 03:41:41 (+0800), Yuyang Du a écrit :
> > Hi Vincent,
> >
> > On Mon, Apr 11, 2016 at 11:08:04AM +0200, Vincent Guittot wrote:
> > > > @@ -2704,11 +2694,14 @@ static __always_inline int
> > > > __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > > > unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > > > {
> > > > - u64 delta, scaled_delta, periods;
> > > > - u32 contrib;
> > > > - unsigned int delta_w, scaled_delta_w, decayed = 0;
> > > > + u64 delta;
> > > > + u32 contrib, periods;
> > > > unsigned long scale_freq, scale_cpu;
> > > >
> > > > + /*
> > > > + * now rolls down to a period boundary
> > > > + */
> > > > + now = now && (u64)(~0xFFFFF);
> > > > delta = now - sa->last_update_time;
> > > > /*
> > > > * This should only happen when time goes backwards, which it
> > > > @@ -2720,89 +2713,56 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > > > }
> > > >
> > > > /*
> > > > - * Use 1024ns as the unit of measurement since it's a reasonable
> > > > - * approximation of 1us and fast to compute.
> > > > + * Use 1024*1024ns as an approximation of 1ms period, pretty close.
> > > > */
> > > > - delta >>= 10;
> > > > - if (!delta)
> > > > + periods = delta >> 20;
> > > > + if (!periods)
> > > > return 0;
> > > > sa->last_update_time = now;
> > >
> > > The optimization looks quite interesting but I see one potential issue
> > > with migration as we will lose the part of the ongoing period that is
> > > now not saved anymore. This lost part can be quite significant for a
> > > short task that ping pongs between CPUs.
> >
> > Yes, basically, it is we lose precision (~1ms scale in contrast with ~1us scale).
>
> But with a HZ set to 1000 and a sched-slice in the same range, having a precision
> of 1ms instead of 1us makes the precision of load tracking of short task quite
> random IMHO.
>
> you can keep recording this partial period without using it in the load tracking.
> Something like below keep precision without sacrifying the optimization.

The residue is accumulated and rolled over to the next update every time. But
what is its state: runnable/not-runnable, or running/not-running?

> ---
> kernel/sched/fair.c | 16 ++++++++++++----
> 1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 68273e8..b234169 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -674,6 +674,12 @@ void init_entity_runnable_average(struct sched_entity *se)
> struct sched_avg *sa = &se->avg;
>
> sa->last_update_time = 0;
> + /*
> + * sched_avg's period_contrib should be strictly less then 1024 * 1024, so
> + * we give it 1023 * 1024 to make sure it is almost a period (1024us), and
> + * will definitely be updated (after enqueue).
> + */
> + sa->period_contrib = 1023*1024;
> sa->load_avg = scale_load_down(se->load.weight);
> sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
> /*
> @@ -2698,10 +2704,6 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> u32 contrib, periods;
> unsigned long scale_freq, scale_cpu;
>
> - /*
> - * now rolls down to a period boundary
> - */
> - now = now && (u64)(~0xFFFFF);
> delta = now - sa->last_update_time;
> /*
> * This should only happen when time goes backwards, which it
> @@ -2712,6 +2714,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> return 0;
> }
>
> + /* Add how much left for the current period */
> + delta += sa->period_contrib;
> +
> /*
> * Use 1024*1024ns as an approximation of 1ms period, pretty close.
> */
> @@ -2720,6 +2725,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> return 0;
> sa->last_update_time = now;
>
> + /* Get how much left for the next period */
> + sa->period_contrib = delta & (u64)(0xFFFFF);
> +
> scale_freq = arch_scale_freq_capacity(NULL, cpu);
> scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
>
> > But as I wrote, we may either lose a sub-1ms, or gain a sub-1ms, statistically,
> > they should even out, given the load/util updates are quite a large number of
> > samples, and we do want a lot of samples for the metrics, this is the point of
> > the entire average thing. Plus, as you also said, the incomplete current period
> > also plays a (somewhat) negative role here.
> >
> > In addition, excluding the flat hierarchical util patch, we gain quite some
> > efficiency.

2016-04-13 08:37:56

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue

On Wed, Apr 13, 2016 at 04:14:48AM +0800, Yuyang Du wrote:
> On Tue, Apr 12, 2016 at 01:02:58PM +0100, Dietmar Eggemann wrote:
> > On 10/04/16 23:36, Yuyang Du wrote:
> >
> > [...]
> >
> > > @@ -2704,11 +2694,14 @@ static __always_inline int
> > > __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > > unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > > {
> > > - u64 delta, scaled_delta, periods;
> > > - u32 contrib;
> > > - unsigned int delta_w, scaled_delta_w, decayed = 0;
> > > + u64 delta;
> > > + u32 contrib, periods;
> > > unsigned long scale_freq, scale_cpu;
> > >
> > > + /*
> > > + * now rolls down to a period boundary
> > > + */
> > > + now = now && (u64)(~0xFFFFF);
> >
> > This forces now to be 1.
> >
> > s/&&/&
>
> Duh, :)

Have you, or anybody else, actually tested the impact of this patch?

2016-04-13 09:09:26

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/fair: Optimize sum computation with a lookup table

On 13/04/16 02:07, Yuyang Du wrote:
> On Tue, Apr 12, 2016 at 11:14:13AM +0100, Juri Lelli wrote:
> > On 12/04/16 03:12, Yuyang Du wrote:
> > > On Mon, Apr 11, 2016 at 11:41:28AM +0100, Juri Lelli wrote:
> > > > Hi,
> > > >
> > > > On 11/04/16 06:36, Yuyang Du wrote:
> > > > > __compute_runnable_contrib() uses a loop to compute sum, whereas a
> > > > > table loopup can do it faster in a constant time.
> > > > >
> > > > > The following python script can be used to generate the constants:
> > > > >
> > > > > print " #: yN_inv yN_sum"
> > > > > print "-----------------------"
> > > > > y = (0.5)**(1/32.0)
> > > > > x = 2**32
> > > > > xx = 1024
> > > > > for i in range(0, 32):
> > > > > if i == 0:
> > > > > x = x-1
> > > > > xx = xx*y
> > > > > else:
> > > > > x = x*y
> > > > > xx = int(xx*y + 1024*y)
> > > > > print "%2d: %#x %8d" % (i, int(x), int(xx))
> > > > >
> > > > > print " #: sum_N32"
> > > > > print "------------"
> > > > > xxx = xx
> > > > > for i in range(0, 11):
> > > > > if i == 0:
> > > > > xxx = xx
> > > > > else:
> > > > > xxx = xxx/2 + xx
> > > > > print "%2d: %8d" % (i, xxx)
> > > > >
> > > >
> > > > Thanks for the script, really useful. Do you think there is value in
> > > > making it general? Like if we want to play with/need changing LOAD_AVG_
> > > > PERIOD in the future to something different than 32.
> > >
> > > i think a s/32/xx/ should work.
> > >
> > > > Also, does the following assume LOAD_AVG_PERIOD == 32? And if yes, do
> > > > you think there is any value in removing that assumption?
> > >
> > > Like Peter said, we are heavily dependent on it already.
> >
> > But I think the current code should still work if we define LOAD_AVG_
> > PERIOD as, say, 16 and we use Paul's program to recompute the tables.
> >
> > My point was about trying to keep everything related to LOAD_AVG_PERIOD
> > and not start assuming it is 32. I'm not saying your changes assume
> > that, I was asking if they do.
>
> Oh, then my changes do not make more or less dependency. The entire avg thing
> should only have two seeds (and all others depend on them):
>
> (1) a period is 1024*1024ns

Which I think is fine.

> (2) a half-life is 32 periods
>

And this is fine as well, but only if we can actually write

(2) a half-life is LOAD_AVG_PERIOD periods

Best,

- Juri

> I'll check if there is anything hard-coded other than the two.
>

2016-04-13 11:11:55

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue

On 12 April 2016 at 23:09, Yuyang Du <[email protected]> wrote:
>
> Hi Vincent,
>
> On Tue, Apr 12, 2016 at 01:56:45PM +0200, Vincent Guittot wrote:
> > Le Tuesday 12 Apr 2016 à 03:41:41 (+0800), Yuyang Du a écrit :
> > > Hi Vincent,
> > >
> > > On Mon, Apr 11, 2016 at 11:08:04AM +0200, Vincent Guittot wrote:
> > > > > @@ -2704,11 +2694,14 @@ static __always_inline int
> > > > > __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > > > > unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > > > > {
> > > > > - u64 delta, scaled_delta, periods;
> > > > > - u32 contrib;
> > > > > - unsigned int delta_w, scaled_delta_w, decayed = 0;
> > > > > + u64 delta;
> > > > > + u32 contrib, periods;
> > > > > unsigned long scale_freq, scale_cpu;
> > > > >
> > > > > + /*
> > > > > + * now rolls down to a period boundary
> > > > > + */
> > > > > + now = now && (u64)(~0xFFFFF);
> > > > > delta = now - sa->last_update_time;
> > > > > /*
> > > > > * This should only happen when time goes backwards, which it
> > > > > @@ -2720,89 +2713,56 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > > > > }
> > > > >
> > > > > /*
> > > > > - * Use 1024ns as the unit of measurement since it's a reasonable
> > > > > - * approximation of 1us and fast to compute.
> > > > > + * Use 1024*1024ns as an approximation of 1ms period, pretty close.
> > > > > */
> > > > > - delta >>= 10;
> > > > > - if (!delta)
> > > > > + periods = delta >> 20;
> > > > > + if (!periods)
> > > > > return 0;
> > > > > sa->last_update_time = now;
> > > >
> > > > The optimization looks quite interesting but I see one potential issue
> > > > with migration as we will lose the part of the ongoing period that is
> > > > now not saved anymore. This lost part can be quite significant for a
> > > > short task that ping pongs between CPUs.
> > >
> > > Yes, basically, it is we lose precision (~1ms scale in contrast with ~1us scale).
> >
> > But with a HZ set to 1000 and a sched-slice in the same range, having a precision
> > of 1ms instead of 1us makes the precision of load tracking of short task quite
> > random IMHO.
> >
> > you can keep recording this partial period without using it in the load tracking.
> > Something like below keep precision without sacrifying the optimization.
>
> The residue is accumulated and rolled over to next update every time. But its
> state is runnable/not-runnable, or running/not-running?

yes, this needs to be sorted out

>
>
> > ---
> > kernel/sched/fair.c | 16 ++++++++++++----
> > 1 file changed, 12 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 68273e8..b234169 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -674,6 +674,12 @@ void init_entity_runnable_average(struct sched_entity *se)
> > struct sched_avg *sa = &se->avg;
> >
> > sa->last_update_time = 0;
> > + /*
> > + * sched_avg's period_contrib should be strictly less then 1024 * 1024, so
> > + * we give it 1023 * 1024 to make sure it is almost a period (1024us), and
> > + * will definitely be updated (after enqueue).
> > + */
> > + sa->period_contrib = 1023*1024;
> > sa->load_avg = scale_load_down(se->load.weight);
> > sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
> > /*
> > @@ -2698,10 +2704,6 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > u32 contrib, periods;
> > unsigned long scale_freq, scale_cpu;
> >
> > - /*
> > - * now rolls down to a period boundary
> > - */
> > - now = now && (u64)(~0xFFFFF);
> > delta = now - sa->last_update_time;
> > /*
> > * This should only happen when time goes backwards, which it
> > @@ -2712,6 +2714,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > return 0;
> > }
> >
> > + /* Add how much left for the current period */
> > + delta += sa->period_contrib;
> > +
> > /*
> > * Use 1024*1024ns as an approximation of 1ms period, pretty close.
> > */
> > @@ -2720,6 +2725,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > return 0;
> > sa->last_update_time = now;
> >
> > + /* Get how much left for the next period */
> > + sa->period_contrib = delta & (u64)(0xFFFFF);
> > +
> > scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> >
> > > But as I wrote, we may either lose a sub-1ms, or gain a sub-1ms, statistically,
> > > they should even out, given the load/util updates are quite a large number of
> > > samples, and we do want a lot of samples for the metrics, this is the point of
> > > the entire average thing. Plus, as you also said, the incomplete current period
> > > also plays a (somewhat) negative role here.
> > >
> > > In addition, excluding the flat hierarchical util patch, we gain quite some
> > > efficiency.

2016-04-13 11:27:45

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 4/4] sched/fair: Implement flat hierarchical structure for util_avg

On 11 April 2016 at 22:37, Yuyang Du <[email protected]> wrote:
> Hi Vincent,
>
> On Mon, Apr 11, 2016 at 02:29:25PM +0200, Vincent Guittot wrote:
>>
>> > update any group entity's util, so the net overhead should not be
>> > formidable. In addition, rq's util updates can be very frequent,
>> > but the updates in a period will be skipped, this is mostly effective
>> > in update_blocked_averages().
>>
>> Not sure to catch your point here but the rq->util of an idle CPU will
>> never be updated before a (idle) load balance as we call
>> update_cfs_rq_load_avg which doesn't update the cfs_rq->avg.util_avg
>> anymore nor rq->avg.util_avg.
>
> I meant in update_blocked_averages(), the rq's util won't be updated
> every time a cfs is updated if they are in the same period (~1ms), probabaly
> this is the case.
>
> The idle CPU thing is no difference, so it is anyway the responsibility of
> any other active CPU to take the idle CPU's idle time into account for any
> purpose.
>
>>
>> > + if (update_util)
>> > + sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
>> > +
>> > + if (!cfs_rq)
>> > + return 1;
>> > +
>> > + /*
>> > + * Update rq's util_sum and util_avg
>> > + */
>>
>> Do we really have to update the utilization of rq each time we update
>> a sched_entity ? IMHO, we don't need to do this update so often even
>> more if the se is a task_group. But even if it's a task, we don't
>> especially need to update it right now but we can wait for the update
>> of the rq->cfs like previously or we explicilty update it when we have
>> to do a direct sub/add on the util_avg value
>> See also my comment on removed_util_avg below
>>
>
> No, we only update a rq's util if the update is for cfs, not the sched_entity,
> which is the if (!cfs_rq) condition's job

ah yes, I skipped the ! when reading

>
>> Why not using the sched_avg of the rq->cfs in order to track the
>> utilization of the root cfs_rq instead of adding a new sched_avg into
>> the rq ? Then you call update_cfs_rq_load_avg(rq->cfs) when you want
>> to update/sync the utilization of the rq->cfs and for one call you
>> will update both the load_avg and the util_avg of the root cfs instead
>> of duplicating the sequence in _update_load_avg
>
> This is the approach taken by Dietmar in his patch, a fairly easy approach.
> The problem is though, if so, we update the root cfs_rq only when it is
> the root cfs_rq to update. A simple contrived case will make it never
> updated except in update_blocked_averages(). My impression is that this
> might be too much precision lost.
>
> And thus we take this alternative approach, and thus I revisited
> __update_load_avg() to optimize it.
>
> [snip]
>
>> > - if (atomic_long_read(&cfs_rq->removed_util_avg)) {
>> > - long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
>> > - sa->util_avg = max_t(long, sa->util_avg - r, 0);
>> > - sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
>> > + if (atomic_long_read(&rq->removed_util_avg)) {
>> > + long r = atomic_long_xchg(&rq->removed_util_avg, 0);
>> > + rq->avg.util_avg = max_t(long, rq->avg.util_avg - r, 0);
>> > + rq->avg.util_sum = max_t(s32, rq->avg.util_sum - r * LOAD_AVG_MAX, 0);
>>
>> I see one potential issue here because the rq->util_avg may (surely)
>> have been already updated and decayed during the update of a
>> sched_entity but before we substract the removed_util_avg
>
> This is the same now, because cfs_rq will be regularly updated in
> update_blocked_averages(), which basically means cfs_rq will be newer
> than task for sure, although task tries to catch up when removed.

I don't agree on that part. Currently, we check and subtract
removed_util_avg before calling __update_load_avg for a cfs_rq, so it
will be removed before changing last_update_time.
With your patch, we update rq->avg.util_avg and last_update_time
without checking removed_util_avg.

> Well, yes, in this patch with rq, the situation is more so. But,
> hopefully, this is a formidable concern.

2016-04-13 15:13:10

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue

On 10/04/16 23:36, Yuyang Du wrote:
> In __update_load_avg(), the current period is never complete. This
> basically leads to a slightly over-decayed average, say on average we
> have 50% current period, then we will lose 1.08%(=(1-0.5^(1/64)) of
> past avg. More importantly, the incomplete current period significantly
> complicates the avg computation, even a full period is only about 1ms.
>
> So we attempt to drop it. The outcome is that for any x.y periods to
> update, we will either lose the .y period or unduely gain (1-.y) period.
> How big is the impact? For a large x (say 20ms), you barely notice the
> difference, which is plus/minus 1% (=(before-after)/before). Moreover,
> the aggregated losses and gains in the long run should statistically
> even out.
>

For a periodic task, the signals really get much more unstable. Even for
a steady state (load/util related) periodic task there is a meander
pattern which depends on if we for instance hit a dequeue (decay +
accrue) or an enqueue (decay only) after the 1ms has elapsed.

IMHO, 1ms is too big to create signals describing task and cpu load/util
signals given the current scheduler dynamics. We simply see too many
signal driving points (e.g. enqueue/dequeue) bailing out of
__update_load_avg().

Examples of 1 periodic task pinned to a cpu on an ARM64 system, HZ=250
in steady state:

(1) task runtime = 100us period = 200us

pelt load/util signal

1us: 488-491

1ms: 483-534

We get ~2 dequeues (load/util example: 493->504) and ~2 enqueues
(load/util example: 496->483) in the meander pattern in the 1ms case.

(2) task runtime = 100us period = 1000us

pelt load/util signal

1us: 103-105

1ms: 84-145

We get ~3-4 dequeues (load/util example: 104->124->134->140) and ~16-20
enqueues (load/util example: 137->134->...->99->97) in the meander
pattern in the 1ms case.

[...]

2016-04-13 15:28:42

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue

On 13 April 2016 at 17:13, Dietmar Eggemann <[email protected]> wrote:
> On 10/04/16 23:36, Yuyang Du wrote:
>> In __update_load_avg(), the current period is never complete. This
>> basically leads to a slightly over-decayed average, say on average we
>> have 50% current period, then we will lose 1.08%(=(1-0.5^(1/64)) of
>> past avg. More importantly, the incomplete current period significantly
>> complicates the avg computation, even a full period is only about 1ms.
>>
>> So we attempt to drop it. The outcome is that for any x.y periods to
>> update, we will either lose the .y period or unduely gain (1-.y) period.
>> How big is the impact? For a large x (say 20ms), you barely notice the
>> difference, which is plus/minus 1% (=(before-after)/before). Moreover,
>> the aggregated losses and gains in the long run should statistically
>> even out.
>>
>
> For a periodic task, the signals really get much more unstable. Even for
> a steady state (load/util related) periodic task there is a meander
> pattern which depends on if we for instance hit a dequeue (decay +
> accrue) or an enqueue (decay only) after the 1ms has elapsed.
>
> IMHO, 1ms is too big to create signals describing task and cpu load/util
> signals given the current scheduler dynamics. We simply see too many
> signal driving points (e.g. enqueue/dequeue) bailing out of
> __update_load_avg().
>
> Examples of 1 periodic task pinned to a cpu on an ARM64 system, HZ=250
> in steady state:
>
> (1) task runtime = 100us period = 200us
>
> pelt load/util signal
>
> 1us: 488-491
>
> 1ms: 483-534
>
> We get ~2 dequeues (load/util example: 493->504) and ~2 enqueues
> (load/util example: 496->483) in the meander pattern in the 1ms case.
>
> (2) task runtime = 100us period = 1000us
>
> pelt load/util signal
>
> 1us: 103-105
>
> 1ms: 84-145
>
> We get ~3-4 dequeues (load/util example: 104->124->134->140) and ~16-20
> enqueues (load/util example: 137->134->...->99->97) in the meander
> pattern in the 1ms case.

yes, similarly I have some use cases with a 2ms running task in a period
of 5.12ms. It will be seen either as a 1ms running task or a 2ms
running task depending on how the running is synced with the 1ms
boundary

so the load will vary from 197-215 up to 396-423 depending on when
the 1ms boundary occurs within the 2ms of running

Vincent

>
> [...]

2016-04-13 16:21:19

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue

On 13 April 2016 at 17:28, Vincent Guittot <[email protected]> wrote:
> On 13 April 2016 at 17:13, Dietmar Eggemann <[email protected]> wrote:
>> On 10/04/16 23:36, Yuyang Du wrote:
>>> In __update_load_avg(), the current period is never complete. This
>>> basically leads to a slightly over-decayed average, say on average we
>>> have 50% current period, then we will lose 1.08%(=(1-0.5^(1/64)) of
>>> past avg. More importantly, the incomplete current period significantly
>>> complicates the avg computation, even a full period is only about 1ms.
>>>
>>> So we attempt to drop it. The outcome is that for any x.y periods to
>>> update, we will either lose the .y period or unduely gain (1-.y) period.
>>> How big is the impact? For a large x (say 20ms), you barely notice the
>>> difference, which is plus/minus 1% (=(before-after)/before). Moreover,
>>> the aggregated losses and gains in the long run should statistically
>>> even out.
>>>
>>
>> For a periodic task, the signals really get much more unstable. Even for
>> a steady state (load/util related) periodic task there is a meander
>> pattern which depends on if we for instance hit a dequeue (decay +
>> accrue) or an enqueue (decay only) after the 1ms has elapsed.
>>
>> IMHO, 1ms is too big to create signals describing task and cpu load/util
>> signals given the current scheduler dynamics. We simply see too many
>> signal driving points (e.g. enqueue/dequeue) bailing out of
>> __update_load_avg().
>>
>> Examples of 1 periodic task pinned to a cpu on an ARM64 system, HZ=250
>> in steady state:
>>
>> (1) task runtime = 100us period = 200us
>>
>> pelt load/util signal
>>
>> 1us: 488-491
>>
>> 1ms: 483-534
>>
>> We get ~2 dequeues (load/util example: 493->504) and ~2 enqueues
>> (load/util example: 496->483) in the meander pattern in the 1ms case.
>>
>> (2) task runtime = 100us period = 1000us
>>
>> pelt load/util signal
>>
>> 1us: 103-105
>>
>> 1ms: 84-145
>>
>> We get ~3-4 dequeues (load/util example: 104->124->134->140) and ~16-20
>> enqueues (load/util example: 137->134->...->99->97) in the meander
>> pattern in the 1ms case.
>
> yes, similarly i have some use cases with 2ms running task in a period
> of 5.12ms. it will be seen either as a 1ms running task or a 2ms

sorry, it's 5.242ms, not 5.12ms

> running tasks depending on how the running is synced with the 1ms
> boundary
>
> so the load will vary between 197-215 up to 396-423 depending of when
> the 1ms boundary occurs in the 2ms running
>
> Vincent
>
>>
>> [...]

2016-04-14 02:02:50

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH 4/4] sched/fair: Implement flat hierarchical structure for util_avg

Hi Vincent,

On Wed, Apr 13, 2016 at 01:27:21PM +0200, Vincent Guittot wrote:
> >> Why not using the sched_avg of the rq->cfs in order to track the
> >> utilization of the root cfs_rq instead of adding a new sched_avg into
> >> the rq ? Then you call update_cfs_rq_load_avg(rq->cfs) when you want
> >> to update/sync the utilization of the rq->cfs and for one call you
> >> will update both the load_avg and the util_avg of the root cfs instead
> >> of duplicating the sequence in _update_load_avg
> >
> > This is the approach taken by Dietmar in his patch, a fairly easy approach.
> > The problem is though, if so, we update the root cfs_rq only when it is
> > the root cfs_rq to update. A simple contrived case will make it never
> > updated except in update_blocked_averages(). My impression is that this
> > might be too much precision lost.
> >
> > And thus we take this alternative approach, and thus I revisited
> > __update_load_avg() to optimize it.
> >
> > [snip]
> >
> >> > - if (atomic_long_read(&cfs_rq->removed_util_avg)) {
> >> > - long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
> >> > - sa->util_avg = max_t(long, sa->util_avg - r, 0);
> >> > - sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
> >> > + if (atomic_long_read(&rq->removed_util_avg)) {
> >> > + long r = atomic_long_xchg(&rq->removed_util_avg, 0);
> >> > + rq->avg.util_avg = max_t(long, rq->avg.util_avg - r, 0);
> >> > + rq->avg.util_sum = max_t(s32, rq->avg.util_sum - r * LOAD_AVG_MAX, 0);
> >>
> >> I see one potential issue here because the rq->util_avg may (surely)
> >> have been already updated and decayed during the update of a
> >> sched_entity but before we substract the removed_util_avg
> >
> > This is the same now, because cfs_rq will be regularly updated in
> > update_blocked_averages(), which basically means cfs_rq will be newer
> > than task for sure, although task tries to catch up when removed.
>
> I don't agree on that part. At now, we check and substract
> removed_util_avg before calling __update_load_avg for a cfs_rq, so it
> will be removed before changing last_update_time.

Setting aside the cross-CPU issue, you are right.

> With your patch, we update rq->avg.util_avg and last_update_time
> without checking removed_util_avg.

But, yes, we do.

2016-04-14 02:26:39

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue

On Wed, Apr 13, 2016 at 05:28:18PM +0200, Vincent Guittot wrote:
> > For a periodic task, the signals really get much more unstable. Even for
> > a steady state (load/util related) periodic task there is a meander
> > pattern which depends on if we for instance hit a dequeue (decay +
> > accrue) or an enqueue (decay only) after the 1ms has elapsed.
> >
> > IMHO, 1ms is too big to create signals describing task and cpu load/util
> > signals given the current scheduler dynamics. We simply see too many
> > signal driving points (e.g. enqueue/dequeue) bailing out of
> > __update_load_avg().

By "bailing out", you mean return without update because the delta is less
than 1ms?

> > Examples of 1 periodic task pinned to a cpu on an ARM64 system, HZ=250
> > in steady state:
> >
> > (1) task runtime = 100us period = 200us
> >
> > pelt load/util signal
> >
> > 1us: 488-491
> >
> > 1ms: 483-534

100us/200us = 50%, so the util should center around 512. It seems that in this
regard it is better, but the variance is undesirable.

> > We get ~2 dequeues (load/util example: 493->504) and ~2 enqueues
> > (load/util example: 496->483) in the meander pattern in the 1ms case.
> >
> > (2) task runtime = 100us period = 1000us
> >
> > pelt load/util signal
> >
> > 1us: 103-105
> >
> > 1ms: 84-145
> >
> > We get ~3-4 dequeues (load/util example: 104->124->134->140) and ~16-20
> > enqueues (load/util example: 137->134->...->99->97) in the meander
> > pattern in the 1ms case.

The same as above.

>
> yes, similarly i have some use cases with 2ms running task in a period
> of 5.12ms. it will be seen either as a 1ms running task or a 2ms
> running tasks depending on how the running is synced with the 1ms
> boundary
>
> so the load will vary between 197-215 up to 396-423 depending of when
> the 1ms boundary occurs in the 2ms running
>

Same as above, and this time the util is "expected" to be 2/5.242*1024 = 391
across all the samples. We solve the problem of over-decay, but the precision
loss is a new problem.

Let me see if we can get to a 2-level period scheme, :)

2016-04-14 12:52:50

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue

On 13/04/16 19:44, Yuyang Du wrote:
> On Wed, Apr 13, 2016 at 05:28:18PM +0200, Vincent Guittot wrote:

[...]

> By "bailing out", you mean return without update because the delta is less
> than 1ms?

yes.

>
>>> Examples of 1 periodic task pinned to a cpu on an ARM64 system, HZ=250
>>> in steady state:
>>>
>>> (1) task runtime = 100us period = 200us
>>>
>>> pelt load/util signal
>>>
>>> 1us: 488-491
>>>
>>> 1ms: 483-534
>
> 100us/200us = 50%, so the util should center around 512. In this regard it
> seems better, but the variance is undesirable.

I see. You mentioned the over-decay thing in the patch header. Is this
also why you change the contribution of the most recent period from 1002
(1024*y) to 1024?

This variance gets worse if the ratio runtime/period is further reduced
(e.g. 25us/1000us).

You can even create a task which goes into stealth mode (e.g. 25us/1048us). It
shows periods of 0 load/util (~1.55s) and then massive spikes (~700 for
~300ms). The short runtime and the task period being synced to 1024*1024ns
mean that we hit consecutive enqueues or dequeues for a long time even
though the task might drift relative to the pelt window.
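
Back of the envelope (assuming the pelt window is really 1024*1024ns =
1048.576us): the 1048us task period drifts by only ~0.576us per iteration,
so sweeping across a whole window takes roughly 1048.576/0.576 ~= 1820 task
periods ~= 1.9s, which is about the length of the ~1.55s + ~300ms cycle
described above.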

[...]

2016-04-15 03:47:47

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue

Hi Dietmar,

On Thu, Apr 14, 2016 at 01:52:43PM +0100, Dietmar Eggemann wrote:
> On 13/04/16 19:44, Yuyang Du wrote:
> > On Wed, Apr 13, 2016 at 05:28:18PM +0200, Vincent Guittot wrote:
>
> [...]
>
> > By "bailing out", you mean return without update because the delta is less
> > than 1ms?
>
> yes.
>
> >
> >>> Examples of 1 periodic task pinned to a cpu on an ARM64 system, HZ=250
> >>> in steady state:
> >>>
> >>> (1) task runtime = 100us period = 200us
> >>>
> >>> pelt load/util signal
> >>>
> >>> 1us: 488-491
> >>>
> >>> 1ms: 483-534
> >
> > 100us/200us = 50%, so the util should center around 512. In this regard it
> > seems better, but the variance is undesirable.
>
> I see. You mentioned the over-decay thing in the patch header. Is this
> also why you change the contribution of the most recent period from 1002
> (1024*y) to 1024?

Yes, it is because that (most recent) period is the "current" period.
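
(That is, the just-completed period is treated as not yet decayed: it
contributes the full 1024 instead of 1024*y = 1024*0.5^(1/32) ~= 1002.)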

> This variance gets worse if the ratio runtime/period is further reduced
> (e.g. 25us/1000us).
>
> You can even create a task which goes into stealth mode (e.g. 25us/1048us).

Hmm... this is a good case to stress it with.

> It shows periods of 0 load/util (~1.55s) and then massive spikes (~700 for
> ~300ms). The short runtime and the task period being synced to 1024*1024ns
> mean that we hit consecutive enqueues or dequeues for a long time even
> though the task might drift relative to the pelt window.

But whenever we pass 1ms, we will update. And I am curious, how does the
current 1us scheme work in this case? Anyway, I will reproduce it myself.

2016-04-19 01:41:39

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue

Hi Dietmar,

On Fri, Apr 15, 2016 at 04:05:23AM +0800, Yuyang Du wrote:
> > It shows periods of 0 load/util (~1.55s) and then massive spikes (~700 for
> > ~300ms). The short runtime and the task period being synced to 1024*1024ns
> > mean that we hit consecutive enqueues or dequeues for a long time even
> > though the task might drift relative to the pelt window.
>
> But whenever we pass 1ms, we will update. And I am curious, how does the
> current 1us scheme work in this case? Anyway, I will reproduce it myself.

I did some experiments to compare.

First, the starting 0 Dietmar observed is because the rq's util_avg may have
already reached full util, so the new task's util_avg is initialized to 0. But
I never observed 0 for any long time (definitely not as long as 1s). The
following experiments did not include the fourth patch (the flat util
hierarchy implementation).

Second, the 1ms granular update loses much precision and leads to big variance.
See the attached figures, for a task that runs 100us out of every 200us. I
actually tried many more combinations, but the results are about the same
(disappointing). The experiments used a fixed workload (rt-app), a fixed CPU,
and a fixed frequency.

In addition, about the 2-level period scheme: the idea is to have a finer
period level than 1ms, such as 128us. To avoid generating more constants, we
can have a first-level period of 2ms and a second-level period of 128us, with
the half-life staying at 32ms. It is definitely doable, but the implementation
complicates __accumulated_sum(). Therefore, I simply dropped it.
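
(For reference: with the half-life kept at 32ms, a 2ms (2*1024us) first-level
period would decay by y = 0.5^(1/16) ~= 0.958 per period, and 128us is exactly
1/16 of such a period, i.e. 16 fine slots per first-level period.)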

So, let me get back to the 1us precision, and then move on to the flat
hierarchical util.

Thanks,
Yuyang


Attachments:
(No filename) (1.61 kB)
Opt_100us_200us.jpg (62.55 kB)
Master_100us_200us.jpg (45.83 kB)