2016-04-05 11:55:22

by Yuyang Du

[permalink] [raw]
Subject: [PATCH v3 0/6] sched/fair: Clean up sched metric definitions

Hi Peter,

Would you please give it a look?

This series cleans up the sched metrics. The changes include:
(1) Define SCHED_FIXEDPOINT_SHIFT for all fixed point arithmetic scaling.
(2) Get rid of the confusing scaling factors SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE,
and thus leave only NICE_0_LOAD (for load) and SCHED_CAPACITY_SCALE (for util).
(3) Consistently use SCHED_CAPACITY_SCALE for everything util-related.
(4) Add detailed introduction to the sched metrics.
(5) Get rid of unnecessary scaling up and down for load.
(6) Rename the mappings between priority (user) and load (kernel).
(7) Move inactive code.

The previous version is at: http://thread.gmane.org/gmane.linux.kernel/2187272

v3 changes:
(1) Rebase to current tip
(2) Changelog fix, thanks to Ben.

Thanks,
Yuyang

---

Yuyang Du (6):
sched/fair: Generalize the load/util averages resolution definition
sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE
sched/fair: Add introduction to the sched load avg metrics
sched/fair: Remove scale_load_down() for load_avg
sched/fair: Rename scale_load() and scale_load_down()
sched/fair: Move (inactive) option from code to config

include/linux/sched.h | 81 +++++++++++++++++++++++++++++++++++++++++++--------
init/Kconfig | 16 ++++++++++
kernel/sched/core.c | 8 ++---
kernel/sched/fair.c | 33 ++++++++++-----------
kernel/sched/sched.h | 52 +++++++++++++++------------------
5 files changed, 127 insertions(+), 63 deletions(-)

--
2.1.4


2016-04-05 11:55:26

by Yuyang Du

[permalink] [raw]
Subject: [PATCH v3 3/6] sched/fair: Add introduction to the sched load avg metrics

These sched metrics have become complex enough to warrant an introduction,
so document them at their definition.

Signed-off-by: Yuyang Du <[email protected]>
---
include/linux/sched.h | 60 +++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 49 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index bc652fe..b0a6cf0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1209,18 +1209,56 @@ struct load_weight {
};

/*
- * The load_avg/util_avg accumulates an infinite geometric series.
- * 1) load_avg factors frequency scaling into the amount of time that a
- * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
- * aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency and cpu capacity scaling into the amount of time
- * that a sched_entity is running on a CPU, in the range [0..SCHED_CAPACITY_SCALE].
- * For cfs_rq, it is the aggregated such times of all runnable and
+ * The load_avg/util_avg accumulates an infinite geometric series
+ * (see __update_load_avg() in kernel/sched/fair.c).
+ *
+ * [load_avg definition]
+ *
+ * load_avg = runnable% * scale_load_down(load)
+ *
+ * where runnable% is the time ratio that a sched_entity is runnable.
+ * For cfs_rq, it is the aggregated such load_avg of all runnable and
* blocked sched_entities.
- * The 64 bit load_sum can:
- * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with
- * the highest weight (=88761) always runnable, we should not overflow
- * 2) for entity, support any load.weight always runnable
+ *
+ * load_avg may also take frequency scaling into account:
+ *
+ * load_avg = runnable% * scale_load_down(load) * freq%
+ *
+ * where freq% is the CPU frequency normalized to the highest frequency.
+ *
+ * [util_avg definition]
+ *
+ * util_avg = running% * SCHED_CAPACITY_SCALE
+ *
+ * where running% is the time ratio that a sched_entity is running on
+ * a CPU. For cfs_rq, it is the aggregated such util_avg of all runnable
+ * and blocked sched_entities.
+ *
+ * util_avg may also factor in frequency scaling and CPU capacity scaling:
+ *
+ * util_avg = running% * SCHED_CAPACITY_SCALE * freq% * capacity%
+ *
+ * where freq% is the same as above, and capacity% is the CPU capacity
+ * normalized to the greatest capacity (due to uarch differences, etc).
+ *
+ * N.B., the above ratios (runnable%, running%, freq%, and capacity%)
+ * themselves are in the range of [0, 1]. To do fixed point arithmetic,
+ * we therefore scale them up to as large a range as necessary. This is
+ * for example reflected by util_avg's SCHED_CAPACITY_SCALE.
+ *
+ * [Overflow issue]
+ *
+ * The 64bit load_sum can have 4353082796 (=2^64/47742/88761) entities
+ * with the highest load (=88761) always runnable on a single cfs_rq, so
+ * we should not overflow, as that number already exceeds PID_MAX_LIMIT.
+ *
+ * For all other cases (including 32bit kernel), struct load_weight's
+ * weight will overflow before we do, because:
+ *
+ * Max(load_avg) <= Max(load.weight)
+ *
+ * Then, it is the load_weight's responsibility to consider overflow
+ * issues.
*/
struct sched_avg {
u64 last_update_time, load_sum;
--
2.1.4
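
To make the documented definitions concrete, here is a small standalone
userspace sketch (illustrative only, not part of the patch; the
runnable%/running% values are made-up inputs, and scale_load_down() is
omitted because it is the identity in the default configuration):

/* Toy evaluation of the load_avg/util_avg definitions documented above. */
#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1L << SCHED_FIXEDPOINT_SHIFT)

int main(void)
{
	long weight = 88761;	/* nice -20 task weight */
	long runnable_fp = 512;	/* runnable% = 50%, as a fraction of 1024 */
	long running_fp = 256;	/* running%  = 25%, as a fraction of 1024 */

	/* load_avg = runnable% * scale_load_down(load); scale_load_down() is identity here */
	long load_avg = (weight * runnable_fp) >> SCHED_FIXEDPOINT_SHIFT;

	/* util_avg = running% * SCHED_CAPACITY_SCALE */
	long util_avg = (SCHED_CAPACITY_SCALE * running_fp) >> SCHED_FIXEDPOINT_SHIFT;

	printf("load_avg = %ld, util_avg = %ld\n", load_avg, util_avg);
	return 0;
}

This prints load_avg = 44380 and util_avg = 256 for the example ratios.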

2016-04-05 11:55:34

by Yuyang Du

[permalink] [raw]
Subject: [PATCH v3 4/6] sched/fair: Remove scale_load_down() for load_avg

Currently, load_avg = scale_load_down(load) * runnable%. The extra scaling
down of load does not make much sense, because load_avg is primarily THE
load and on top of that, we take runnable time into account.

We therefore remove scale_load_down() for load_avg. But we need to
carefully consider the overflow risk if load has the higher range
(2*SCHED_FIXEDPOINT_SHIFT). The only case in which this change can
introduce an overflow is a 64bit kernel with the increased load range.
In that case, the 64bit load_sum can accommodate 4251057
(=2^64/47742/88761/1024) entities with the highest load (=88761*1024)
always runnable on one single cfs_rq, which should be fine. Even if
such an overflow ever occurs, under the conditions in which it occurs
the load average will not be useful anyway.

Signed-off-by: Yuyang Du <[email protected]>
[update calculate_imbalance]
Signed-off-by: Vincent Guittot <[email protected]>
---
include/linux/sched.h | 19 ++++++++++++++-----
kernel/sched/fair.c | 19 +++++++++----------
2 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b0a6cf0..8d2e8f7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1214,7 +1214,7 @@ struct load_weight {
*
* [load_avg definition]
*
- * load_avg = runnable% * scale_load_down(load)
+ * load_avg = runnable% * load
*
* where runnable% is the time ratio that a sched_entity is runnable.
* For cfs_rq, it is the aggregated such load_avg of all runnable and
@@ -1222,7 +1222,7 @@ struct load_weight {
*
* load_avg may also take frequency scaling into account:
*
- * load_avg = runnable% * scale_load_down(load) * freq%
+ * load_avg = runnable% * load * freq%
*
* where freq% is the CPU frequency normalized to the highest frequency.
*
@@ -1248,9 +1248,18 @@ struct load_weight {
*
* [Overflow issue]
*
- * The 64bit load_sum can have 4353082796 (=2^64/47742/88761) entities
- * with the highest load (=88761) always runnable on a single cfs_rq, so
- * we should not overflow, as that number already exceeds PID_MAX_LIMIT.
+ * On a 64bit kernel:
+ *
+ * When load has the small fixed point range (SCHED_FIXEDPOINT_SHIFT),
+ * the 64bit load_sum can have 4353082796 (=2^64/47742/88761) tasks with
+ * the highest load (=88761) always runnable on a cfs_rq, so we should
+ * not overflow, as that number already exceeds PID_MAX_LIMIT.
+ *
+ * When load has the large fixed point range (2*SCHED_FIXEDPOINT_SHIFT),
+ * the 64bit load_sum can have 4251057 (=2^64/47742/88761/1024) tasks
+ * with the highest load (=88761*1024) always runnable on ONE cfs_rq,
+ * which should still be fine. Even if that overflow ever occurs, the
+ * load_avg will not be useful anyway in that situation.
*
* For all other cases (including 32bit kernel), struct load_weight's
* weight will overflow before we do, because:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aca96d7..2be613b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -680,7 +680,7 @@ void init_entity_runnable_average(struct sched_entity *se)
* will definitely be update (after enqueue).
*/
sa->period_contrib = 1023;
- sa->load_avg = scale_load_down(se->load.weight);
+ sa->load_avg = se->load.weight;
sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
/*
* At this point, util_avg won't be used in select_task_rq_fair anyway
@@ -2890,7 +2890,7 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
}

decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
- scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);
+ cfs_rq->load.weight, cfs_rq->curr != NULL, cfs_rq);

#ifndef CONFIG_64BIT
smp_wmb();
@@ -2912,8 +2912,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
* Track task load average for carrying it to new CPU after migrated, and
* track group sched_entity load average for task_h_load calc in migration
*/
- __update_load_avg(now, cpu, &se->avg,
- se->on_rq * scale_load_down(se->load.weight),
+ __update_load_avg(now, cpu, &se->avg, se->on_rq * se->load.weight,
cfs_rq->curr == se, NULL);

if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
@@ -2973,7 +2972,7 @@ skip_aging:
static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
- &se->avg, se->on_rq * scale_load_down(se->load.weight),
+ &se->avg, se->on_rq * se->load.weight,
cfs_rq->curr == se, NULL);

cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
@@ -2993,7 +2992,7 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
migrated = !sa->last_update_time;
if (!migrated) {
__update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
- se->on_rq * scale_load_down(se->load.weight),
+ se->on_rq * se->load.weight,
cfs_rq->curr == se, NULL);
}

@@ -6953,10 +6952,10 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
*/
if (busiest->group_type == group_overloaded &&
local->group_type == group_overloaded) {
- load_above_capacity = busiest->sum_nr_running *
- scale_load_down(NICE_0_LOAD);
- if (load_above_capacity > busiest->group_capacity)
- load_above_capacity -= busiest->group_capacity;
+ load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD;
+ if (load_above_capacity > scale_load(busiest->group_capacity))
+ load_above_capacity -=
+ scale_load(busiest->group_capacity);
else
load_above_capacity = ~0UL;
}
--
2.1.4
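
A quick standalone check of the entity-count bounds quoted in the
changelog (illustrative only; it uses LOAD_AVG_MAX = 47742 and the
nice -20 weight 88761, the same constants the changelog divides by):

/* Reproduce the entity-count bounds quoted in the changelog. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t max_sum = UINT64_MAX;		/* ~2^64 */
	uint64_t load_avg_max = 47742;		/* max of the decayed series */
	uint64_t max_weight = 88761;		/* nice -20 weight */

	/* Small load range: each entity contributes up to 47742 * 88761. */
	printf("small range: %llu entities\n",
	       (unsigned long long)(max_sum / load_avg_max / max_weight));

	/* Large load range: weights carry an extra factor of 1024. */
	printf("large range: %llu entities\n",
	       (unsigned long long)(max_sum / load_avg_max / (max_weight * 1024)));
	return 0;
}

It prints 4353082796 and 4251057, matching the numbers above.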

2016-04-05 11:55:31

by Yuyang Du

[permalink] [raw]
Subject: [PATCH v3 5/6] sched/fair: Rename scale_load() and scale_load_down()

Rename scale_load() and scale_load_down() to user_to_kernel_load()
and kernel_to_user_load() respectively, so that the names convey
what the conversions are really about.

Signed-off-by: Yuyang Du <[email protected]>
[update calculate_imbalance]
Signed-off-by: Vincent Guittot <[email protected]>

Signed-off-by: Yuyang Du <[email protected]>
---
kernel/sched/core.c | 8 ++++----
kernel/sched/fair.c | 14 ++++++++------
kernel/sched/sched.h | 16 ++++++++--------
3 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1159423..21c08c6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -695,12 +695,12 @@ static void set_load_weight(struct task_struct *p)
* SCHED_IDLE tasks get minimal weight:
*/
if (idle_policy(p->policy)) {
- load->weight = scale_load(WEIGHT_IDLEPRIO);
+ load->weight = user_to_kernel_load(WEIGHT_IDLEPRIO);
load->inv_weight = WMULT_IDLEPRIO;
return;
}

- load->weight = scale_load(sched_prio_to_weight[prio]);
+ load->weight = user_to_kernel_load(sched_prio_to_weight[prio]);
load->inv_weight = sched_prio_to_wmult[prio];
}

@@ -8180,7 +8180,7 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 shareval)
{
- return sched_group_set_shares(css_tg(css), scale_load(shareval));
+ return sched_group_set_shares(css_tg(css), user_to_kernel_load(shareval));
}

static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
@@ -8188,7 +8188,7 @@ static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
{
struct task_group *tg = css_tg(css);

- return (u64) scale_load_down(tg->shares);
+ return (u64) kernel_to_user_load(tg->shares);
}

#ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2be613b..0b6659d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -189,7 +189,7 @@ static void __update_inv_weight(struct load_weight *lw)
if (likely(lw->inv_weight))
return;

- w = scale_load_down(lw->weight);
+ w = kernel_to_user_load(lw->weight);

if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
lw->inv_weight = 1;
@@ -213,7 +213,7 @@ static void __update_inv_weight(struct load_weight *lw)
*/
static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
- u64 fact = scale_load_down(weight);
+ u64 fact = kernel_to_user_load(weight);
int shift = WMULT_SHIFT;

__update_inv_weight(lw);
@@ -6952,10 +6952,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
*/
if (busiest->group_type == group_overloaded &&
local->group_type == group_overloaded) {
+ unsigned long min_cpu_load =
+ kernel_to_user_load(NICE_0_LOAD) * busiest->group_capacity;
load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD;
- if (load_above_capacity > scale_load(busiest->group_capacity))
- load_above_capacity -=
- scale_load(busiest->group_capacity);
+ if (load_above_capacity > min_cpu_load)
+ load_above_capacity -= min_cpu_load;
else
load_above_capacity = ~0UL;
}
@@ -8510,7 +8511,8 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
if (!tg->se[0])
return -EINVAL;

- shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
+ shares = clamp(shares, user_to_kernel_load(MIN_SHARES),
+ user_to_kernel_load(MAX_SHARES));

mutex_lock(&shares_mutex);
if (tg->shares == shares)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6aafe6c..b00e6e5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -55,22 +55,22 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
*/
#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load */
# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
-# define scale_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT)
-# define scale_load_down(w) ((w) >> SCHED_FIXEDPOINT_SHIFT)
+# define user_to_kernel_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT)
+# define kernel_to_user_load(w) ((w) >> SCHED_FIXEDPOINT_SHIFT)
#else
# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT)
-# define scale_load(w) (w)
-# define scale_load_down(w) (w)
+# define user_to_kernel_load(w) (w)
+# define kernel_to_user_load(w) (w)
#endif

/*
* Task weight (visible to user) and its load (invisible to user) have
* independent resolution, but they should be well calibrated. We use
- * scale_load() and scale_load_down(w) to convert between them. The
- * following must be true:
- *
- * scale_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ * user_to_kernel_load() and kernel_to_user_load(w) to convert between
+ * them. The following must be true:
*
+ * user_to_kernel_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ * kernel_to_user_load(NICE_0_LOAD) == sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]
*/
#define NICE_0_LOAD (1L << NICE_0_LOAD_SHIFT)

--
2.1.4
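
The calibration invariant in the comment can be checked with a standalone
sketch like the one below (illustrative only; INCREASED_LOAD_RANGE is a
stand-in compile-time switch for the currently disabled 64-bit option, not
a real kernel symbol, and 1024 is the sched_prio_to_weight entry for nice 0):

/* Check user_to_kernel_load(nice-0 weight) == NICE_0_LOAD for both ranges. */
#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT	10

#ifdef INCREASED_LOAD_RANGE	/* stand-in for the disabled 64-bit option */
# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
# define user_to_kernel_load(w)	((w) << SCHED_FIXEDPOINT_SHIFT)
# define kernel_to_user_load(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
#else
# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
# define user_to_kernel_load(w)	(w)
# define kernel_to_user_load(w)	(w)
#endif

#define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)

int main(void)
{
	long nice_0_weight = 1024;	/* sched_prio_to_weight[20] */

	printf("user->kernel: %ld (expect NICE_0_LOAD = %ld)\n",
	       (long)user_to_kernel_load(nice_0_weight), (long)NICE_0_LOAD);
	printf("kernel->user: %ld (expect %ld)\n",
	       (long)kernel_to_user_load(NICE_0_LOAD), nice_0_weight);
	return 0;
}

Built with -DINCREASED_LOAD_RANGE the two sides are 1048576 and 1024;
without it, both are 1024 in each direction.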

2016-04-05 11:55:40

by Yuyang Du

[permalink] [raw]
Subject: [PATCH v3 6/6] sched/fair: Move (inactive) option from code to config

The option of increased load resolution (fixed point arithmetic range) is
unconditionally deactivated with #if 0. But since it may still be used
somewhere (e.g., in Google), we want to keep this option.

Regardless, there should be a way to express this option. Given the
current circumstances, the compromise is to define a config option,
CONFIG_CFS_INCREASE_LOAD_RANGE, which depends on FAIR_GROUP_SCHED and
64BIT and BROKEN.

Suggested-by: Ingo Molnar <[email protected]>
Signed-off-by: Yuyang Du <[email protected]>
---
init/Kconfig | 16 +++++++++++++++
kernel/sched/sched.h | 55 +++++++++++++++++++++-------------------------------
2 files changed, 38 insertions(+), 33 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 0dfd09d..ad75ff7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1026,6 +1026,22 @@ config CFS_BANDWIDTH
restriction.
See tip/Documentation/scheduler/sched-bwc.txt for more information.

+config CFS_INCREASE_LOAD_RANGE
+ bool "Increase kernel load range"
+ depends on 64BIT && BROKEN
+ default n
+ help
+ Increase resolution of nice-level calculations for 64-bit architectures.
+ The extra resolution improves shares distribution and load balancing of
+ low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
+ hierarchies, especially on larger systems. This is not a user-visible change
+ and does not change the user-interface for setting shares/weights.
+ We increase resolution only if we have enough bits to allow this increased
+ resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
+ when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
+ increased costs.
+ Currently broken: it increases power usage under light load.
+
config RT_GROUP_SCHED
bool "Group scheduling for SCHED_RR/FIFO"
depends on CGROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b00e6e5..aafb3e7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -42,39 +42,6 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
#define NS_TO_JIFFIES(TIME) ((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))

/*
- * Increase resolution of nice-level calculations for 64-bit architectures.
- * The extra resolution improves shares distribution and load balancing of
- * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
- * hierarchies, especially on larger systems. This is not a user-visible change
- * and does not change the user-interface for setting shares/weights.
- *
- * We increase resolution only if we have enough bits to allow this increased
- * resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
- * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
- * increased costs.
- */
-#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load */
-# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
-# define user_to_kernel_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT)
-# define kernel_to_user_load(w) ((w) >> SCHED_FIXEDPOINT_SHIFT)
-#else
-# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT)
-# define user_to_kernel_load(w) (w)
-# define kernel_to_user_load(w) (w)
-#endif
-
-/*
- * Task weight (visible to user) and its load (invisible to user) have
- * independent resolution, but they should be well calibrated. We use
- * user_to_kernel_load() and kernel_to_user_load(w) to convert between
- * them. The following must be true:
- *
- * user_to_kernel_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
- * kernel_to_user_load(NICE_0_LOAD) == sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]
- */
-#define NICE_0_LOAD (1L << NICE_0_LOAD_SHIFT)
-
-/*
* Single value that decides SCHED_DEADLINE internal math precision.
* 10 -> just above 1us
* 9 -> just above 0.5us
@@ -1150,6 +1117,28 @@ extern const int sched_prio_to_weight[40];
extern const u32 sched_prio_to_wmult[40];

/*
+ * Task weight (visible to user) and its load (invisible to user) have
+ * independent ranges, but they should be well calibrated. We use
+ * user_to_kernel_load() and kernel_to_user_load(w) to convert between
+ * them.
+ *
+ * The following must also be true:
+ * user_to_kernel_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ * kernel_to_user_load(NICE_0_LOAD) == sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]
+ */
+#ifdef CONFIG_CFS_INCREASE_LOAD_RANGE
+#define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
+#define user_to_kernel_load(w) (w << SCHED_FIXEDPOINT_SHIFT)
+#define kernel_to_user_load(w) (w >> SCHED_FIXEDPOINT_SHIFT)
+#else
+#define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT)
+#define user_to_kernel_load(w) (w)
+#define kernel_to_user_load(w) (w)
+#endif
+
+#define NICE_0_LOAD (1UL << NICE_0_LOAD_SHIFT)
+
+/*
* {de,en}queue flags:
*
* DEQUEUE_SLEEP - task is no longer runnable
--
2.1.4

2016-04-05 11:56:44

by Yuyang Du

[permalink] [raw]
Subject: [PATCH v3 2/6] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE

After cleaning up the sched metrics, these two ambiguous definitions are
not needed any more. Use NICE_0_LOAD_SHIFT and NICE_0_LOAD instead
(their names clearly state what they are).

Suggested-by: Ben Segall <[email protected]>
Signed-off-by: Yuyang Du <[email protected]>
---
kernel/sched/fair.c | 4 ++--
kernel/sched/sched.h | 22 +++++++++++-----------
2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 88ab334..aca96d7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -719,7 +719,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
struct sched_avg *sa = &se->avg;
- long cap = (long)(scale_load_down(SCHED_LOAD_SCALE) - cfs_rq->avg.util_avg) / 2;
+ long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;

if (cap > 0) {
if (cfs_rq->avg.util_avg != 0) {
@@ -6954,7 +6954,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
if (busiest->group_type == group_overloaded &&
local->group_type == group_overloaded) {
load_above_capacity = busiest->sum_nr_running *
- SCHED_LOAD_SCALE;
+ scale_load_down(NICE_0_LOAD);
if (load_above_capacity > busiest->group_capacity)
load_above_capacity -= busiest->group_capacity;
else
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 03ce2de..6aafe6c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -54,25 +54,25 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
* increased costs.
*/
#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load */
-# define SCHED_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
+# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
# define scale_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT)
# define scale_load_down(w) ((w) >> SCHED_FIXEDPOINT_SHIFT)
#else
-# define SCHED_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT)
+# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT)
# define scale_load(w) (w)
# define scale_load_down(w) (w)
#endif

-#define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT)
-
/*
- * NICE_0's weight (visible to user) and its load (invisible to user) have
- * independent ranges, but they should be well calibrated. We use scale_load()
- * and scale_load_down(w) to convert between them, the following must be true:
- * scale_load(sched_prio_to_weight[20]) == NICE_0_LOAD
+ * Task weight (visible to user) and its load (invisible to user) have
+ * independent resolution, but they should be well calibrated. We use
+ * scale_load() and scale_load_down(w) to convert between them. The
+ * following must be true:
+ *
+ * scale_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ *
*/
-#define NICE_0_LOAD SCHED_LOAD_SCALE
-#define NICE_0_SHIFT SCHED_LOAD_SHIFT
+#define NICE_0_LOAD (1L << NICE_0_LOAD_SHIFT)

/*
* Single value that decides SCHED_DEADLINE internal math precision.
@@ -859,7 +859,7 @@ DECLARE_PER_CPU(struct sched_domain *, sd_asym);
struct sched_group_capacity {
atomic_t ref;
/*
- * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
+ * CPU capacity of this group, SCHED_CAPACITY_SCALE being max capacity
* for a single CPU.
*/
unsigned int capacity;
--
2.1.4

2016-04-05 11:55:20

by Yuyang Du

[permalink] [raw]
Subject: [PATCH v3 1/6] sched/fair: Generalize the load/util averages resolution definition

Integer metrics need fixed point arithmetic. In sched/fair, a few
metrics, e.g., weight, load, load_avg, util_avg, freq, and capacity,
may have different fixed point ranges, which makes their update and
usage error-prone.

In order to avoid errors relating to the fixed point range, we define
a basic fixed point range, and then express all metrics in terms of
that basic range.

The basic range is 1024, or (1 << 10). Further, the basic range can be
applied recursively to obtain a larger range.

As pointed out by Ben Segall, weight (visible to the user; e.g., NICE-0
has weight 1024) and load (e.g., NICE_0_LOAD) have independent ranges,
but they must be well calibrated.

Signed-off-by: Yuyang Du <[email protected]>
---
include/linux/sched.h | 16 +++++++++++++---
kernel/sched/fair.c | 4 ----
kernel/sched/sched.h | 15 ++++++++++-----
3 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 45e848c..bc652fe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -935,9 +935,19 @@ enum cpu_idle_type {
};

/*
+ * Integer metrics need fixed point arithmetic, e.g., sched/fair
+ * has a few: load, load_avg, util_avg, freq, and capacity.
+ *
+ * We define a basic fixed point arithmetic range, and then formalize
+ * all these metrics based on that basic range.
+ */
+# define SCHED_FIXEDPOINT_SHIFT 10
+# define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT)
+
+/*
* Increase resolution of cpu_capacity calculations
*/
-#define SCHED_CAPACITY_SHIFT 10
+#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)

/*
@@ -1203,8 +1213,8 @@ struct load_weight {
* 1) load_avg factors frequency scaling into the amount of time that a
* sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
* aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency and cpu scaling into the amount of time
- * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
+ * 2) util_avg factors frequency and cpu capacity scaling into the amount of time
+ * that a sched_entity is running on a CPU, in the range [0..SCHED_CAPACITY_SCALE].
* For cfs_rq, it is the aggregated such times of all runnable and
* blocked sched_entities.
* The 64 bit load_sum can:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b8cc1c3..88ab334 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2662,10 +2662,6 @@ static u32 __compute_runnable_contrib(u64 n)
return contrib + runnable_avg_yN_sum[n];
}

-#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
-#error "load tracking assumes 2^10 as unit"
-#endif
-
#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a7cbad7..03ce2de 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -54,18 +54,23 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
* increased costs.
*/
#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load */
-# define SCHED_LOAD_RESOLUTION 10
-# define scale_load(w) ((w) << SCHED_LOAD_RESOLUTION)
-# define scale_load_down(w) ((w) >> SCHED_LOAD_RESOLUTION)
+# define SCHED_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
+# define scale_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT)
+# define scale_load_down(w) ((w) >> SCHED_FIXEDPOINT_SHIFT)
#else
-# define SCHED_LOAD_RESOLUTION 0
+# define SCHED_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT)
# define scale_load(w) (w)
# define scale_load_down(w) (w)
#endif

-#define SCHED_LOAD_SHIFT (10 + SCHED_LOAD_RESOLUTION)
#define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT)

+/*
+ * NICE_0's weight (visible to user) and its load (invisible to user) have
+ * independent ranges, but they should be well calibrated. We use scale_load()
+ * and scale_load_down(w) to convert between them, the following must be true:
+ * scale_load(sched_prio_to_weight[20]) == NICE_0_LOAD
+ */
#define NICE_0_LOAD SCHED_LOAD_SCALE
#define NICE_0_SHIFT SCHED_LOAD_SHIFT

--
2.1.4
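
For reference, a minimal standalone sketch of the 2^10 fixed-point
convention this patch introduces (the macro names mirror the kernel's,
but the demo and its values are illustrative only):

/* Userspace demo of the 2^10 fixed-point convention. */
#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT	10
#define SCHED_FIXEDPOINT_SCALE	(1L << SCHED_FIXEDPOINT_SHIFT)

int main(void)
{
	/* A ratio such as 75% is stored as 0.75 * 1024 = 768. */
	long runnable_fp = 3 * SCHED_FIXEDPOINT_SCALE / 4;
	long load = 1024;	/* NICE-0 weight */

	/* Multiplying two fixed-point values needs one shift back down. */
	long scaled = (load * runnable_fp) >> SCHED_FIXEDPOINT_SHIFT;

	printf("scale = %ld, 75%% of %ld = %ld\n",
	       SCHED_FIXEDPOINT_SCALE, load, scaled);
	return 0;
}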

2016-04-28 10:25:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] sched/fair: Remove scale_load_down() for load_avg

On Tue, Apr 05, 2016 at 12:12:29PM +0800, Yuyang Du wrote:
> Currently, load_avg = scale_load_down(load) * runnable%. The extra scaling
> down of load does not make much sense, because load_avg is primarily THE
> load and on top of that, we take runnable time into account.
>
> We therefore remove scale_load_down() for load_avg. But we need to
> carefully consider the overflow risk if load has higher range
> (2*SCHED_FIXEDPOINT_SHIFT). The only case an overflow may occur due
> to us is on 64bit kernel with increased load range. In that case,
> the 64bit load_sum can afford 4251057 (=2^64/47742/88761/1024)
> entities with the highest load (=88761*1024) always runnable on one
> single cfs_rq, which may be an issue, but should be fine. Even if this
> occurs at the end of day, on the condition where it occurs, the
> load average will not be useful anyway.

I do feel we need a few more words on the actual ramifications of
overflowing here.

Yes, having 4m tasks on a single runqueue will be somewhat unlikely, but
if it happens, then what will the user experience? How long (if ever)
does it take for numbers to correct themselves etc..

> Signed-off-by: Yuyang Du <[email protected]>
> [update calculate_imbalance]
> Signed-off-by: Vincent Guittot <[email protected]>

This SoB chain suggests you wrote it and Vincent sent it on, yet this
email is from you and Vincent isn't anywhere. Something's not right.

2016-04-28 09:19:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] sched/fair: Rename scale_load() and scale_load_down()

On Tue, Apr 05, 2016 at 12:12:30PM +0800, Yuyang Du wrote:
> Rename scale_load() and scale_load_down() to user_to_kernel_load()
> and kernel_to_user_load() respectively, to allow the names to bear
> what they are really about.

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -189,7 +189,7 @@ static void __update_inv_weight(struct load_weight *lw)
> if (likely(lw->inv_weight))
> return;
>
> - w = scale_load_down(lw->weight);
> + w = kernel_to_user_load(lw->weight);
>
> if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
> lw->inv_weight = 1;
> @@ -213,7 +213,7 @@ static void __update_inv_weight(struct load_weight *lw)
> */
> static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
> {
> - u64 fact = scale_load_down(weight);
> + u64 fact = kernel_to_user_load(weight);
> int shift = WMULT_SHIFT;
>
> __update_inv_weight(lw);
> @@ -6952,10 +6952,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> */
> if (busiest->group_type == group_overloaded &&
> local->group_type == group_overloaded) {
> + unsigned long min_cpu_load =
> + kernel_to_user_load(NICE_0_LOAD) * busiest->group_capacity;
> load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD;
> - if (load_above_capacity > scale_load(busiest->group_capacity))
> - load_above_capacity -=
> - scale_load(busiest->group_capacity);
> + if (load_above_capacity > min_cpu_load)
> + load_above_capacity -= min_cpu_load;
> else
> load_above_capacity = ~0UL;
> }

Except these 3 really are not about user/kernel visible fixed point
ranges _at_all_... :/

2016-04-28 09:45:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] sched/fair: Move (inactive) option from code to config


* Peter Zijlstra <[email protected]> wrote:

> On Tue, Apr 05, 2016 at 12:12:31PM +0800, Yuyang Du wrote:
> > The option of increased load resolution (fixed point arithmetic range) is
> > unconditionally deactivated with #if 0. But since it may still be used
> > somewhere (e.g., in Google), we want to keep this option.
> >
> > Regardless, there should be a way to express this option. Considering the
> > current circumstances, the reconciliation is we define a config
> > CONFIG_CFS_INCREASE_LOAD_RANGE and it depends on FAIR_GROUP_SCHED and
> > 64BIT and BROKEN.
> >
> > Suggested-by: Ingo Molnar <[email protected]>
>
> So I'm very tempted to simply, unconditionally, reinstate this larger
> range for everything CONFIG_64BIT && CONFIG_FAIR_GROUP_SCHED.
>
> There was but the single claim on increased power usage, nobody could
> reproduce / analyze and Google has been running with this for years now.
>
> Furthermore, it seems to be leading to the obvious problems on bigger
> machines where we basically run out of precision by the sheer number of
> cpus (nr_cpus ~ SCHED_LOAD_SCALE and stuff comes apart quickly).

Agreed.

Thanks,

Ingo

2016-04-28 09:37:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] sched/fair: Move (inactive) option from code to config

On Tue, Apr 05, 2016 at 12:12:31PM +0800, Yuyang Du wrote:
> The option of increased load resolution (fixed point arithmetic range) is
> unconditionally deactivated with #if 0. But since it may still be used
> somewhere (e.g., in Google), we want to keep this option.
>
> Regardless, there should be a way to express this option. Considering the
> current circumstances, the reconciliation is we define a config
> CONFIG_CFS_INCREASE_LOAD_RANGE and it depends on FAIR_GROUP_SCHED and
> 64BIT and BROKEN.
>
> Suggested-by: Ingo Molnar <[email protected]>

So I'm very tempted to simply, unconditionally, reinstate this larger
range for everything CONFIG_64BIT && CONFIG_FAIR_GROUP_SCHED.

There was but the single claim on increased power usage, nobody could
reproduce / analyze and Google has been running with this for years now.

Furthermore, it seems to be leading to the obvious problems on bigger
machines where we basically run out of precision by the sheer number of
cpus (nr_cpus ~ SCHED_LOAD_SCALE and stuff comes apart quickly).


2016-04-28 10:43:11

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] sched/fair: Remove scale_load_down() for load_avg

On Thu, Apr 28, 2016 at 12:25:32PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 12:12:29PM +0800, Yuyang Du wrote:
> > Currently, load_avg = scale_load_down(load) * runnable%. The extra scaling
> > down of load does not make much sense, because load_avg is primarily THE
> > load and on top of that, we take runnable time into account.
> >
> > We therefore remove scale_load_down() for load_avg. But we need to
> > carefully consider the overflow risk if load has higher range
> > (2*SCHED_FIXEDPOINT_SHIFT). The only case an overflow may occur due
> > to us is on 64bit kernel with increased load range. In that case,
> > the 64bit load_sum can afford 4251057 (=2^64/47742/88761/1024)
> > entities with the highest load (=88761*1024) always runnable on one
> > single cfs_rq, which may be an issue, but should be fine. Even if this
> > occurs at the end of day, on the condition where it occurs, the
> > load average will not be useful anyway.
>
> I do feel we need a few more words on the actual ramifications of
> overflowing here.
>
> Yes, having 4m tasks on a single runqueue will be somewhat unlikely, but
> if it happens, then what will the user experience? How long (if ever)
> does it take for numbers to correct themselves etc..
>
> > Signed-off-by: Yuyang Du <[email protected]>
> > [update calculate_imbalance]
> > Signed-off-by: Vincent Guittot <[email protected]>
>
> This SoB chain suggests you wrote it and Vincent sent it on, yet this
> email is from you and Vincent isn't anywhere. Something's not right.

Since you started to review patches, I just sent you more. :) What a coincidence.

I actually don't know the rules for this SoB; let me learn how to do this
co-signed-off properly.

2016-04-28 11:18:07

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] sched/fair: Rename scale_load() and scale_load_down()

On Thursday, 28 Apr 2016 at 11:19:19 (+0200), Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 12:12:30PM +0800, Yuyang Du wrote:
> > Rename scale_load() and scale_load_down() to user_to_kernel_load()
> > and kernel_to_user_load() respectively, to allow the names to bear
> > what they are really about.
>
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -189,7 +189,7 @@ static void __update_inv_weight(struct load_weight *lw)
> > if (likely(lw->inv_weight))
> > return;
> >
> > - w = scale_load_down(lw->weight);
> > + w = kernel_to_user_load(lw->weight);
> >
> > if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
> > lw->inv_weight = 1;
> > @@ -213,7 +213,7 @@ static void __update_inv_weight(struct load_weight *lw)
> > */
> > static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
> > {
> > - u64 fact = scale_load_down(weight);
> > + u64 fact = kernel_to_user_load(weight);
> > int shift = WMULT_SHIFT;
> >
> > __update_inv_weight(lw);
> > @@ -6952,10 +6952,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > */
> > if (busiest->group_type == group_overloaded &&
> > local->group_type == group_overloaded) {
> > + unsigned long min_cpu_load =
> > + kernel_to_user_load(NICE_0_LOAD) * busiest->group_capacity;
> > load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD;
> > - if (load_above_capacity > scale_load(busiest->group_capacity))
> > - load_above_capacity -=
> > - scale_load(busiest->group_capacity);
> > + if (load_above_capacity > min_cpu_load)
> > + load_above_capacity -= min_cpu_load;
> > else
> > load_above_capacity = ~0UL;
> > }
>
> Except these 3 really are not about user/kernel visible fixed point
> ranges _at_all_... :/

While trying to optimize the calculation of min_cpu_load, I have broken everything.

It should be:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b6659d..3411eb7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6953,7 +6953,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
if (busiest->group_type == group_overloaded &&
local->group_type == group_overloaded) {
unsigned long min_cpu_load =
- kernel_to_user_load(NICE_0_LOAD) * busiest->group_capacity;
+ busiest->group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE;
load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD;
if (load_above_capacity > min_cpu_load)
load_above_capacity -= min_cpu_load;


>
>
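
To illustrate the unit mismatch this fix addresses (a standalone sketch
with made-up numbers): group_capacity is in SCHED_CAPACITY_SCALE units
while load is in NICE_0_LOAD units, so the product has to be scaled back
down to load units:

/* Unit check for min_cpu_load (illustrative numbers only). */
#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024UL
#define NICE_0_LOAD		1024UL	/* default (non-increased) load range */

int main(void)
{
	/* e.g. a 4-CPU group at full capacity */
	unsigned long group_capacity = 4 * SCHED_CAPACITY_SCALE;

	/*
	 * capacity [SCHED_CAPACITY_SCALE units] * NICE_0_LOAD, scaled back
	 * to load units: one NICE_0_LOAD of load per CPU worth of capacity.
	 */
	unsigned long min_cpu_load =
		group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE;

	printf("min_cpu_load = %lu (= %lu * NICE_0_LOAD)\n",
	       min_cpu_load, min_cpu_load / NICE_0_LOAD);
	return 0;
}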

2016-04-29 03:11:35

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] sched/fair: Remove scale_load_down() for load_avg

On Thu, Apr 28, 2016 at 12:25:32PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 12:12:29PM +0800, Yuyang Du wrote:
> > Currently, load_avg = scale_load_down(load) * runnable%. The extra scaling
> > down of load does not make much sense, because load_avg is primarily THE
> > load and on top of that, we take runnable time into account.
> >
> > We therefore remove scale_load_down() for load_avg. But we need to
> > carefully consider the overflow risk if load has higher range
> > (2*SCHED_FIXEDPOINT_SHIFT). The only case an overflow may occur due
> > to us is on 64bit kernel with increased load range. In that case,
> > the 64bit load_sum can afford 4251057 (=2^64/47742/88761/1024)
> > entities with the highest load (=88761*1024) always runnable on one
> > single cfs_rq, which may be an issue, but should be fine. Even if this
> > occurs at the end of day, on the condition where it occurs, the
> > load average will not be useful anyway.
>
> I do feel we need a few more words on the actual ramifications of
> overflowing here.
>
> Yes, having 4m tasks on a single runqueue will be somewhat unlikely, but
> if it happens, then what will the user experience? How long (if ever)
> does it take for numbers to correct themselves etc..

Well, regarding the user experience, that would need a stress test study.

But if the system can miraculously survive, and we end up in the scenario
where we have a ~0ULL load_sum and the rq suddenly drops to 0 load, it
would take roughly 2 seconds (=32ms*64) to converge. That time is the upper bound.
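
A back-of-envelope check of that bound (a standalone sketch, assuming the
~32ms half-life of the decaying sum; this is not the kernel's actual decay
code):

/* How long until a saturated 64-bit sum decays to zero at a 32ms half-life? */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t sum = UINT64_MAX;	/* worst case: ~0ULL load_sum */
	unsigned int halvings = 0;

	while (sum) {
		sum >>= 1;		/* one 32ms half-life */
		halvings++;
	}

	printf("%u halvings, i.e. about %u ms\n", halvings, halvings * 32);
	return 0;
}

It prints 64 halvings, i.e. about 2048 ms, which is where the 32ms*64
figure comes from.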

2016-04-29 04:12:04

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] sched/fair: Rename scale_load() and scale_load_down()

On Thu, Apr 28, 2016 at 11:19:19AM +0200, Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 12:12:30PM +0800, Yuyang Du wrote:
> > Rename scale_load() and scale_load_down() to user_to_kernel_load()
> > and kernel_to_user_load() respectively, to allow the names to bear
> > what they are really about.
>
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -189,7 +189,7 @@ static void __update_inv_weight(struct load_weight *lw)
> > if (likely(lw->inv_weight))
> > return;
> >
> > - w = scale_load_down(lw->weight);
> > + w = kernel_to_user_load(lw->weight);
> >
> > if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
> > lw->inv_weight = 1;
> > @@ -213,7 +213,7 @@ static void __update_inv_weight(struct load_weight *lw)
> > */
> > static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
> > {
> > - u64 fact = scale_load_down(weight);
> > + u64 fact = kernel_to_user_load(weight);
> > int shift = WMULT_SHIFT;
> >
> > __update_inv_weight(lw);

[snip]

> Except these 3 really are not about user/kernel visible fixed point
> ranges _at_all_... :/

But are the above two falling back to the user fixed point precision,
the reason being that we can't efficiently do this multiply/divide
with the increased fixed point range for kernel load?

2016-04-29 04:16:00

by Yuyang Du

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] sched/fair: Move (inactive) option from code to config

On Thu, Apr 28, 2016 at 11:37:33AM +0200, Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 12:12:31PM +0800, Yuyang Du wrote:
> > The option of increased load resolution (fixed point arithmetic range) is
> > unconditionally deactivated with #if 0. But since it may still be used
> > somewhere (e.g., in Google), we want to keep this option.
> >
> > Regardless, there should be a way to express this option. Considering the
> > current circumstances, the reconciliation is we define a config
> > CONFIG_CFS_INCREASE_LOAD_RANGE and it depends on FAIR_GROUP_SCHED and
> > 64BIT and BROKEN.
> >
> > Suggested-by: Ingo Molnar <[email protected]>
>
> So I'm very tempted to simply, unconditionally, reinstate this larger
> range for everything CONFIG_64BIT && CONFIG_FAIR_GROUP_SCHED.
>
> There was but the single claim on increased power usage, nobody could
> reproduce / analyze and Google has been running with this for years now.
>
> Furthermore, it seems to be leading to the obvious problems on bigger
> machines where we basically run out of precision by the sheer number of
> cpus (nr_cpus ~ SCHED_LOAD_SCALE and stuff comes apart quickly).

Great.