2011-04-20 20:52:20

by Nikhil Rao

Subject: [RFC][PATCH 00/18] Increase resolution of load weights

Hi All,

I have attached an early version of an RFC patchset to increase the resolution
of sched entity load weights. This RFC introduces SCHED_LOAD_RESOLUTION, which
scales NICE_0_LOAD by a factor of 1024. The scaling is done internally and
should be completely invisible to the user.

Why do we need this?
This extra resolution allows us to scale in two dimensions: the number of cpus
and the depth of task-group hierarchies. It also allows for proper load
balancing of low-weight task groups (e.g., nice +19 tasks in an autogroup).

One of the big roadblocks for increasing resolution is the use of unsigned long
for load.weight, which on 32-bit architectures overflows once roughly 48
max-weight sched entities are runnable on a runqueue. In this RFC we convert all
uses of load.weight to u64. This is still a work in progress, and I have listed
some of the issues I am still investigating below.
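
To make the 32-bit limit concrete, here is a rough back-of-the-envelope
calculation (illustrative only; it assumes the nice -20 weight of 88761 from
prio_to_weight and the 1024x scaling described above):

    max scaled entity weight   = 88761 << 10    ~= 9.1e7  (~2^26.4)
    ULONG_MAX on 32-bit        = 4294967295     (~2^32)
    4294967295 / (88761 << 10) ~= 47

so on the order of 48 max-weight entities on a single runqueue are enough to
wrap an unsigned long rq->load.weight, hence the conversion to u64.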

I would like to get some feedback on the direction of this patchset. Please let
me know if there are alternative ways of doing this, and I'll be happy to
explore them as well.

The patchset applies cleanly to v2.6.39-rc4. It compiles for i386 and boots on
x86_64. Beyond the basic checks, it has not been well tested yet.

Major TODOs:
- Detect overflow in update shares calculations (time * load), and set load_avg
  to the maximum possible value (~0ULL); a rough sketch of one possible check
  follows below.
- tg->task_weight uses an atomic which needs to be updated to 64-bit on 32-bit
  machines. Might need to add a lock to protect this instead of atomic ops.
- Check wake-affine math and effective load calculations for overflows.
- Needs more testing and need to ensure fairness/balancing is not broken.
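
For the first TODO item, the check could look something like the sketch below
(purely illustrative, not part of this series; the helper name is made up):

	/*
	 * Saturate instead of wrapping when accumulating delta * load
	 * into a u64 running sum (e.g. cfs_rq->load_avg).
	 */
	static inline u64 saturating_mul_add(u64 sum, u64 delta, u64 load)
	{
		u64 prod = delta * load;

		/* did the multiplication wrap around 2^64? */
		if (load && div64_u64(prod, load) != delta)
			return ~0ULL;

		/* did the addition wrap? */
		if (sum + prod < sum)
			return ~0ULL;

		return sum + prod;
	}

update_cfs_load() could then accumulate through such a helper instead of doing
the bare "load_avg += delta * load".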

-Thanks,
Nikhil

Nikhil Rao (18):
sched: introduce SCHED_POWER_SCALE to scale cpu_power calculations
sched: increase SCHED_LOAD_SCALE resolution
sched: use u64 for load_weight fields
sched: update cpu_load to be u64
sched: update this_cpu_load() to return u64 value
sched: update source_load(), target_load() and weighted_cpuload() to
use u64
sched: update find_idlest_cpu() to use u64 for load
sched: update find_idlest_group() to use u64
sched: update division in cpu_avg_load_per_task to use div_u64
sched: update wake_affine path to use u64, s64 for weights
sched: update update_sg_lb_stats() to use u64
sched: Update update_sd_lb_stats() to use u64
sched: update f_b_g() to use u64 for weights
sched: change type of imbalance to be u64
sched: update h_load to use u64
sched: update move_task() and helper functions to use u64 for weights
sched: update f_b_q() to use u64 for weighted cpuload
sched: update shares distribution to use u64

drivers/cpuidle/governors/menu.c | 5 +-
include/linux/sched.h | 22 +++--
kernel/sched.c | 61 ++++++-----
kernel/sched_debug.c | 10 +-
kernel/sched_fair.c | 218 ++++++++++++++++++++------------------
kernel/sched_stats.h | 2 +-
6 files changed, 167 insertions(+), 151 deletions(-)

--
1.7.3.1


2011-04-20 20:51:58

by Nikhil Rao

Subject: [RFC][Patch 03/18] sched: use u64 for load_weight fields

This patch converts the load_weight fields to use u64 instead of unsigned long.
This is effectively a no-op on 64-bit, where unsigned long is already 64 bits
wide. On 32-bit architectures, the wider type is required to ensure the rq load
weight does not overflow in the presence of multiple large-weight entities. Also
increase MAX_SHARES to 2^28 (from 2^18).
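
For reference, the new ceiling is just the old one shifted up by the extra
resolution: 2^28 = 2^18 << SCHED_LOAD_RESOLUTION (with SCHED_LOAD_RESOLUTION
set to 10 in the previous patch), so the shares range visible through the
cgroup interface is unchanged; only the internal representation grows.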

Signed-off-by: Nikhil Rao <[email protected]>
---
include/linux/sched.h | 2 +-
kernel/sched_debug.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2c3bab..6d88be1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1104,7 +1104,7 @@ struct sched_class {
};

struct load_weight {
- unsigned long weight, inv_weight;
+ u64 weight, inv_weight;
};

#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 7bacd83..d22b666 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -201,7 +201,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SEQ_printf(m, " .%-30s: %d\n", "nr_spread_over",
cfs_rq->nr_spread_over);
SEQ_printf(m, " .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
- SEQ_printf(m, " .%-30s: %ld\n", "load", cfs_rq->load.weight);
+ SEQ_printf(m, " .%-30s: %lld\n", "load", cfs_rq->load.weight);
#ifdef CONFIG_FAIR_GROUP_SCHED
#ifdef CONFIG_SMP
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "load_avg",
@@ -264,7 +264,7 @@ static void print_cpu(struct seq_file *m, int cpu)
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(rq->x))

P(nr_running);
- SEQ_printf(m, " .%-30s: %lu\n", "load",
+ SEQ_printf(m, " .%-30s: %llu\n", "load",
rq->load.weight);
P(nr_switches);
P(nr_load_updates);
--
1.7.3.1

2011-04-20 20:52:02

by Nikhil Rao

Subject: [RFC][Patch 07/18] sched: update find_idlest_cpu() to use u64 for load

find_idlest_cpu() compares load across runqueues, and that load is now u64.

Signed-off-by: Nikhil Rao <[email protected]>
---
kernel/sched_fair.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 8d8d11c..0b36548 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1578,7 +1578,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
static int
find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
{
- unsigned long load, min_load = ULONG_MAX;
+ u64 load, min_load = ULLONG_MAX;
int idlest = -1;
int i;

--
1.7.3.1

2011-04-20 20:52:18

by Nikhil Rao

Subject: [RFC][Patch 18/18] sched: update shares distribution to use u64

Update the shares distribution code to use u64. We still maintain tg->shares as
an unsigned long since sched entity weights can't exceed MAX_SHARES (2^28). This
patch updates all the calculations required to estimate shares to use u64.
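
For context, the quantity computed by calc_cfs_shares() below is unchanged by
this patch and only moves to wider types; written out informally it is roughly:

    shares(cpu) = tg->shares * load(cfs_rq on cpu)
                  / (tg load sum - this cfs_rq's old contribution
                     + load(cfs_rq on cpu))

clamped to the range [MIN_SHARES, tg->shares], which is why the final result
still fits in a long.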

Signed-off-by: Nikhil Rao <[email protected]>
---
kernel/sched.c | 2 +-
kernel/sched_debug.c | 6 +++---
kernel/sched_fair.c | 13 ++++++++-----
3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 7c1f3fc..a9e85a0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -367,7 +367,7 @@ struct cfs_rq {
u64 load_period;
u64 load_stamp, load_last, load_unacc_exec_time;

- unsigned long load_contribution;
+ u64 load_contribution;
#endif
#endif
};
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index d22b666..b809651 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -204,11 +204,11 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SEQ_printf(m, " .%-30s: %lld\n", "load", cfs_rq->load.weight);
#ifdef CONFIG_FAIR_GROUP_SCHED
#ifdef CONFIG_SMP
- SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "load_avg",
+ SEQ_printf(m, " .%-30s: %lld.%06ld\n", "load_avg",
SPLIT_NS(cfs_rq->load_avg));
- SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "load_period",
+ SEQ_printf(m, " .%-30s: %lld.%06ld\n", "load_period",
SPLIT_NS(cfs_rq->load_period));
- SEQ_printf(m, " .%-30s: %ld\n", "load_contrib",
+ SEQ_printf(m, " .%-30s: %lld\n", "load_contrib",
cfs_rq->load_contribution);
SEQ_printf(m, " .%-30s: %d\n", "load_tg",
atomic_read(&cfs_rq->tg->load_weight));
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 33c36f1..6808f26 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -708,12 +708,13 @@ static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
int global_update)
{
struct task_group *tg = cfs_rq->tg;
- long load_avg;
+ s64 load_avg;

load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
load_avg -= cfs_rq->load_contribution;

if (global_update || abs(load_avg) > cfs_rq->load_contribution / 8) {
+ /* TODO: fix atomics for 64-bit additions */
atomic_add(load_avg, &tg->load_weight);
cfs_rq->load_contribution += load_avg;
}
@@ -723,7 +724,7 @@ static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
{
u64 period = sysctl_sched_shares_window;
u64 now, delta;
- unsigned long load = cfs_rq->load.weight;
+ u64 load = cfs_rq->load.weight;

if (cfs_rq->tg == &root_task_group)
return;
@@ -745,6 +746,7 @@ static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
if (load) {
cfs_rq->load_last = now;
cfs_rq->load_avg += delta * load;
+ /* TODO: detect overflow and fix */
}

/* consider updating load contribution on each fold or truncate */
@@ -769,24 +771,25 @@ static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)

static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
{
- long load_weight, load, shares;
+ s64 load_weight, load, shares;

load = cfs_rq->load.weight;

+ /* TODO: fixup atomics to handle u64 in 32-bit */
load_weight = atomic_read(&tg->load_weight);
load_weight += load;
load_weight -= cfs_rq->load_contribution;

shares = (tg->shares * load);
if (load_weight)
- shares /= load_weight;
+ shares = div64_u64(shares, load_weight);

if (shares < MIN_SHARES)
shares = MIN_SHARES;
if (shares > tg->shares)
shares = tg->shares;

- return shares;
+ return (long)shares;
}

static void update_entity_shares_tick(struct cfs_rq *cfs_rq)
--
1.7.3.1

2011-04-20 20:52:34

by Nikhil Rao

Subject: [RFC][Patch 11/18] sched: update update_sg_lb_stats() to use u64

Update variable types and 64-bit math in update_sg_lb_stats() to handle u64
weights.

Signed-off-by: Nikhil Rao <[email protected]>
---
kernel/sched_fair.c | 22 +++++++++++++---------
1 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 2b030f4..d5b1276 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2345,14 +2345,14 @@ struct sd_lb_stats {
* sg_lb_stats - stats of a sched_group required for load_balancing
*/
struct sg_lb_stats {
- unsigned long avg_load; /*Avg load across the CPUs of the group */
- unsigned long group_load; /* Total load over the CPUs of the group */
+ u64 avg_load; /* Avg load across the CPUs of the group */
+ u64 group_load; /* Total load over the CPUs of the group */
unsigned long sum_nr_running; /* Nr tasks running in the group */
- unsigned long sum_weighted_load; /* Weighted load of group's tasks */
+ u64 sum_weighted_load; /* Weighted load of group's tasks */
unsigned long group_capacity;
unsigned long idle_cpus;
unsigned long group_weight;
- int group_imb; /* Is there an imbalance in the group ? */
+ int group_imb; /* Is there an imbalance in the group ? */
int group_has_capacity; /* Is there extra capacity in the group? */
};

@@ -2679,7 +2679,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
int local_group, const struct cpumask *cpus,
int *balance, struct sg_lb_stats *sgs)
{
- unsigned long load, max_cpu_load, min_cpu_load, max_nr_running;
+ u64 load, max_cpu_load, min_cpu_load;
+ unsigned long max_nr_running;
int i;
unsigned int balance_cpu = -1, first_idle_cpu = 0;
unsigned long avg_load_per_task = 0;
@@ -2689,7 +2690,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,

/* Tally up the load of all CPUs in the group */
max_cpu_load = 0;
- min_cpu_load = ~0UL;
+ min_cpu_load = ~0ULL;
max_nr_running = 0;

for_each_cpu_and(i, sched_group_cpus(group), cpus) {
@@ -2735,7 +2736,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
}

/* Adjust by relative CPU power of the group */
- sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->cpu_power;
+ sgs->avg_load = div_u64(sgs->group_load * SCHED_POWER_SCALE,
+ group->cpu_power);

/*
* Consider the group unbalanced when the imbalance is larger
@@ -2747,9 +2749,11 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
* the hierarchy?
*/
if (sgs->sum_nr_running)
- avg_load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
+ avg_load_per_task = div_u64(sgs->sum_weighted_load,
+ sgs->sum_nr_running);

- if ((max_cpu_load - min_cpu_load) >= avg_load_per_task && max_nr_running > 1)
+ if ((max_cpu_load - min_cpu_load) >= avg_load_per_task &&
+ max_nr_running > 1)
sgs->group_imb = 1;

sgs->group_capacity = DIV_ROUND_CLOSEST(group->cpu_power,
--
1.7.3.1

2011-04-20 20:52:38

by Nikhil Rao

Subject: [RFC][Patch 02/18] sched: increase SCHED_LOAD_SCALE resolution

Introduce SCHED_LOAD_RESOLUTION, which is added to SCHED_LOAD_SHIFT and thereby
increases the resolution of SCHED_LOAD_SCALE. This patch sets the value of
SCHED_LOAD_RESOLUTION to 10, scaling up the weights of all sched entities by a
factor of 1024. With this extra resolution, we can handle deeper cgroup
hierarchies, and the scheduler can do better shares distribution and load
balancing on larger systems (especially for low-weight task groups).

This does not change the existing user interface; the scaled weights are only
used internally. We do not modify the prio_to_weight values or their inverses,
but use the original (unscaled) weights when calculating the inverse that is
used to scale the execution time delta in calc_delta_mine(). This ensures we do
not lose accuracy when accounting time to the sched entities.
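
To illustrate where the scaling boundary sits, the shifts added below are
equivalent to a pair of helpers like these (hypothetical names, not part of
this patch, which open-codes the shifts):

	/* user-visible weight -> internal high-resolution weight */
	static inline unsigned long scale_load(unsigned long w)
	{
		/* e.g. the nice-0 weight: 1024 -> 1024 << 10 = 1048576 */
		return w << SCHED_LOAD_RESOLUTION;
	}

	/* internal high-resolution weight -> user-visible weight */
	static inline unsigned long scale_load_down(unsigned long w)
	{
		return w >> SCHED_LOAD_RESOLUTION;
	}

set_load_weight() and cpu_shares_write_u64() below correspond to scale_load(),
while cpu_shares_read_u64() and the inv_weight computation in calc_delta_mine()
correspond to scale_load_down().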

Signed-off-by: Nikhil Rao <[email protected]>
---
include/linux/sched.h | 3 ++-
kernel/sched.c | 18 ++++++++++--------
2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8d1ff2b..d2c3bab 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -794,7 +794,8 @@ enum cpu_idle_type {
/*
* Increase resolution of nice-level calculations:
*/
-#define SCHED_LOAD_SHIFT 10
+#define SCHED_LOAD_RESOLUTION 10
+#define SCHED_LOAD_SHIFT (10 + SCHED_LOAD_RESOLUTION)
#define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT)

/*
diff --git a/kernel/sched.c b/kernel/sched.c
index 50f97cc..bfee8ff 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -293,7 +293,7 @@ static DEFINE_SPINLOCK(task_group_lock);
* limitation from this.)
*/
#define MIN_SHARES 2
-#define MAX_SHARES (1UL << 18)
+#define MAX_SHARES (1UL << 28)

static int root_task_group_load = ROOT_TASK_GROUP_LOAD;
#endif
@@ -1308,11 +1308,11 @@ calc_delta_mine(unsigned long delta_exec, unsigned long weight,
u64 tmp;

if (!lw->inv_weight) {
- if (BITS_PER_LONG > 32 && unlikely(lw->weight >= WMULT_CONST))
+ unsigned long w = lw->weight >> SCHED_LOAD_RESOLUTION;
+ if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
lw->inv_weight = 1;
else
- lw->inv_weight = 1 + (WMULT_CONST-lw->weight/2)
- / (lw->weight+1);
+ lw->inv_weight = 1 + (WMULT_CONST - w/2) / (w + 1);
}

tmp = (u64)delta_exec * weight;
@@ -1759,12 +1759,13 @@ static void set_load_weight(struct task_struct *p)
* SCHED_IDLE tasks get minimal weight:
*/
if (p->policy == SCHED_IDLE) {
- p->se.load.weight = WEIGHT_IDLEPRIO;
+ p->se.load.weight = WEIGHT_IDLEPRIO << SCHED_LOAD_RESOLUTION;
p->se.load.inv_weight = WMULT_IDLEPRIO;
return;
}

- p->se.load.weight = prio_to_weight[p->static_prio - MAX_RT_PRIO];
+ p->se.load.weight = prio_to_weight[p->static_prio - MAX_RT_PRIO]
+ << SCHED_LOAD_RESOLUTION;
p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];
}

@@ -9130,14 +9131,15 @@ cpu_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
static int cpu_shares_write_u64(struct cgroup *cgrp, struct cftype *cftype,
u64 shareval)
{
- return sched_group_set_shares(cgroup_tg(cgrp), shareval);
+ return sched_group_set_shares(cgroup_tg(cgrp),
+ shareval << SCHED_LOAD_RESOLUTION);
}

static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
{
struct task_group *tg = cgroup_tg(cgrp);

- return (u64) tg->shares;
+ return (u64) tg->shares >> SCHED_LOAD_RESOLUTION;
}
#endif /* CONFIG_FAIR_GROUP_SCHED */

--
1.7.3.1

2011-04-20 20:52:36

by Nikhil Rao

Subject: [RFC][Patch 16/18] sched: update move_task() and helper functions to use u64 for weights

This patch updates move_task() and helper functions to use u64 to handle load
weight related calculations.

Signed-off-by: Nikhil Rao <[email protected]>
---
kernel/sched_fair.c | 41 +++++++++++++++++++----------------------
1 files changed, 19 insertions(+), 22 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index ab2d1c9..386d832 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2099,14 +2099,14 @@ move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
return 0;
}

-static unsigned long
+static u64
balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
- unsigned long max_load_move, struct sched_domain *sd,
+ u64 max_load_move, struct sched_domain *sd,
enum cpu_idle_type idle, int *all_pinned,
int *this_best_prio, struct cfs_rq *busiest_cfs_rq)
{
int loops = 0, pulled = 0;
- long rem_load_move = max_load_move;
+ s64 rem_load_move = max_load_move;
struct task_struct *p, *n;

if (max_load_move == 0)
@@ -2199,13 +2199,12 @@ static void update_shares(int cpu)
rcu_read_unlock();
}

-static unsigned long
+static u64
load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
- unsigned long max_load_move,
- struct sched_domain *sd, enum cpu_idle_type idle,
- int *all_pinned, int *this_best_prio)
+ u64 max_load_move, struct sched_domain *sd,
+ enum cpu_idle_type idle, int *all_pinned, int *this_best_prio)
{
- long rem_load_move = max_load_move;
+ s64 rem_load_move = max_load_move;
int busiest_cpu = cpu_of(busiest);
struct task_group *tg;

@@ -2214,8 +2213,8 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,

list_for_each_entry_rcu(tg, &task_groups, list) {
struct cfs_rq *busiest_cfs_rq = tg->cfs_rq[busiest_cpu];
- unsigned long busiest_h_load = busiest_cfs_rq->h_load;
- unsigned long busiest_weight = busiest_cfs_rq->load.weight;
+ u64 busiest_h_load = busiest_cfs_rq->h_load;
+ u64 busiest_weight = busiest_cfs_rq->load.weight;
u64 rem_load, moved_load;

/*
@@ -2224,8 +2223,8 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
if (!busiest_cfs_rq->task_weight)
continue;

- rem_load = (u64)rem_load_move * busiest_weight;
- rem_load = div_u64(rem_load, busiest_h_load + 1);
+ rem_load = div64_u64(busiest_weight * rem_load_move,
+ busiest_h_load + 1);

moved_load = balance_tasks(this_rq, this_cpu, busiest,
rem_load, sd, idle, all_pinned, this_best_prio,
@@ -2234,8 +2233,8 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
if (!moved_load)
continue;

- moved_load *= busiest_h_load;
- moved_load = div_u64(moved_load, busiest_weight + 1);
+ moved_load = div64_u64(moved_load * busiest_h_load,
+ busiest_weight + 1);

rem_load_move -= moved_load;
if (rem_load_move < 0)
@@ -2250,11 +2249,10 @@ static inline void update_shares(int cpu)
{
}

-static unsigned long
+static u64
load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
- unsigned long max_load_move,
- struct sched_domain *sd, enum cpu_idle_type idle,
- int *all_pinned, int *this_best_prio)
+ u64 max_load_move, struct sched_domain *sd,
+ enum cpu_idle_type idle, int *all_pinned, int *this_best_prio)
{
return balance_tasks(this_rq, this_cpu, busiest,
max_load_move, sd, idle, all_pinned,
@@ -2270,11 +2268,10 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
* Called with both runqueues locked.
*/
static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
- unsigned long max_load_move,
- struct sched_domain *sd, enum cpu_idle_type idle,
- int *all_pinned)
+ u64 max_load_move, struct sched_domain *sd,
+ enum cpu_idle_type idle, int *all_pinned)
{
- unsigned long total_load_moved = 0, load_moved;
+ u64 total_load_moved = 0, load_moved;
int this_best_prio = this_rq->curr->prio;

do {
--
1.7.3.1

2011-04-20 20:52:31

by Nikhil Rao

Subject: [RFC][Patch 05/18] sched: update this_cpu_load() to return u64 value

The cpuidle menu governor uses this_cpu_load() in its calculations, and this
function now returns u64. Update the governor to use u64 as well.

Signed-off-by: Nikhil Rao <[email protected]>
---
drivers/cpuidle/governors/menu.c | 5 ++---
include/linux/sched.h | 2 +-
kernel/sched.c | 2 +-
3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index f508690..2051134 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -127,10 +127,9 @@ struct menu_device {

static int get_loadavg(void)
{
- unsigned long this = this_cpu_load();
+ u64 this = this_cpu_load();

-
- return LOAD_INT(this) * 10 + LOAD_FRAC(this) / 10;
+ return div_u64(LOAD_INT(this) * 10 + LOAD_FRAC(this), 10);
}

static inline int which_bucket(unsigned int duration)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d88be1..546a418 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -140,7 +140,7 @@ extern unsigned long nr_running(void);
extern unsigned long nr_uninterruptible(void);
extern unsigned long nr_iowait(void);
extern unsigned long nr_iowait_cpu(int cpu);
-extern unsigned long this_cpu_load(void);
+extern u64 this_cpu_load(void);


extern void calc_global_load(unsigned long ticks);
diff --git a/kernel/sched.c b/kernel/sched.c
index 94dd1df..175764b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3062,7 +3062,7 @@ unsigned long nr_iowait_cpu(int cpu)
return atomic_read(&this->nr_iowait);
}

-unsigned long this_cpu_load(void)
+u64 this_cpu_load(void)
{
struct rq *this = this_rq();
return this->cpu_load[0];
--
1.7.3.1

2011-04-20 20:52:28

by Nikhil Rao

Subject: [RFC][Patch 04/18] sched: update cpu_load to be u64

cpu_load derives from rq->load.weight and needs to be updated to u64, as it can
now overflow on 32-bit machines. This patch updates cpu_load in the rq struct
and all functions that use this field.

Signed-off-by: Nikhil Rao <[email protected]>
---
kernel/sched.c | 9 ++++-----
1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index bfee8ff..94dd1df 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -454,7 +454,7 @@ struct rq {
*/
unsigned long nr_running;
#define CPU_LOAD_IDX_MAX 5
- unsigned long cpu_load[CPU_LOAD_IDX_MAX];
+ u64 cpu_load[CPU_LOAD_IDX_MAX];
unsigned long last_load_update_tick;
#ifdef CONFIG_NO_HZ
u64 nohz_stamp;
@@ -3365,8 +3365,7 @@ static const unsigned char
* would be when CPU is idle and so we just decay the old load without
* adding any new load.
*/
-static unsigned long
-decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
+static u64 decay_load_missed(u64 load, unsigned long missed_updates, int idx)
{
int j = 0;

@@ -3396,7 +3395,7 @@ decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
*/
static void update_cpu_load(struct rq *this_rq)
{
- unsigned long this_load = this_rq->load.weight;
+ u64 this_load = this_rq->load.weight;
unsigned long curr_jiffies = jiffies;
unsigned long pending_updates;
int i, scale;
@@ -3413,7 +3412,7 @@ static void update_cpu_load(struct rq *this_rq)
/* Update our load: */
this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */
for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
- unsigned long old_load, new_load;
+ u64 old_load, new_load;

/* scale is effectively 1 << i now, and >> i divides by scale */

--
1.7.3.1

2011-04-20 20:52:33

by Nikhil Rao

Subject: [RFC][Patch 15/18] sched: update h_load to use u64

Calculate tg->h_load using u64 to handle u64 load weights.
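
The recursion being widened is, written out informally:

    h_load(tg, cpu) = h_load(parent, cpu) * weight(tg's se on cpu)
                      / (load.weight(parent's cfs_rq on cpu) + 1)

with the root level taking h_load directly from rq->load.weight; once the
weights are u64, the intermediate product needs div64_u64() as done below.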

Signed-off-by: Nikhil Rao <[email protected]>
---
kernel/sched.c | 12 +++++++-----
1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8047f10..7c1f3fc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -354,7 +354,7 @@ struct cfs_rq {
* Where f(tg) is the recursive weight fraction assigned to
* this group.
*/
- unsigned long h_load;
+ u64 h_load;

/*
* Maintaining per-cpu shares distribution for group scheduling
@@ -1541,15 +1541,17 @@ static unsigned long cpu_avg_load_per_task(int cpu)
*/
static int tg_load_down(struct task_group *tg, void *data)
{
- unsigned long load;
+ u64 load;
long cpu = (long)data;

if (!tg->parent) {
load = cpu_rq(cpu)->load.weight;
} else {
- load = tg->parent->cfs_rq[cpu]->h_load;
- load *= tg->se[cpu]->load.weight;
- load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
+ u64 parent_h_load = tg->parent->cfs_rq[cpu]->h_load;
+ u64 parent_weight = tg->parent->cfs_rq[cpu]->load.weight;
+
+ load = div64_u64(parent_h_load * tg->se[cpu]->load.weight,
+ parent_weight + 1);
}

tg->cfs_rq[cpu]->h_load = load;
--
1.7.3.1

2011-04-20 20:52:26

by Nikhil Rao

Subject: [RFC][Patch 06/18] sched: update source_load(), target_load() and weighted_cpuload() to use u64

source_load(), target_load() and weighted_cpuload() refer to values in
rq->cpu_load, which is now u64. Update these functions to return u64 as well.

Signed-off-by: Nikhil Rao <[email protected]>
---
kernel/sched.c | 10 +++++-----
1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 175764b..f0adb0e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1474,7 +1474,7 @@ static int tg_nop(struct task_group *tg, void *data)

#ifdef CONFIG_SMP
/* Used instead of source_load when we know the type == 0 */
-static unsigned long weighted_cpuload(const int cpu)
+static u64 weighted_cpuload(const int cpu)
{
return cpu_rq(cpu)->load.weight;
}
@@ -1486,10 +1486,10 @@ static unsigned long weighted_cpuload(const int cpu)
* We want to under-estimate the load of migration sources, to
* balance conservatively.
*/
-static unsigned long source_load(int cpu, int type)
+static u64 source_load(int cpu, int type)
{
struct rq *rq = cpu_rq(cpu);
- unsigned long total = weighted_cpuload(cpu);
+ u64 total = weighted_cpuload(cpu);

if (type == 0 || !sched_feat(LB_BIAS))
return total;
@@ -1501,10 +1501,10 @@ static unsigned long source_load(int cpu, int type)
* Return a high guess at the load of a migration-target cpu weighted
* according to the scheduling class and "nice" value.
*/
-static unsigned long target_load(int cpu, int type)
+static u64 target_load(int cpu, int type)
{
struct rq *rq = cpu_rq(cpu);
- unsigned long total = weighted_cpuload(cpu);
+ u64 total = weighted_cpuload(cpu);

if (type == 0 || !sched_feat(LB_BIAS))
return total;
--
1.7.3.1

2011-04-20 20:52:25

by Nikhil Rao

Subject: [RFC][Patch 13/18] sched: update f_b_g() to use u64 for weights

This patch updates f_b_g() and helper functions to use u64 to handle the
increased sched load resolution.

Signed-off-by: Nikhil Rao <[email protected]>
---
kernel/sched_fair.c | 51 +++++++++++++++++++++++++++------------------------
1 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 12b25b7..8478aac 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2946,12 +2946,13 @@ static int check_asym_packing(struct sched_domain *sd,
static inline void fix_small_imbalance(struct sd_lb_stats *sds,
int this_cpu, unsigned long *imbalance)
{
- unsigned long tmp, pwr_now = 0, pwr_move = 0;
+ u64 tmp, pwr_now = 0, pwr_move = 0;
unsigned int imbn = 2;
unsigned long scaled_busy_load_per_task;

if (sds->this_nr_running) {
- sds->this_load_per_task /= sds->this_nr_running;
+ sds->this_load_per_task = div_u64(sds->this_load_per_task,
+ sds->this_nr_running);
if (sds->busiest_load_per_task >
sds->this_load_per_task)
imbn = 1;
@@ -2959,9 +2960,9 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
sds->this_load_per_task =
cpu_avg_load_per_task(this_cpu);

- scaled_busy_load_per_task = sds->busiest_load_per_task
- * SCHED_POWER_SCALE;
- scaled_busy_load_per_task /= sds->busiest->cpu_power;
+ scaled_busy_load_per_task =
+ div_u64(sds->busiest_load_per_task * SCHED_POWER_SCALE,
+ sds->busiest->cpu_power);

if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
(scaled_busy_load_per_task * imbn)) {
@@ -2979,11 +2980,11 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
min(sds->busiest_load_per_task, sds->max_load);
pwr_now += sds->this->cpu_power *
min(sds->this_load_per_task, sds->this_load);
- pwr_now /= SCHED_POWER_SCALE;
+ pwr_now = div_u64(pwr_now, SCHED_POWER_SCALE);

/* Amount of load we'd subtract */
- tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
- sds->busiest->cpu_power;
+ tmp = div_u64(sds->busiest_load_per_task * SCHED_POWER_SCALE,
+ sds->busiest->cpu_power);
if (sds->max_load > tmp)
pwr_move += sds->busiest->cpu_power *
min(sds->busiest_load_per_task, sds->max_load - tmp);
@@ -2991,14 +2992,15 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
/* Amount of load we'd add */
if (sds->max_load * sds->busiest->cpu_power <
sds->busiest_load_per_task * SCHED_POWER_SCALE)
- tmp = (sds->max_load * sds->busiest->cpu_power) /
- sds->this->cpu_power;
+ tmp = div_u64(sds->max_load * sds->busiest->cpu_power,
+ sds->this->cpu_power);
else
- tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
- sds->this->cpu_power;
+ tmp = div_u64(sds->busiest_load_per_task * SCHED_POWER_SCALE,
+ sds->this->cpu_power);
+
pwr_move += sds->this->cpu_power *
min(sds->this_load_per_task, sds->this_load + tmp);
- pwr_move /= SCHED_POWER_SCALE;
+ pwr_move = div_u64(pwr_move, SCHED_POWER_SCALE);

/* Move if we gain throughput */
if (pwr_move > pwr_now)
@@ -3015,9 +3017,10 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
unsigned long *imbalance)
{
- unsigned long max_pull, load_above_capacity = ~0UL;
+ u64 max_pull, load_above_capacity = ~0ULL;

- sds->busiest_load_per_task /= sds->busiest_nr_running;
+ sds->busiest_load_per_task = div_u64(sds->busiest_load_per_task,
+ sds->busiest_nr_running);
if (sds->group_imb) {
sds->busiest_load_per_task =
min(sds->busiest_load_per_task, sds->avg_load);
@@ -3034,15 +3037,15 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
}

if (!sds->group_imb) {
+ unsigned long imb_capacity = (sds->busiest_nr_running -
+ sds->busiest_group_capacity);
/*
* Don't want to pull so many tasks that a group would go idle.
*/
- load_above_capacity = (sds->busiest_nr_running -
- sds->busiest_group_capacity);
-
- load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
-
- load_above_capacity /= sds->busiest->cpu_power;
+ load_above_capacity = NICE_0_LOAD * imb_capacity;
+ load_above_capacity =
+ div_u64(load_above_capacity * SCHED_POWER_SCALE,
+ sds->busiest->cpu_power);
}

/*
@@ -3059,8 +3062,8 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,

/* How much load to actually move to equalise the imbalance */
*imbalance = min(max_pull * sds->busiest->cpu_power,
- (sds->avg_load - sds->this_load) * sds->this->cpu_power)
- / SCHED_POWER_SCALE;
+ (sds->avg_load - sds->this_load)*sds->this->cpu_power);
+ *imbalance = div_u64(*imbalance, SCHED_POWER_SCALE);

/*
* if *imbalance is less than the average load per runnable task
@@ -3129,7 +3132,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
if (!sds.busiest || sds.busiest_nr_running == 0)
goto out_balanced;

- sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;
+ sds.avg_load = div_u64(sds.total_load*SCHED_POWER_SCALE, sds.total_pwr);

/*
* If the busiest group is imbalanced the below checks don't
--
1.7.3.1

2011-04-20 20:52:23

by Nikhil Rao

Subject: [RFC][Patch 14/18] sched: change type of imbalance to be u64

This patch changes the type of imbalance to u64. With the increased sched load
resolution, it is possible for a runqueue to have a sched weight of 2^32, and
imbalance needs to be updated to u64 to handle this case.
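
As a rough illustration of the new range (using the values from earlier in the
series): with NICE_0_LOAD now at 2^20 (1024 << 10), about 4096 runnable nice-0
tasks on one runqueue already put rq->load.weight at 2^32, so an imbalance
derived from such loads can no longer be assumed to fit in an unsigned long on
32-bit.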

Signed-off-by: Nikhil Rao <[email protected]>
---
include/linux/sched.h | 2 +-
kernel/sched_fair.c | 24 ++++++++++++------------
kernel/sched_stats.h | 2 +-
3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 546a418..2d9689a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -945,10 +945,10 @@ struct sched_domain {

#ifdef CONFIG_SCHEDSTATS
/* load_balance() stats */
+ u64 lb_imbalance[CPU_MAX_IDLE_TYPES];
unsigned int lb_count[CPU_MAX_IDLE_TYPES];
unsigned int lb_failed[CPU_MAX_IDLE_TYPES];
unsigned int lb_balanced[CPU_MAX_IDLE_TYPES];
- unsigned int lb_imbalance[CPU_MAX_IDLE_TYPES];
unsigned int lb_gained[CPU_MAX_IDLE_TYPES];
unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 8478aac..ab2d1c9 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2497,7 +2497,7 @@ static inline void update_sd_power_savings_stats(struct sched_group *group,
* Else returns 0.
*/
static inline int check_power_save_busiest_group(struct sd_lb_stats *sds,
- int this_cpu, unsigned long *imbalance)
+ int this_cpu, u64 *imbalance)
{
if (!sds->power_savings_balance)
return 0;
@@ -2526,7 +2526,7 @@ static inline void update_sd_power_savings_stats(struct sched_group *group,
}

static inline int check_power_save_busiest_group(struct sd_lb_stats *sds,
- int this_cpu, unsigned long *imbalance)
+ int this_cpu, u64 *imbalance)
{
return 0;
}
@@ -2916,7 +2916,7 @@ int __weak arch_sd_sibling_asym_packing(void)
*/
static int check_asym_packing(struct sched_domain *sd,
struct sd_lb_stats *sds,
- int this_cpu, unsigned long *imbalance)
+ int this_cpu, u64 *imbalance)
{
int busiest_cpu;

@@ -2943,8 +2943,8 @@ static int check_asym_packing(struct sched_domain *sd,
* @this_cpu: The cpu at whose sched_domain we're performing load-balance.
* @imbalance: Variable to store the imbalance.
*/
-static inline void fix_small_imbalance(struct sd_lb_stats *sds,
- int this_cpu, unsigned long *imbalance)
+static inline
+void fix_small_imbalance(struct sd_lb_stats *sds, int this_cpu, u64 *imbalance)
{
u64 tmp, pwr_now = 0, pwr_move = 0;
unsigned int imbn = 2;
@@ -3014,8 +3014,8 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
* @this_cpu: Cpu for which currently load balance is being performed.
* @imbalance: The variable to store the imbalance.
*/
-static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
- unsigned long *imbalance)
+static inline
+void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu, u64 *imbalance)
{
u64 max_pull, load_above_capacity = ~0ULL;

@@ -3103,9 +3103,9 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
* put to idle by rebalancing its tasks onto our group.
*/
static struct sched_group *
-find_busiest_group(struct sched_domain *sd, int this_cpu,
- unsigned long *imbalance, enum cpu_idle_type idle,
- const struct cpumask *cpus, int *balance)
+find_busiest_group(struct sched_domain *sd, int this_cpu, u64 *imbalance,
+ enum cpu_idle_type idle, const struct cpumask *cpus,
+ int *balance)
{
struct sd_lb_stats sds;

@@ -3202,7 +3202,7 @@ ret:
*/
static struct rq *
find_busiest_queue(struct sched_domain *sd, struct sched_group *group,
- enum cpu_idle_type idle, unsigned long imbalance,
+ enum cpu_idle_type idle, u64 imbalance,
const struct cpumask *cpus)
{
struct rq *busiest = NULL, *rq;
@@ -3308,7 +3308,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
{
int ld_moved, all_pinned = 0, active_balance = 0;
struct sched_group *group;
- unsigned long imbalance;
+ u64 imbalance;
struct rq *busiest;
unsigned long flags;
struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
diff --git a/kernel/sched_stats.h b/kernel/sched_stats.h
index 48ddf43..f44676c 100644
--- a/kernel/sched_stats.h
+++ b/kernel/sched_stats.h
@@ -46,7 +46,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
seq_printf(seq, "domain%d %s", dcount++, mask_str);
for (itype = CPU_IDLE; itype < CPU_MAX_IDLE_TYPES;
itype++) {
- seq_printf(seq, " %u %u %u %u %u %u %u %u",
+ seq_printf(seq, " %u %u %u %llu %u %u %u %u",
sd->lb_count[itype],
sd->lb_balanced[itype],
sd->lb_failed[itype],
--
1.7.3.1

2011-04-20 20:52:17

by Nikhil Rao

Subject: [RFC][Patch 10/18] sched: update wake_affine path to use u64, s64 for weights

Update the s64/u64 math in wake_affine() and effective_load() to handle
increased resolution.

Signed-off-by: Nikhil Rao <[email protected]>
---
kernel/sched_fair.c | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 029dd4f..2b030f4 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1388,7 +1388,7 @@ static void task_waking_fair(struct rq *rq, struct task_struct *p)
* of group shares between cpus. Assuming the shares were perfectly aligned one
* can calculate the shift in shares.
*/
-static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
+static s64 effective_load(struct task_group *tg, int cpu, s64 wl, s64 wg)
{
struct sched_entity *se = tg->se[cpu];

@@ -1396,7 +1396,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
return wl;

for_each_sched_entity(se) {
- long lw, w;
+ s64 lw, w;

tg = se->my_q->tg;
w = se->my_q->load.weight;
@@ -1409,7 +1409,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
wl += w;

if (lw > 0 && wl < lw)
- wl = (wl * tg->shares) / lw;
+ wl = div64_s64(wl * tg->shares, lw);
else
wl = tg->shares;

@@ -1504,7 +1504,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)

if (balanced ||
(this_load <= load &&
- this_load + target_load(prev_cpu, idx) <= tl_per_task)) {
+ this_load + target_load(prev_cpu, idx) <= (u64)tl_per_task)) {
/*
* This domain has SD_WAKE_AFFINE and
* p is cache cold in this domain, and
--
1.7.3.1

2011-04-20 20:52:15

by Nikhil Rao

Subject: [RFC][Patch 12/18] sched: Update update_sd_lb_stats() to use u64

Update update_sd_lb_stats() and helper functions to use u64/s64 for weight
calculations.

Signed-off-by: Nikhil Rao <[email protected]>
---
kernel/sched_fair.c | 22 +++++++++++-----------
1 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index d5b1276..12b25b7 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2308,23 +2308,23 @@ static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
* during load balancing.
*/
struct sd_lb_stats {
- struct sched_group *busiest; /* Busiest group in this sd */
- struct sched_group *this; /* Local group in this sd */
- unsigned long total_load; /* Total load of all groups in sd */
- unsigned long total_pwr; /* Total power of all groups in sd */
- unsigned long avg_load; /* Average load across all groups in sd */
+ struct sched_group *busiest; /* Busiest group in this sd */
+ struct sched_group *this; /* Local group in this sd */
+ u64 total_load; /* Total load of all groups in sd */
+ unsigned long total_pwr; /* Total power of all groups in sd */
+ u64 avg_load; /* Avg load across all groups in sd */

/** Statistics of this group */
- unsigned long this_load;
- unsigned long this_load_per_task;
+ u64 this_load;
+ u64 this_load_per_task;
unsigned long this_nr_running;
unsigned long this_has_capacity;
unsigned int this_idle_cpus;

/* Statistics of the busiest group */
unsigned int busiest_idle_cpus;
- unsigned long max_load;
- unsigned long busiest_load_per_task;
+ u64 max_load;
+ u64 busiest_load_per_task;
unsigned long busiest_nr_running;
unsigned long busiest_group_capacity;
unsigned long busiest_has_capacity;
@@ -2461,8 +2461,8 @@ static inline void update_sd_power_savings_stats(struct sched_group *group,
group_first_cpu(group) > group_first_cpu(sds->group_min))) {
sds->group_min = group;
sds->min_nr_running = sgs->sum_nr_running;
- sds->min_load_per_task = sgs->sum_weighted_load /
- sgs->sum_nr_running;
+ sds->min_load_per_task = div_u64(sgs->sum_weighted_load,
+ sgs->sum_nr_running);
}

/*
--
1.7.3.1

2011-04-20 20:55:46

by Nikhil Rao

Subject: [RFC][Patch 01/18] sched: introduce SCHED_POWER_SCALE to scale cpu_power calculations

SCHED_LOAD_SCALE is used to increase nice resolution and to scale cpu_power
calculations in the scheduler. This patch introduces SCHED_POWER_SCALE and
converts all uses of SCHED_LOAD_SCALE for scaling cpu_power to use
SCHED_POWER_SCALE instead.

This is a preparatory patch for increasing the resolution of SCHED_LOAD_SCALE;
there is no need to increase the resolution of cpu_power calculations.

Signed-off-by: Nikhil Rao <[email protected]>
---
include/linux/sched.h | 13 ++++++++-----
kernel/sched.c | 6 +++---
kernel/sched_fair.c | 44 +++++++++++++++++++++++---------------------
3 files changed, 34 insertions(+), 29 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 18d63ce..8d1ff2b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -792,17 +792,20 @@ enum cpu_idle_type {
};

/*
- * sched-domains (multiprocessor balancing) declarations:
- */
-
-/*
* Increase resolution of nice-level calculations:
*/
#define SCHED_LOAD_SHIFT 10
#define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT)

-#define SCHED_LOAD_SCALE_FUZZ SCHED_LOAD_SCALE
+/*
+ * Increase resolution of cpu_power calculations
+ */
+#define SCHED_POWER_SHIFT 10
+#define SCHED_POWER_SCALE (1L << SCHED_POWER_SHIFT)

+/*
+ * sched-domains (multiprocessor balancing) declarations:
+ */
#ifdef CONFIG_SMP
#define SD_LOAD_BALANCE 0x0001 /* Do load balancing on this domain. */
#define SD_BALANCE_NEWIDLE 0x0002 /* Balance when about to become idle */
diff --git a/kernel/sched.c b/kernel/sched.c
index 312f8b9..50f97cc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6468,7 +6468,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
cpulist_scnprintf(str, sizeof(str), sched_group_cpus(group));

printk(KERN_CONT " %s", str);
- if (group->cpu_power != SCHED_LOAD_SCALE) {
+ if (group->cpu_power != SCHED_POWER_SCALE) {
printk(KERN_CONT " (cpu_power = %d)",
group->cpu_power);
}
@@ -7176,7 +7176,7 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
sd->groups->cpu_power = 0;

if (!child) {
- power = SCHED_LOAD_SCALE;
+ power = SCHED_POWER_SCALE;
weight = cpumask_weight(sched_domain_span(sd));
/*
* SMT siblings share the power of a single core.
@@ -8224,7 +8224,7 @@ void __init sched_init(void)
#ifdef CONFIG_SMP
rq->sd = NULL;
rq->rd = NULL;
- rq->cpu_power = SCHED_LOAD_SCALE;
+ rq->cpu_power = SCHED_POWER_SCALE;
rq->post_schedule = 0;
rq->active_balance = 0;
rq->next_balance = jiffies;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 6fa833a..8d8d11c 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1557,7 +1557,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
}

/* Adjust by relative CPU power of the group */
- avg_load = (avg_load * SCHED_LOAD_SCALE) / group->cpu_power;
+ avg_load = (avg_load * SCHED_POWER_SCALE) / group->cpu_power;

if (local_group) {
this_load = avg_load;
@@ -1692,7 +1692,7 @@ select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_
nr_running += cpu_rq(i)->cfs.nr_running;
}

- capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
+ capacity = DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE);

if (tmp->flags & SD_POWERSAVINGS_BALANCE)
nr_running /= 2;
@@ -2534,7 +2534,7 @@ static inline int check_power_save_busiest_group(struct sd_lb_stats *sds,

unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
{
- return SCHED_LOAD_SCALE;
+ return SCHED_POWER_SCALE;
}

unsigned long __weak arch_scale_freq_power(struct sched_domain *sd, int cpu)
@@ -2571,8 +2571,8 @@ unsigned long scale_rt_power(int cpu)
available = total - rq->rt_avg;
}

- if (unlikely((s64)total < SCHED_LOAD_SCALE))
- total = SCHED_LOAD_SCALE;
+ if (unlikely((s64)total < SCHED_POWER_SCALE))
+ total = SCHED_POWER_SCALE;

total >>= SCHED_LOAD_SHIFT;

@@ -2582,7 +2582,7 @@ unsigned long scale_rt_power(int cpu)
static void update_cpu_power(struct sched_domain *sd, int cpu)
{
unsigned long weight = sd->span_weight;
- unsigned long power = SCHED_LOAD_SCALE;
+ unsigned long power = SCHED_POWER_SCALE;
struct sched_group *sdg = sd->groups;

if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
@@ -2646,7 +2646,7 @@ static inline int
fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
{
/*
- * Only siblings can have significantly less than SCHED_LOAD_SCALE
+ * Only siblings can have significantly less than SCHED_POWER_SCALE
*/
if (sd->level != SD_LV_SIBLING)
return 0;
@@ -2734,7 +2734,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
}

/* Adjust by relative CPU power of the group */
- sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;
+ sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->cpu_power;

/*
* Consider the group unbalanced when the imbalance is larger
@@ -2751,7 +2751,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
if ((max_cpu_load - min_cpu_load) >= avg_load_per_task && max_nr_running > 1)
sgs->group_imb = 1;

- sgs->group_capacity = DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
+ sgs->group_capacity = DIV_ROUND_CLOSEST(group->cpu_power,
+ SCHED_POWER_SCALE);
if (!sgs->group_capacity)
sgs->group_capacity = fix_small_capacity(sd, group);
sgs->group_weight = group->group_weight;
@@ -2925,7 +2926,7 @@ static int check_asym_packing(struct sched_domain *sd,
return 0;

*imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->cpu_power,
- SCHED_LOAD_SCALE);
+ SCHED_POWER_SCALE);
return 1;
}

@@ -2954,7 +2955,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
cpu_avg_load_per_task(this_cpu);

scaled_busy_load_per_task = sds->busiest_load_per_task
- * SCHED_LOAD_SCALE;
+ * SCHED_POWER_SCALE;
scaled_busy_load_per_task /= sds->busiest->cpu_power;

if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
@@ -2973,10 +2974,10 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
min(sds->busiest_load_per_task, sds->max_load);
pwr_now += sds->this->cpu_power *
min(sds->this_load_per_task, sds->this_load);
- pwr_now /= SCHED_LOAD_SCALE;
+ pwr_now /= SCHED_POWER_SCALE;

/* Amount of load we'd subtract */
- tmp = (sds->busiest_load_per_task * SCHED_LOAD_SCALE) /
+ tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
sds->busiest->cpu_power;
if (sds->max_load > tmp)
pwr_move += sds->busiest->cpu_power *
@@ -2984,15 +2985,15 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,

/* Amount of load we'd add */
if (sds->max_load * sds->busiest->cpu_power <
- sds->busiest_load_per_task * SCHED_LOAD_SCALE)
+ sds->busiest_load_per_task * SCHED_POWER_SCALE)
tmp = (sds->max_load * sds->busiest->cpu_power) /
sds->this->cpu_power;
else
- tmp = (sds->busiest_load_per_task * SCHED_LOAD_SCALE) /
+ tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
sds->this->cpu_power;
pwr_move += sds->this->cpu_power *
min(sds->this_load_per_task, sds->this_load + tmp);
- pwr_move /= SCHED_LOAD_SCALE;
+ pwr_move /= SCHED_POWER_SCALE;

/* Move if we gain throughput */
if (pwr_move > pwr_now)
@@ -3034,7 +3035,7 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
load_above_capacity = (sds->busiest_nr_running -
sds->busiest_group_capacity);

- load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_LOAD_SCALE);
+ load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);

load_above_capacity /= sds->busiest->cpu_power;
}
@@ -3054,7 +3055,7 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
/* How much load to actually move to equalise the imbalance */
*imbalance = min(max_pull * sds->busiest->cpu_power,
(sds->avg_load - sds->this_load) * sds->this->cpu_power)
- / SCHED_LOAD_SCALE;
+ / SCHED_POWER_SCALE;

/*
* if *imbalance is less than the average load per runnable task
@@ -3123,7 +3124,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
if (!sds.busiest || sds.busiest_nr_running == 0)
goto out_balanced;

- sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr;
+ sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;

/*
* If the busiest group is imbalanced the below checks don't
@@ -3202,7 +3203,8 @@ find_busiest_queue(struct sched_domain *sd, struct sched_group *group,

for_each_cpu(i, sched_group_cpus(group)) {
unsigned long power = power_of(i);
- unsigned long capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
+ unsigned long capacity = DIV_ROUND_CLOSEST(power,
+ SCHED_POWER_SCALE);
unsigned long wl;

if (!capacity)
@@ -3227,7 +3229,7 @@ find_busiest_queue(struct sched_domain *sd, struct sched_group *group,
* the load can be moved away from the cpu that is potentially
* running at a lower capacity.
*/
- wl = (wl * SCHED_LOAD_SCALE) / power;
+ wl = (wl * SCHED_POWER_SCALE) / power;

if (wl > max_load) {
max_load = wl;
--
1.7.3.1

2011-04-20 20:55:45

by Nikhil Rao

[permalink] [raw]
Subject: [RFC][Patch 17/18] sched: update f_b_q() to use u64 for weighted cpuload

Update f_b_q() to use u64 when comparing loads.

Signed-off-by: Nikhil Rao <[email protected]>
---
kernel/sched_fair.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 386d832..33c36f1 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3203,14 +3203,14 @@ find_busiest_queue(struct sched_domain *sd, struct sched_group *group,
const struct cpumask *cpus)
{
struct rq *busiest = NULL, *rq;
- unsigned long max_load = 0;
+ u64 max_load = 0;
int i;

for_each_cpu(i, sched_group_cpus(group)) {
unsigned long power = power_of(i);
unsigned long capacity = DIV_ROUND_CLOSEST(power,
SCHED_POWER_SCALE);
- unsigned long wl;
+ u64 wl;

if (!capacity)
capacity = fix_small_capacity(sd, group);
@@ -3234,7 +3234,7 @@ find_busiest_queue(struct sched_domain *sd, struct sched_group *group,
* the load can be moved away from the cpu that is potentially
* running at a lower capacity.
*/
- wl = (wl * SCHED_POWER_SCALE) / power;
+ wl = div_u64(wl * SCHED_POWER_SCALE, power);

if (wl > max_load) {
max_load = wl;
--
1.7.3.1

2011-04-20 20:56:11

by Nikhil Rao

Subject: [RFC][Patch 08/18] sched: update find_idlest_group() to use u64

Update find_idlest_group() to use u64 to accumulate and maintain cpu_load
weights.

Signed-off-by: Nikhil Rao <[email protected]>
---
kernel/sched_fair.c | 7 ++++---
1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0b36548..029dd4f 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1527,11 +1527,11 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int load_idx)
{
struct sched_group *idlest = NULL, *group = sd->groups;
- unsigned long min_load = ULONG_MAX, this_load = 0;
+ u64 min_load = ULLONG_MAX, this_load = 0;
int imbalance = 100 + (sd->imbalance_pct-100)/2;

do {
- unsigned long load, avg_load;
+ u64 load, avg_load;
int local_group;
int i;

@@ -1557,7 +1557,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
}

/* Adjust by relative CPU power of the group */
- avg_load = (avg_load * SCHED_POWER_SCALE) / group->cpu_power;
+ avg_load *= SCHED_POWER_SCALE;
+ avg_load = div_u64(avg_load, group->cpu_power);

if (local_group) {
this_load = avg_load;
--
1.7.3.1

2011-04-20 20:56:35

by Nikhil Rao

Subject: [RFC][Patch 09/18] sched: update division in cpu_avg_load_per_task to use div_u64

This patch updates the division in cpu_avg_load_per_task() to use div_u64(), so
that it works on 32-bit. We do not convert avg_load_per_task to u64, since it
can be at most 2^28 and therefore fits into an unsigned long on 32-bit.
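
Roughly, the 2^28 bound comes from the per-entity caps set up earlier in the
series: no entity on a runqueue weighs more than MAX_SHARES (2^28), and every
runnable entity accounts for at least one running task, so
rq->load.weight / nr_running cannot exceed 2^28.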

Signed-off-by: Nikhil Rao <[email protected]>
---
kernel/sched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index f0adb0e..8047f10 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1525,7 +1525,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
unsigned long nr_running = ACCESS_ONCE(rq->nr_running);

if (nr_running)
- rq->avg_load_per_task = rq->load.weight / nr_running;
+ rq->avg_load_per_task = div_u64(rq->load.weight, nr_running);
else
rq->avg_load_per_task = 0;

--
1.7.3.1

2011-04-21 06:16:50

by Ingo Molnar

Subject: Re: [RFC][PATCH 00/18] Increase resolution of load weights


* Nikhil Rao <[email protected]> wrote:

> Major TODOs:
> - Detect overflow in update shares calculations (time * load), and set load_avg
> to maximum possible value (~0ULL).
> - tg->task_weight uses an atomic which needs to be updated to 64-bit on 32-bit
> machines. Might need to add a lock to protect this instead of atomic ops.
> - Check wake-affine math and effective load calculations for overflows.
> - Needs more testing and need to ensure fairness/balancing is not broken.

Please measure micro-costs accurately as well, via perf stat --repeat 10 or so.

For example, on a testsystem doing 200k pipe triggered context switches (100k
pipe ping-pongs) costs this much:

$ taskset 1 perf stat --repeat 10 ./pipe-test-100k

       630.908390 task-clock-msecs         #      0.434 CPUs    ( +-   0.499% )
          200,001 context-switches         #      0.317 M/sec   ( +-   0.000% )
                0 CPU-migrations           #      0.000 M/sec   ( +-  66.667% )
              145 page-faults              #      0.000 M/sec   ( +-   0.253% )
    1,374,978,900 cycles                   #   2179.364 M/sec   ( +-   0.516% )
    1,373,646,429 instructions             #      0.999 IPC     ( +-   0.134% )
      264,223,224 branches                 #    418.798 M/sec   ( +-   0.134% )
       16,613,988 branch-misses            #      6.288 %       ( +-   0.755% )
          204,162 cache-references         #      0.324 M/sec   ( +-  18.805% )
            5,152 cache-misses             #      0.008 M/sec   ( +-  21.280% )

We want to know the delta in the 'instructions' value resulting from the patch
(this can be measured very accurately) and we also want to see the 'cycles'
effect - both can be measured pretty accurately.

I've attached the testcase - you might need to increase the --repeat value so
that noise drops below the level of the effect from these patches. (the effect
is likely in the 0.01% range)

It would also be nice to see how 'size vmlinux' changes with these patches
applied, on a 'make defconfig' build.

Thanks,

Ingo


Attachments:
pipe-test-100k.c (537.00 B)

2011-04-21 16:29:59

by Peter Zijlstra

Subject: Re: [RFC][PATCH 00/18] Increase resolution of load weights

On Thu, 2011-04-21 at 08:16 +0200, Ingo Molnar wrote:
> * Nikhil Rao <[email protected]> wrote:
>
> > Major TODOs:
> > - Detect overflow in update shares calculations (time * load), and set load_avg
> > to maximum possible value (~0ULL).
> > - tg->task_weight uses an atomic which needs to be updated to 64-bit on 32-bit
> > machines. Might need to add a lock to protect this instead of atomic ops.
> > - Check wake-affine math and effective load calculations for overflows.
> > - Needs more testing and need to ensure fairness/balancing is not broken.
>
> Please measure micro-costs accurately as well, via perf stat --repeat 10 or so.
>
> For example, on a testsystem doing 200k pipe triggered context switches (100k
> pipe ping-pongs) costs this much:
>
> $ taskset 1 perf stat --repeat 10 ./pipe-test-100k
>

> We want to know the delta in the 'instructions' value resulting from the patch
> (this can be measured very accurately) and we also want to see the 'cycles'
> effect - both can be measured pretty accurately.
>
> I've attached the testcase - you might need to increase the --repeat value so
> that noise drops below the level of the effect from these patches. (the effect
> is likely in the 0.01% range)
>
> It would also be nice to see how 'size vmlinux' changes with these patches
> applied, on a 'make defconfig' build.

Well, it shouldn't change at all for 64-bit kernels (would be good to
double check); 32-bit kernels, however, will get bigger and slower.

The problem is, the only alternative is to make 32bit !cgroup kernels
different, but then we end up with a massive wrappery mess... and the
last thing the load-balance code needs is more obfuscation.




2011-04-21 16:38:15

by Peter Zijlstra

Subject: Re: [RFC][PATCH 00/18] Increase resolution of load weights

On Wed, 2011-04-20 at 13:51 -0700, Nikhil Rao wrote:
>
> I would like to get some feedback on the direction of this patchset. Please let
> me know if there are alternative ways of doing this, and I'll be happy to
> explore them as well.
>
> The patchset applies cleanly to v2.6.39-rc4. It compiles for i386 and boots on
> x86_64. Beyond the basic checks, it has not been well tested yet.
>
> Major TODOs:
> - Detect overflow in update shares calculations (time * load), and set load_avg
> to maximum possible value (~0ULL).
> - tg->task_weight uses an atomic which needs to be updated to 64-bit on 32-bit
> machines. Might need to add a lock to protect this instead of atomic ops.
> - Check wake-affine math and effective load calculations for overflows.
> - Needs more testing and need to ensure fairness/balancing is not broken.

The code looks fairly ok and I can't fault the TODOs ;-)

I guess getting some measurements on the performance penalty on 32bit
would be nice -- if only to know about how bad it is.

And while it's not perfect by a long stretch (we can still blow the whole
lot by creating a deep enough hierarchy) it should be much better.. the
only advantage of going with a full on wrapper solution would be that
plugging in an arbitrary precision type would be simple ;-)

2011-04-26 16:11:51

by Nikhil Rao

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/18] Increase resolution of load weights

On Wed, Apr 20, 2011 at 11:16 PM, Ingo Molnar <[email protected]> wrote:
>
> * Nikhil Rao <[email protected]> wrote:
>
>> Major TODOs:
>> - Detect overflow in update shares calculations (time * load), and set load_avg
>>   to maximum possible value (~0ULL).
>> - tg->task_weight uses an atomic which needs to be updates to 64-bit on 32-bit
>>   machines. Might need to add a lock to protect this instead of atomic ops.
>> - Check wake-affine math and effective load calculations for overflows.
>> - Needs more testing and need to ensure fairness/balancing is not broken.
>
> Please measure micro-costs accurately as well, via perf stat --repeat 10 or so.
>
> For example, on a testsystem doing 200k pipe triggered context switches (100k
> pipe ping-pongs) costs this much:
>
>  $ taskset 1 perf stat --repeat 10 ./pipe-test-100k
>
>        630.908390 task-clock-msecs         #      0.434 CPUs    ( +-   0.499% )
>           200,001 context-switches         #      0.317 M/sec   ( +-   0.000% )
>                 0 CPU-migrations           #      0.000 M/sec   ( +-  66.667% )
>               145 page-faults              #      0.000 M/sec   ( +-   0.253% )
>     1,374,978,900 cycles                   #   2179.364 M/sec   ( +-   0.516% )
>     1,373,646,429 instructions             #      0.999 IPC     ( +-   0.134% )
>       264,223,224 branches                 #    418.798 M/sec   ( +-   0.134% )
>        16,613,988 branch-misses            #      6.288 %       ( +-   0.755% )
>           204,162 cache-references         #      0.324 M/sec   ( +-  18.805% )
>             5,152 cache-misses             #      0.008 M/sec   ( +-  21.280% )
>
> We want to know the delta in the 'instructions' value resulting from the patch
> (this can be measured very accurately) and we also want to see the 'cycles'
> effect - both can be measured pretty accurately.
>
> I've attached the testcase - you might need to increase the --repeat value so
> that noise drops below the level of the effect from these patches. (the effect
> is likely in the 0.01% range)
>

Thanks for the test program. Sorry for the delay in getting back to
you with results. I had some trouble wrangling machines :-(

I have data from pipe_test_100k on 32-bit builds below. I ran this
test 5000 times on each kernel with the two events (instructions,
cycles) configured (the test machine does not have enough PMUs to
measure all events without scaling).

taskset 1 perf stat --repeat 5000 -e instructions,cycles ./pipe-test-100k

baseline (v2.6.39-rc4):

Performance counter stats for './pipe-test-100k' (5000 runs):

994,061,050 instructions # 0.412 IPC ( +- 0.133% )
2,414,463,154 cycles ( +- 0.056% )

2.251820874 seconds time elapsed ( +- 0.429% )

kernel + patch:

Performance counter stats for './pipe-test-100k' (5000 runs):

1,064,610,666 instructions # 0.435 IPC ( +- 0.086% )
2,448,568,573 cycles ( +- 0.037% )

1.704553841 seconds time elapsed ( +- 0.288% )

We see a ~7.1% increase in instructions executed and a 1.4% increase
in cycles. We also see a 5.5% increase in IPC (understandable since we
do more work). I can't explain how elapsed time drops by about 0.5s
though.

> It would also be nice to see how 'size vmlinux' changes with these patches
> applied, on a 'make defconfig' build.
>

With a defconfig build, we see a marginal increase in vmlinux text
size (3049 bytes, 0.043%), and a small decrease in data size (-4040
bytes, -0.57%).

baseline (v2.6.39-rc4):
text data bss dec hex filename
7025688 711604 1875968 9613260 92afcc vmlinux-2.6.39-rc4

kernel + patch:
text data bss dec hex filename
7028737 707564 1875968 9612269 92abed vmlinux

-Thanks
Nikhil

> Thanks,
>
>        Ingo
>

2011-04-28 09:55:37

by Nikunj A Dadhania

[permalink] [raw]
Subject: Re: [RFC][Patch 02/18] sched: increase SCHED_LOAD_SCALE resolution

On Wed, 20 Apr 2011 13:51:21 -0700, Nikhil Rao <[email protected]> wrote:
> Introduce SCHED_LOAD_RESOLUTION, which is added to SCHED_LOAD_SHIFT and
> increases the resolution of SCHED_LOAD_SCALE. This patch sets the value of
> SCHED_LOAD_RESOLUTION to 10, scaling up the weights for all sched entities by
> a factor of 1024. With this extra resolution, we can handle deeper cgroup
> hierarchies and the scheduler can do better shares distribution and load
> balancing on larger systems (especially for low weight task groups).
>
> This does not change the existing user interface, the scaled weights are only
> used internally. We do not modify prio_to_weight values or inverses, but use
> the original weights when calculating the inverse which is used to scale
> execution time delta in calc_delta_mine(). This ensures we do not lose accuracy
> when accounting time to the sched entities.
>
> Signed-off-by: Nikhil Rao <[email protected]>
> ---
> include/linux/sched.h | 3 ++-
> kernel/sched.c | 18 ++++++++++--------
> 2 files changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8d1ff2b..d2c3bab 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -794,7 +794,8 @@ enum cpu_idle_type {
> /*
> * Increase resolution of nice-level calculations:
> */
> -#define SCHED_LOAD_SHIFT 10
> +#define SCHED_LOAD_RESOLUTION 10
> +#define SCHED_LOAD_SHIFT (10 + SCHED_LOAD_RESOLUTION)
> #define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT)
>
> /*
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 50f97cc..bfee8ff 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -293,7 +293,7 @@ static DEFINE_SPINLOCK(task_group_lock);
> * limitation from this.)
> */
> #define MIN_SHARES 2
> -#define MAX_SHARES (1UL << 18)
> +#define MAX_SHARES (1UL << 28)
Shouldn't this change depend on SCHED_LOAD_RESOLUTION?

#define MAX_SHARES (1UL << (18 + SCHED_LOAD_RESOLUTION))

Nikunj
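
A sketch of what this suggestion amounts to is below; MAX_SHARES_SHIFT is an
invented name for the historical 18-bit limit, kernel/sched.c only has the
literal values.

/*
 * Sketch: tie MAX_SHARES to the resolution shift instead of hard-coding
 * 1UL << 28, so the two cannot drift apart. MAX_SHARES_SHIFT is an
 * invented helper name, not something from kernel/sched.c.
 */
#define SCHED_LOAD_RESOLUTION	10

#define MIN_SHARES		2
#define MAX_SHARES_SHIFT	18	/* old 2^18 limit */
#define MAX_SHARES		(1UL << (MAX_SHARES_SHIFT + SCHED_LOAD_RESOLUTION))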

2011-04-28 11:48:35

by Nikunj A Dadhania

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/18] Increase resolution of load weights

On Thu, 28 Apr 2011 12:37:27 +0530, "Nikunj A. Dadhania" <[email protected]> wrote:
> On Wed, 20 Apr 2011 13:51:19 -0700, Nikhil Rao <[email protected]> wrote:
> > Hi All,
> >
> > I have attached an early version of a RFC patchset to increase resolution of
> > sched entity load weights. This RFC introduces SCHED_LOAD_RESOLUTION which
> > scales NICE_0_LOAD by a factor of 1024. The scaling is done internally and should
> > be completely invisible to the user.
> >
> > Why do we need this?
> > This extra resolution allows us to scale on two dimensions - number of cpus and
> > the depth of hierarchies. It also allows for proper load balancing of low weight
> > task groups (for eg., nice+19 on autogroup).
> >
> > One of the big roadblocks for increasing resolution is the use of unsigned long
> > for load.weight, which on 32-bit architectures can overflow with ~48 max-weight
> > sched entities. In this RFC we convert all uses of load.weight to u64. This is
> > still a work-in-progress and I have listed some of the issues I am still
> > investigating below.
> >
> > I would like to get some feedback on the direction of this patchset. Please let
> > me know if there are alternative ways of doing this, and I'll be happy to
> > explore them as well.
> >
> > The patchset applies cleanly to v2.6.39-rc4. It compiles for i386 and boots on
> > x86_64. Beyond the basic checks, it has not been well tested yet.
> >
> > Major TODOs:
> > - Detect overflow in update shares calculations (time * load), and set load_avg
> > to maximum possible value (~0ULL).
> > - tg->task_weight uses an atomic which needs to be updates to 64-bit on 32-bit
> > machines. Might need to add a lock to protect this instead of atomic ops.
> > - Check wake-affine math and effective load calculations for overflows.
> > - Needs more testing and need to ensure fairness/balancing is not broken.
> >
> Hi Nikhil,
>
> I did a quick test for creating 600 cpu hog tasks with and without this
> patches on a 16cpu machine(x86_64) and I am seeing some mis-behaviour:
>
> Base kernel - 2.6.39-rc4
>
> [root@krm1 ~]# time -p ./test
> real 43.54
> user 0.12
> sys 1.05
> [root@krm1 ~]#
>
> Base + patches
>
> [root@krm1 ~]# time -p ./test
>
> Takes almost infinity, after 2 minutes I see only 16 tasks created
> viewed from another ssh session to the machine:
>
I could get this working using the following patch, not sure if it has other
implications though. With this, I am back to saner time values for
creating 600 cpu hog tasks:

[root@ ~]# time -p ./test
real 45.02
user 0.13
sys 1.07
[root@ ~]#
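
The ./test program itself is not part of the thread; a minimal sketch of a
cpu-hog spawner along those lines follows, where the task count and the hog
body are assumptions.

/*
 * Sketch of a "create N cpu hogs" test; the real ./test is not included
 * in this thread, so the loop bound and hog body are assumptions.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define NR_TASKS	600

int main(void)
{
	int i;

	for (i = 0; i < NR_TASKS; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("fork");
			return 1;
		}
		if (pid == 0) {
			/* child: burn a fixed amount of cpu, then exit */
			volatile unsigned long n = 0;

			while (n < 500000000UL)
				n++;
			_exit(0);
		}
	}

	/*
	 * parent: 'time -p ./test' measures how long it takes for all
	 * NR_TASKS hogs to be created and to finish.
	 */
	while (wait(NULL) > 0)
		;
	return 0;
}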

===================================================================
From: Nikunj A. Dadhania <[email protected]>

sched: calc_delta_mine - fix calculation

All the calculations of inv_weight take the scaled-down weight, but
when calculating tmp, the weight is not scaled down by
SCHED_LOAD_RESOLUTION. The result is therefore far too large, because
of which sched_slice() thinks that it is not yet time to preempt the
currently running task.

Signed-off-by: Nikunj A. Dadhania <[email protected]>

Index: kernel/sched.c
===================================================================
--- kernel/sched.c.orig 2011-04-28 16:34:24.000000000 +0530
+++ kernel/sched.c 2011-04-28 16:36:29.000000000 +0530
@@ -1336,7 +1336,7 @@ calc_delta_mine(unsigned long delta_exec
lw->inv_weight = 1 + (WMULT_CONST - w/2) / (w + 1);
}

- tmp = (u64)delta_exec * weight;
+ tmp = (u64)delta_exec * (weight >> SCHED_LOAD_RESOLUTION);
/*
* Check whether we'd overflow the 64-bit multiplication:
*/
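
To make the effect of that one-liner concrete, here is a simplified user-space
sketch of the scaling involved. The constants and the inverse calculation are
reduced from the kernel code of that era; calc_delta_sketch() is not a kernel
function.

/*
 * Simplified sketch of the scaling fixed above; not the kernel code.
 * delta ~= delta_exec * weight / lw_weight, computed via a 32-bit
 * inverse of the *scaled-down* queue weight.
 */
#include <stdint.h>
#include <stdio.h>

#define SCHED_LOAD_RESOLUTION	10
#define WMULT_SHIFT		32
#define WMULT_CONST		(~0U)

static uint64_t calc_delta_sketch(uint64_t delta_exec, uint64_t weight,
				  uint64_t lw_weight)
{
	/* the inverse is built from the weight scaled back down ... */
	uint32_t inv = WMULT_CONST / (uint32_t)(lw_weight >> SCHED_LOAD_RESOLUTION);

	/*
	 * ... so the numerator must be scaled down too; without this
	 * shift the result comes out ~2^10 times too large and
	 * sched_slice() never asks for preemption.
	 */
	uint64_t tmp = delta_exec * (weight >> SCHED_LOAD_RESOLUTION);

	return (tmp * inv) >> WMULT_SHIFT;
}

int main(void)
{
	/* NICE_0 weight at the new resolution: 1024 << 10 */
	uint64_t w = 1024ULL << SCHED_LOAD_RESOLUTION;

	/* ~1ms of runtime against an equally weighted queue stays ~1ms */
	printf("%llu\n", (unsigned long long)calc_delta_sketch(1000000, w, w));
	return 0;
}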

2011-04-28 12:13:52

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/18] Increase resolution of load weights

On Thu, Apr 28, 2011 at 05:18:27PM +0530, Nikunj A. Dadhania wrote:
> --- kernel/sched.c.orig 2011-04-28 16:34:24.000000000 +0530
> +++ kernel/sched.c 2011-04-28 16:36:29.000000000 +0530
> @@ -1336,7 +1336,7 @@ calc_delta_mine(unsigned long delta_exec
> lw->inv_weight = 1 + (WMULT_CONST - w/2) / (w + 1);
> }
>
> - tmp = (u64)delta_exec * weight;
> + tmp = (u64)delta_exec * (weight >> SCHED_LOAD_RESOLUTION);

Should we be fixing inv_weight rather to account for SCHED_LOAD_RESOLUTION here?

- vatsa

2011-04-28 17:11:58

by Nikhil Rao

[permalink] [raw]
Subject: Re: [RFC][Patch 02/18] sched: increase SCHED_LOAD_SCALE resolution

On Thu, Apr 28, 2011 at 2:54 AM, Nikunj A. Dadhania
<[email protected]> wrote:
> On Wed, 20 Apr 2011 13:51:21 -0700, Nikhil Rao <[email protected]> wrote:
>> Introduce SCHED_LOAD_RESOLUTION, which is added to SCHED_LOAD_SHIFT and
>> increases the resolution of SCHED_LOAD_SCALE. This patch sets the value of
>> SCHED_LOAD_RESOLUTION to 10, scaling up the weights for all sched entities by
>> a factor of 1024. With this extra resolution, we can handle deeper cgroup
>> hierarchies and the scheduler can do better shares distribution and load
>> balancing on larger systems (especially for low weight task groups).
>>
>> This does not change the existing user interface, the scaled weights are only
>> used internally. We do not modify prio_to_weight values or inverses, but use
>> the original weights when calculating the inverse which is used to scale
>> execution time delta in calc_delta_mine(). This ensures we do not lose accuracy
>> when accounting time to the sched entities.
>>
>> Signed-off-by: Nikhil Rao <[email protected]>
>> ---
>>  include/linux/sched.h |    3 ++-
>>  kernel/sched.c        |   18 ++++++++++--------
>>  2 files changed, 12 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 8d1ff2b..d2c3bab 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -794,7 +794,8 @@ enum cpu_idle_type {
>>  /*
>>   * Increase resolution of nice-level calculations:
>>   */
>> -#define SCHED_LOAD_SHIFT     10
>> +#define SCHED_LOAD_RESOLUTION        10
>> +#define SCHED_LOAD_SHIFT     (10 + SCHED_LOAD_RESOLUTION)
>>  #define SCHED_LOAD_SCALE     (1L << SCHED_LOAD_SHIFT)
>>
>>  /*
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 50f97cc..bfee8ff 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -293,7 +293,7 @@ static DEFINE_SPINLOCK(task_group_lock);
>>   *  limitation from this.)
>>   */
>>  #define MIN_SHARES   2
>> -#define MAX_SHARES   (1UL << 18)
>> +#define MAX_SHARES   (1UL << 28)
> Shouldn't this change depend on SCHED_LOAD_RESOLUTION?
>
> #define MAX_SHARES      (1UL << (18 + SCHED_LOAD_RESOLUTION))
>

Right, and the other 18 bits come from SCHED_LOAD_SCALE + some slack,
which is also implicitly defined. I can make this more explicit in the
next rev as you suggest. I don't expect SCHED_LOAD_RESOLUTION to
change often though.

> Nikunj
>

2011-04-28 18:21:13

by Nikhil Rao

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/18] Increase resolution of load weights

On Thu, Apr 28, 2011 at 4:48 AM, Nikunj A. Dadhania
<[email protected]> wrote:
> On Thu, 28 Apr 2011 12:37:27 +0530, "Nikunj A. Dadhania" <[email protected]> wrote:
>> On Wed, 20 Apr 2011 13:51:19 -0700, Nikhil Rao <[email protected]> wrote:
>> > Hi All,
>> >
>> > I have attached an early version of a RFC patchset to increase resolution of
>> > sched entity load weights. This RFC introduces SCHED_LOAD_RESOLUTION which
>> > scales NICE_0_LOAD by a factor of 1024. The scaling is done internally and should
>> > be completely invisible to the user.
>> >
>> > Why do we need this?
>> > This extra resolution allows us to scale on two dimensions - number of cpus and
>> > the depth of hierarchies. It also allows for proper load balancing of low weight
>> > task groups (for eg., nice+19 on autogroup).
>> >
>> > One of the big roadblocks for increasing resolution is the use of unsigned long
>> > for load.weight, which on 32-bit architectures can overflow with ~48 max-weight
>> > sched entities. In this RFC we convert all uses of load.weight to u64. This is
>> > still a work-in-progress and I have listed some of the issues I am still
>> > investigating below.
>> >
>> > I would like to get some feedback on the direction of this patchset. Please let
>> > me know if there are alternative ways of doing this, and I'll be happy to
>> > explore them as well.
>> >
>> > The patchset applies cleanly to v2.6.39-rc4. It compiles for i386 and boots on
>> > x86_64. Beyond the basic checks, it has not been well tested yet.
>> >
>> > Major TODOs:
>> > - Detect overflow in update shares calculations (time * load), and set load_avg
>> >   to maximum possible value (~0ULL).
>> > - tg->task_weight uses an atomic which needs to be updates to 64-bit on 32-bit
>> >   machines. Might need to add a lock to protect this instead of atomic ops.
>> > - Check wake-affine math and effective load calculations for overflows.
>> > - Needs more testing and need to ensure fairness/balancing is not broken.
>> >
>> Hi Nikhil,
>>
>> I did a quick test for creating 600 cpu hog tasks with and without this
>> patches on a 16cpu machine(x86_64) and I am seeing some mis-behaviour:
>>
>> Base kernel - 2.6.39-rc4
>>
>> [root@krm1 ~]# time -p ./test
>> real 43.54
>> user 0.12
>> sys 1.05
>> [root@krm1 ~]#
>>
>> Base + patches
>>
>> [root@krm1 ~]# time -p ./test
>>
>> Takes almost infinity, after 2 minutes I see only 16 tasks created
>> viewed from another ssh session to the machine:
>>
> I could get this working using the following patch, not sure if it has other
> implications though. With this, I am back to saner time values for
> creating 600 cpu hog tasks:
>
> [root@ ~]# time -p ./test
> real 45.02
> user 0.13
> sys 1.07
> [root@ ~]#
>

Nikunj,

Thanks for running the tests and identifying this issue. You are right
-- we need to scale the reference weight, else we end up with slices
that are 2^10 times the expected value. Thanks for the patch.

-Thanks,
Nikhil

> ===================================================================
>    From: Nikunj A. Dadhania <[email protected]>
>
>    sched: calc_delta_mine - fix calculation
>
>    All the calculations of inv_weight take the scaled-down weight, but
>    when calculating tmp, the weight is not scaled down by
>    SCHED_LOAD_RESOLUTION. The result is therefore far too large, because
>    of which sched_slice() thinks that it is not yet time to preempt the
>    currently running task.
>
>    Signed-off-by: Nikunj A. Dadhania <[email protected]>
>
> Index: kernel/sched.c
> ===================================================================
> --- kernel/sched.c.orig 2011-04-28 16:34:24.000000000 +0530
> +++ kernel/sched.c      2011-04-28 16:36:29.000000000 +0530
> @@ -1336,7 +1336,7 @@ calc_delta_mine(unsigned long delta_exec
>                        lw->inv_weight = 1 + (WMULT_CONST - w/2) / (w + 1);
>        }
>
> -       tmp = (u64)delta_exec * weight;
> +       tmp = (u64)delta_exec * (weight >> SCHED_LOAD_RESOLUTION);
>        /*
>         * Check whether we'd overflow the 64-bit multiplication:
>         */
>
>

2011-04-28 18:33:56

by Nikhil Rao

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/18] Increase resolution of load weights

On Thu, Apr 28, 2011 at 5:12 AM, Srivatsa Vaddagiri
<[email protected]> wrote:
> On Thu, Apr 28, 2011 at 05:18:27PM +0530, Nikunj A. Dadhania wrote:
>> --- kernel/sched.c.orig       2011-04-28 16:34:24.000000000 +0530
>> +++ kernel/sched.c    2011-04-28 16:36:29.000000000 +0530
>> @@ -1336,7 +1336,7 @@ calc_delta_mine(unsigned long delta_exec
>>                       lw->inv_weight = 1 + (WMULT_CONST - w/2) / (w + 1);
>>       }
>>
>> -     tmp = (u64)delta_exec * weight;
>> +     tmp = (u64)delta_exec * (weight >> SCHED_LOAD_RESOLUTION);
>
> Should we be fixing inv_weight rather to account for SCHED_LOAD_RESOLUTION here?
>

Yes, I have been looking into fixing inv_weight and calc_delta_mine()
calculations based on the assumption that we have u64 weights. IMO the
function is complicated because the return value needs to be
calculated to fit into unsigned long. I would like to update users of
calc_delta_mine() to use u64 instead of unsigned longs and I think
this can be easily done (quick inspection of the code shows two call
sites that need to be updated - update_curr() and wakeup_gran()).
Without the restriction to fit into unsigned long, I think we can make
calc_delta_mine() and the inv_weight calculations simpler.

-Thanks,
Nikhil

2011-04-28 18:52:33

by Paul Turner

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/18] Increase resolution of load weights

On Thu, Apr 28, 2011 at 11:33 AM, Nikhil Rao <[email protected]> wrote:
> On Thu, Apr 28, 2011 at 5:12 AM, Srivatsa Vaddagiri
> <[email protected]> wrote:
>> On Thu, Apr 28, 2011 at 05:18:27PM +0530, Nikunj A. Dadhania wrote:
>>> --- kernel/sched.c.orig       2011-04-28 16:34:24.000000000 +0530
>>> +++ kernel/sched.c    2011-04-28 16:36:29.000000000 +0530
>>> @@ -1336,7 +1336,7 @@ calc_delta_mine(unsigned long delta_exec
>>>                       lw->inv_weight = 1 + (WMULT_CONST - w/2) / (w + 1);
>>>       }
>>>
>>> -     tmp = (u64)delta_exec * weight;
>>> +     tmp = (u64)delta_exec * (weight >> SCHED_LOAD_RESOLUTION);
>>
>> Should we be fixing inv_weight rather to account for SCHED_LOAD_RESOLUTION here?
>>
>
> Yes, I have been looking into fixing inv_weight and calc_delta_mine()
> calculations based on the assumption that we have u64 weights. IMO the
> function is complicated because the return value needs to be
> calculated to fit into unsigned long. I would like to update users of
> calc_delta_mine() to use u64 instead of unsigned longs and I think
> this can be easily done (quick inspection of the code shows two call
> sites that need to be updated - update_curr() and wakeup_gran()).
> Without the restriction to fit into unsigned long, I think we can make
> calc_delta_mine() and the inv_weight calculations simpler.
>

I don't think you have much room to maneuver here, the calculations in
c_d_m() are already u64 based, even on 32bit. Changing the external
load factors to 64 bit doesn't change this.

We lose fairness in cdm beyond 32 bits, at the old LOAD_SCALE=10
you've got 22 bits with which you can maintain fairness. This gives
total accuracy in total curr on any delta <= ~4ms (for a NICE_0 task).
If you bump this up (and don't downshift before computing the inverse
as you are) then you start introducing rounding errors beyond ~4us.

This would also be further exacerbated in sched_period() since that's
using the total cfs_rq weight.

> -Thanks,
> Nikhil
>
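
A quick back-of-the-envelope check of those numbers, treating WMULT_CONST as
2^32 for simplicity (the exact kernel constant differs by one):

/*
 * calc_delta_mine() stays exact only while delta_exec * weight fits
 * under WMULT_CONST (~2^32); beyond that it shifts the product down
 * and starts rounding. Sketch only, constants rounded.
 */
#include <stdio.h>

int main(void)
{
	unsigned long long wmult_const = 1ULL << 32;
	unsigned long long nice0_old = 1ULL << 10;	/* NICE_0_LOAD, old resolution */
	unsigned long long nice0_new = 1ULL << 20;	/* NICE_0_LOAD << SCHED_LOAD_RESOLUTION */

	/* largest exact delta_exec, in nanoseconds */
	printf("old: %llu ns (~4.2 ms)\n", wmult_const / nice0_old);
	printf("new: %llu ns (~4.1 us)\n", wmult_const / nice0_new);
	return 0;
}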

2011-04-28 18:53:54

by Paul Turner

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/18] Increase resolution of load weights

On Thu, Apr 28, 2011 at 11:51 AM, Paul Turner <[email protected]> wrote:
> On Thu, Apr 28, 2011 at 11:33 AM, Nikhil Rao <[email protected]> wrote:
>> On Thu, Apr 28, 2011 at 5:12 AM, Srivatsa Vaddagiri
>> <[email protected]> wrote:
>>> On Thu, Apr 28, 2011 at 05:18:27PM +0530, Nikunj A. Dadhania wrote:
>>>> --- kernel/sched.c.orig       2011-04-28 16:34:24.000000000 +0530
>>>> +++ kernel/sched.c    2011-04-28 16:36:29.000000000 +0530
>>>> @@ -1336,7 +1336,7 @@ calc_delta_mine(unsigned long delta_exec
>>>>                       lw->inv_weight = 1 + (WMULT_CONST - w/2) / (w + 1);
>>>>       }
>>>>
>>>> -     tmp = (u64)delta_exec * weight;
>>>> +     tmp = (u64)delta_exec * (weight >> SCHED_LOAD_RESOLUTION);
>>>
>>> Should we be fixing inv_weight rather to account for SCHED_LOAD_RESOLUTION here?
>>>
>>
>> Yes, I have been looking into fixing inv_weight and calc_delta_mine()
>> calculations based on the assumption that we have u64 weights. IMO the
>> function is complicated because the return value needs to be
>> calculated to fit into unsigned long. I would like to update users of
>> calc_delta_mine() to use u64 instead of unsigned longs and I think
>> this can be easily done (quick inspection of the code shows two call
>> sites that need to be updated - update_curr() and wakeup_gran()).
>> Without the restriction to fit into unsigned long, I think we can make
>> calc_delta_mine() and the inv_weight calculations simpler.
>>
>
> I don't think you have much room to maneuver here, the calculations in
> c_d_m() are already u64 based, even on 32bit.  Changing the external
> load factors to 64 bit doesn't change this.
>
> We lose fairness in cdm beyond 32 bits, at the old LOAD_SCALE=10
> you've got 22 bits with which you can maintain fairness. This gives
> total accuracy in total curr on any delta <= ~4ms (for a NICE_0 task).
>  If you bump this up (and don't downshift before computing the inverse
> as you are) then you start introducing rounding errors beyond ~4us.
>
> This would also be further exacerbated in sched_period() since that's
> using the total cfs_rq weight.
>

(Downshifting as you are nicely avoids all this because we don't
really need fairness at the load-balance resolution; as Nikunj points
out above, it was just the overflow detection that was broken.)

>> -Thanks,
>> Nikhil
>>
>

2011-04-28 21:28:24

by Nikhil Rao

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/18] Increase resolution of load weights

On Thu, Apr 28, 2011 at 11:51 AM, Paul Turner <[email protected]> wrote:
> On Thu, Apr 28, 2011 at 11:33 AM, Nikhil Rao <[email protected]> wrote:
>> On Thu, Apr 28, 2011 at 5:12 AM, Srivatsa Vaddagiri
>> <[email protected]> wrote:
>>> On Thu, Apr 28, 2011 at 05:18:27PM +0530, Nikunj A. Dadhania wrote:
>>>> --- kernel/sched.c.orig       2011-04-28 16:34:24.000000000 +0530
>>>> +++ kernel/sched.c    2011-04-28 16:36:29.000000000 +0530
>>>> @@ -1336,7 +1336,7 @@ calc_delta_mine(unsigned long delta_exec
>>>>                       lw->inv_weight = 1 + (WMULT_CONST - w/2) / (w + 1);
>>>>       }
>>>>
>>>> -     tmp = (u64)delta_exec * weight;
>>>> +     tmp = (u64)delta_exec * (weight >> SCHED_LOAD_RESOLUTION);
>>>
>>> Should we be fixing inv_weight rather to account for SCHED_LOAD_RESOLUTION here?
>>>
>>
>> Yes, I have been looking into fixing inv_weight and calc_delta_mine()
>> calculations based on the assumption that we have u64 weights. IMO the
>> function is complicated because the return value needs to be
>> calculated to fit into unsigned long. I would like to update users of
>> calc_delta_mine() to use u64 instead of unsigned longs and I think
>> this can be easily done (quick inspection of the code shows two call
>> sites that need to be updated - update_curr() and wakeup_gran()).
>> Without the restriction to fit into unsigned long, I think we can make
>> calc_delta_mine() and the inv_weight calculations simpler.
>>
>
> I don't think you have much room to maneuver here, the calculations in
> c_d_m() are already u64 based, even on 32bit.  Changing the external
> load factors to 64 bit doesn't change this.
>
> We lose fairness in cdm beyond 32 bits, at the old LOAD_SCALE=10
> you've got 22 bits with which you can maintain fairness. This gives
> total accuracy in total curr on any delta <= ~4ms (for a NICE_0 task).
>  If you bump this up (and don't downshift before computing the inverse
> as you are) then you start introducing rounding errors beyond ~4us.
>
> This would also be further exacerbated in sched_period() since that's
> using the total cfs_rq weight.
>

Hmm... yeah, I think you have nicely explained the problem at hand. As
you said, we can't bump up inv_weight without losing accuracy. We want
to keep the 32-bit bound because it avoids the division. I was
exploring other optimizations in this function, but I think what Paul
pointed out is key here.

-Thanks
Nikhil

>> -Thanks,
>> Nikhil
>>
>

2011-04-29 16:56:17

by Paul Turner

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/18] Increase resolution of load weights

On Thu, Apr 28, 2011 at 2:27 PM, Nikhil Rao <[email protected]> wrote:
> On Thu, Apr 28, 2011 at 11:51 AM, Paul Turner <[email protected]> wrote:
>> On Thu, Apr 28, 2011 at 11:33 AM, Nikhil Rao <[email protected]> wrote:
>>> On Thu, Apr 28, 2011 at 5:12 AM, Srivatsa Vaddagiri
>>> <[email protected]> wrote:
>>>> On Thu, Apr 28, 2011 at 05:18:27PM +0530, Nikunj A. Dadhania wrote:
>>>>> --- kernel/sched.c.orig       2011-04-28 16:34:24.000000000 +0530
>>>>> +++ kernel/sched.c    2011-04-28 16:36:29.000000000 +0530
>>>>> @@ -1336,7 +1336,7 @@ calc_delta_mine(unsigned long delta_exec
>>>>>                       lw->inv_weight = 1 + (WMULT_CONST - w/2) / (w + 1);
>>>>>       }
>>>>>
>>>>> -     tmp = (u64)delta_exec * weight;
>>>>> +     tmp = (u64)delta_exec * (weight >> SCHED_LOAD_RESOLUTION);
>>>>
>>>> Should we be fixing inv_weight rather to account for SCHED_LOAD_RESOLUTION here?
>>>>
>>>
>>> Yes, I have been looking into fixing inv_weight and calc_delta_mine()
>>> calculations based on the assumption that we have u64 weights. IMO the
>>> function is complicated because the return value needs to be
>>> calculated to fit into unsigned long. I would like to update users of
>>> calc_delta_mine() to use u64 instead of unsigned longs and I think
>>> this can be easily done (quick inspection of the code shows two call
>>> sites that need to be updated - update_curr() and wakeup_gran()).
>>> Without the restriction to fit into unsigned long, I think we can make
>>> calc_delta_mine() and the inv_weight calculations simpler.
>>>
>>
>> I don't think you have much room to maneuver here, the calculations in
>> c_d_m() are already u64 based, even on 32bit.  Changing the external
>> load factors to 64 bit doesn't change this.
>>
>> We lose fairness in cdm beyond 32 bits, at the old LOAD_SCALE=10
>> you've got 22 bits with which you can maintain fairness. This gives
>> total accuracy in total curr on any delta <= ~4ms (for a NICE_0 task).
>>  If you bump this up (and don't downshift before computing the inverse
>> as you are) then you start introducing rounding errors beyond ~4us.
>>
>> This would also be further exacerbated in sched_period() since that's
>> using the total cfs_rq weight.
>>
>
> Hmm... yeah, I think you have nicely explained the problem at hand. As
> you said, we can't bump up inv_weight without losing accuracy. We want
> to keep the 32-bit bound because it avoids the division. I was
> exploring other optimizations in this functions, but I think what Paul
> pointed out is key here.
>

Correcting myself a tiny bit here: the 32-bit inverse covers the load
weight; at 28-bit load weights it's this (the inverse) that goes to
hell, not the numerator (time). The end effect of mushy fairness is
the same.
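
Worked numbers for that correction, again taking WMULT_CONST as 2^32; this is
just an illustration, not kernel code:

/*
 * The 32-bit inverse of a 28-bit weight has almost no bits left, so
 * the rounding error per division jumps from ~0.006% to ~6%.
 */
#include <stdio.h>

int main(void)
{
	unsigned long long wmult_const = 1ULL << 32;

	/* old MAX_SHARES-sized weight (2^18): inverse is ~2^14 */
	printf("inv(2^18) = %llu\n", wmult_const >> 18);

	/* new 28-bit weight: inverse is only 16 */
	printf("inv(2^28) = %llu\n", wmult_const >> 28);
	return 0;
}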

> -Thanks
> Nikhil
>
>>> -Thanks,
>>> Nikhil
>>>
>>
>