Hi,
I have been building tunable sched_mc=N patches on top of the existing
sched_mc_power_savings code and adding more logic to
find_busiest_group().
References:
[1] Making power policy just work
http://lwn.net/Articles/287924/
[2] [RFC v1] Tunable sched_mc_power_savings=n
http://lwn.net/Articles/287882/
[3] [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n
http://lwn.net/Articles/297306/
Peter Zijlstra had suggested that it is a good idea to clean up the
current code in find_busiest_group() before building on the existing
power savings balance infrastructure [4]. This is all the more
important given that there have been recent bugs in the power
savings code that were hard to detect and fix [5][6].
[4] http://lkml.org/lkml/2008/9/8/103
References to bugs:
[5] sched: Fix __load_balance_iterator() for cfq with only one task
http://lkml.org/lkml/2008/9/5/135
[6] sched: arch_reinit_sched_domains() must destroy domains to force rebuild
http://lkml.org/lkml/2008/8/29/191
http://lkml.org/lkml/2008/8/29/343
In an attempt to modularize the find_busiest_group() function and make
it extensible to more complex load balance decisions, I have defined
new data structures and helper functions and made find_busiest_group()
small and readable.
*** This is an RFC patch, with limited testing ***
ChangeLog:
---------
v2: Fixed most coding errors; able to run kernbench on a 32-bit Intel
SMP system. Fixed errors in comments.
v1: Initial post http://lkml.org/lkml/2008/9/24/201
Please let me know if the approach is correct. I will test further and
ensure the expected functionality.
Thanks,
Vaidy
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
After applying the patch series, the function will look like this:
/*
* find_busiest_group finds and returns the busiest CPU group within the
* domain. It calculates and returns the amount of weighted load which
* should be moved to restore balance via the imbalance parameter.
*/
static struct sched_group *
find_busiest_group(struct sched_domain *sd, int this_cpu,
unsigned long *imbalance, enum cpu_idle_type idle,
int *sd_idle, const cpumask_t *cpus, int *balance)
{
struct sched_group *group = sd->groups;
unsigned long max_pull;
int load_idx;
struct group_loads gl;
struct sd_loads sdl;
memset(&sdl, 0, sizeof(sdl));
sdl.sd = sd;
/* Get the load index corresponding to cpu idle state */
load_idx = get_load_idx(sd, idle);
do {
int need_balance;
need_balance = get_group_loads(group, this_cpu, cpus, idle,
load_idx, &gl);
if (*sd_idle && gl.nr_running)
*sd_idle = 0;
if (!need_balance && balance) {
*balance = 0;
*imbalance = 0;
return NULL;
}
/* Compare groups and find busiest non-local group */
update_sd_loads(&sdl, &gl);
/* Compare groups and find power saving candidates */
update_powersavings_group_loads(&sdl, &gl, idle);
group = group->next;
} while (group != sd->groups);
if (!sdl.busiest.group ||
sdl.local.load_per_cpu >= sdl.max_load ||
sdl.busiest.nr_running == 0)
goto out_balanced;
sdl.load_per_cpu = (SCHED_LOAD_SCALE * sdl.load) / sdl.cpu_power;
if (sdl.local.load_per_cpu >= sdl.load_per_cpu ||
100*sdl.busiest.load_per_cpu <=
sd->imbalance_pct*sdl.local.load_per_cpu)
goto out_balanced;
if (sdl.busiest.group_imbalance)
sdl.busiest.avg_load_per_task =
min(sdl.busiest.avg_load_per_task, sdl.load_per_cpu);
/*
* We're trying to get all the cpus to the average_load, so we don't
* want to push ourselves above the average load, nor do we wish to
* reduce the max loaded cpu below the average load, as either of these
* actions would just result in more rebalancing later, and ping-pong
* tasks around. Thus we look for the minimum possible imbalance.
* Negative imbalances (*we* are more loaded than anyone else) will
* be counted as no imbalance for these purposes -- we can't fix that
* by pulling tasks to us. Be careful of negative numbers as they'll
* appear as very large values with unsigned longs.
*/
if (sdl.busiest.load_per_cpu <= sdl.busiest.avg_load_per_task)
goto out_balanced;
/*
* In the presence of smp nice balancing, certain scenarios can have
* max load less than avg load(as we skip the groups at or below
* its cpu_power, while calculating max_load..)
* In this condition attempt to adjust the imbalance parameter
* in the small_imbalance functions.
*
* Now if max_load is more than avg load, balancing is needed,
* find the exact number of tasks to be moved.
*/
if (sdl.busiest.load_per_cpu >= sdl.load_per_cpu) {
/* Don't want to pull so many tasks that
* a group would go idle
*/
max_pull = min(sdl.busiest.load_per_cpu - sdl.load_per_cpu,
sdl.busiest.load_per_cpu -
sdl.busiest.avg_load_per_task);
/* How much load to actually move to equalise the imbalance */
*imbalance = min(max_pull * sdl.busiest.group->__cpu_power,
(sdl.load_per_cpu - sdl.local.load_per_cpu) *
sdl.local.group->__cpu_power) /
SCHED_LOAD_SCALE;
/* If we have adjusted the required imbalance, then return */
if (*imbalance >= sdl.busiest.avg_load_per_task)
return sdl.busiest.group;
}
/*
* if *imbalance is less than the average load per runnable task
* there is no guarantee that any tasks will be moved so we'll have
* a think about bumping its value to force at least one task to be
* moved
*/
*imbalance = 0; /* Will be adjusted below */
if (small_imbalance_one_task(&sdl, imbalance))
return sdl.busiest.group;
/* Further look for effective cpu power utilisation */
small_imbalance_optimize_cpu_power(&sdl, imbalance);
/*
* Unconditional return, we have tried all possible means to adjust
* the imbalance for effective task move
*/
return sdl.busiest.group;
out_balanced:
/* Try opportunity for power save balance */
return powersavings_balance_group(&sdl, &gl, idle, imbalance);
}
---
Vaidyanathan Srinivasan (5):
sched: split find_busiest_group()
sched: small imbalance corrections
sched: collect statistics required for powersave balance
sched: calculate statistics for current load balance domain
sched: load calculation for each group in sched domain
kernel/sched.c | 627 ++++++++++++++++++++++++++++++++++----------------------
1 files changed, 384 insertions(+), 243 deletions(-)
--
Add data structures for per group stats, and a function
to calculate the required per group stats.
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
kernel/sched.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 114 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index ad1962d..ab77937 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3064,6 +3064,120 @@ static int move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
return 0;
}
+/* Helper functions for find_busiest_group */
+
+int get_load_idx(struct sched_domain *sd, enum cpu_idle_type idle)
+{
+ if (idle == CPU_NOT_IDLE)
+ return sd->busy_idx;
+ else if (idle == CPU_NEWLY_IDLE)
+ return sd->newidle_idx;
+ else
+ return sd->idle_idx;
+}
+
+/* Struct to return group stats */
+
+struct group_loads {
+ struct sched_group *group;
+ unsigned long nr_running;
+ unsigned long load; /* Decay average load */
+ unsigned long load_per_cpu; /* Decay load / cpu_power */
+ unsigned long weighted_load; /* Instantaneous load (load.weight)*/
+ unsigned long avg_load_per_task; /* Instantaneous load/ nr_running */
+ unsigned int group_imbalance;
+ int local_group;
+ int balance_cpu;
+};
+
+/* Helper function to calculate basic group level stats */
+
+int get_group_loads(struct sched_group *group, int this_cpu,
+ const cpumask_t *valid_cpus, enum cpu_idle_type idle,
+ int load_idx,
+ struct group_loads *gl)
+{
+ struct rq *rq;
+ unsigned long load, min_load, max_load, avg_load_per_task_per_cpu;
+ int cpu;
+ int local_group = 0;
+ int first_idle_cpu = -1;
+ int need_balance = 1;
+
+ gl->group = group;
+ gl->nr_running = 0;
+ gl->load = 0;
+ gl->weighted_load = 0;
+ gl->avg_load_per_task = 0;
+ gl->group_imbalance = 0;
+ gl->balance_cpu = -1;
+ max_load = 0;
+ min_load = ~0UL;
+
+ gl->local_group = cpu_isset(this_cpu, group->cpumask);
+
+ for_each_cpu_mask_nr(cpu, group->cpumask) {
+ if (!cpu_isset(cpu, *valid_cpus))
+ continue;
+
+ rq = cpu_rq(cpu);
+
+ /* Bias balancing toward cpus of our domain */
+ if (gl->local_group) {
+ if (idle_cpu(cpu) && first_idle_cpu == -1)
+ first_idle_cpu = cpu;
+
+ load = target_load(cpu, load_idx);
+ } else {
+ load = source_load(cpu, load_idx);
+ if (load > max_load)
+ max_load = load;
+ if (load < min_load)
+ min_load = load;
+ }
+ gl->nr_running += rq->nr_running;
+ gl->load += load;
+ gl->weighted_load += weighted_cpuload(cpu);
+ gl->avg_load_per_task += cpu_avg_load_per_task(cpu);
+ }
+
+ /*
+ * Consider the group unbalanced when the imbalance is larger
+ * than the average weight of two tasks.
+ *
+ * APZ: with cgroup the avg task weight can vary wildly and
+ * might not be a suitable number - should we keep a
+ * normalized nr_running number somewhere that negates
+ * the hierarchy?
+ */
+
+ avg_load_per_task_per_cpu = sg_div_cpu_power(group,
+ gl->avg_load_per_task * SCHED_LOAD_SCALE);
+
+ if (!gl->local_group &&
+ ((max_load - min_load) > 2*avg_load_per_task_per_cpu))
+ gl->group_imbalance = 1;
+
+ if (gl->local_group) {
+ if (first_idle_cpu != -1)
+ gl->balance_cpu = first_idle_cpu;
+ else
+ gl->balance_cpu = first_cpu(group->cpumask);
+
+ /*
+ * First idle cpu or the first cpu(busiest) in this sched group
+ * is eligible for doing load balancing at this and above
+ * domains. In the newly idle case, we will allow all the cpu's
+ * to do the newly idle load balance.
+ */
+ if (idle != CPU_NEWLY_IDLE && gl->balance_cpu != this_cpu)
+ need_balance = 0;
+ }
+ gl->load_per_cpu = sg_div_cpu_power(group, gl->load * SCHED_LOAD_SCALE);
+
+ return need_balance;
+}
+
/*
* find_busiest_group finds and returns the busiest CPU group within the
* domain. It calculates and returns the amount of weighted load which
Add data structures and a function to collect the per sched domain
statistics required for load balance decisions.
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
kernel/sched.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 51 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index ab77937..cfd83d9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3178,6 +3178,57 @@ int get_group_loads(struct sched_group *group, int this_cpu,
return need_balance;
}
+/* Struct to hold sched domain level loads */
+
+struct sd_loads {
+ struct sched_domain *sd;
+ unsigned long load; /* Total decay load */
+ unsigned long cpu_power; /* Total cpu_power */
+ unsigned long max_load; /* load of busiest group */
+ unsigned long load_per_cpu; /* Decay load per cpu in this sd
+ * (load/cpu_power)
+ */
+ struct group_loads local; /* this_cpu's group,
+ * need to pull from busiest
+ */
+ struct group_loads busiest; /* Candidate group to pull */
+ /* sched_mc power_save balance groups */
+ struct group_loads min_load_group; /* Group that is least loaded and
+ * can fit the tasks into the
+ * leader group
+ */
+ struct group_loads power_save_leader_group; /* Group that is almost
+ * full but can still pull
+ * tasks from
+ * min_load_group
+ */
+};
+
+/* Function to aggregate per group loads to sched domain loads */
+
+void update_sd_loads(struct sd_loads *sdl, struct group_loads *gl)
+{
+ int group_capacity = gl->group->__cpu_power / SCHED_LOAD_SCALE;
+
+ if (gl->local_group)
+ sdl->local = *gl;
+ else {
+ sdl->load += gl->load;
+ sdl->cpu_power += gl->group->__cpu_power;
+
+
+ /* Find the busiest */
+ if (gl->load > sdl->max_load &&
+ (gl->nr_running > group_capacity ||
+ gl->group_imbalance)) {
+
+ sdl->max_load = gl->load;
+ sdl->busiest = *gl;
+ }
+ }
+}
+
/*
* find_busiest_group finds and returns the busiest CPU group within the
* domain. It calculates and returns the amount of weighted load which
Update sched domain level statistics with the minimum loaded group and
the group leader that can pull more tasks. Also suggest a power save
movement if the domain is otherwise balanced.
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
kernel/sched.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 96 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index cfd83d9..c99b5bd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3229,6 +3229,102 @@ void update_sd_loads(struct sd_loads *sdl, struct group_loads *gl)
}
}
+#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+void update_powersavings_group_loads(struct sd_loads *sdl,
+ struct group_loads *gl,
+ enum cpu_idle_type idle)
+{
+ int group_capacity = gl->group->__cpu_power / SCHED_LOAD_SCALE;
+
+ /*
+ * Busy processors will not participate in power savings
+ * balance.
+ */
+ if (idle == CPU_NOT_IDLE ||
+ !(sdl->sd->flags & SD_POWERSAVINGS_BALANCE))
+ return;
+
+ /*
+ * If this group is idle or completely loaded, then there
+ * is no opportunity to do power savings balance with this group
+ */
+ if (gl->nr_running >= group_capacity || gl->nr_running == 0)
+ return;
+
+ /*
+ * Calculate the group which has the least non-idle load.
+ * This is the group from where we need to pick up the load
+ * for saving power
+ */
+ if (!sdl->min_load_group.group)
+ sdl->min_load_group = *gl;
+ else {
+ if (gl->nr_running < sdl->min_load_group.nr_running)
+ sdl->min_load_group = *gl;
+ /* If the loads are equal, then prefer the cpu with
+ * less logical number
+ */
+ else if (gl->nr_running == sdl->min_load_group.nr_running &&
+ first_cpu(gl->group->cpumask) <
+ first_cpu(sdl->min_load_group.group->cpumask))
+ sdl->min_load_group = *gl;
+ }
+
+ /*
+ * Calculate the group which is almost near its
+ * capacity but still has some space to pick up some load
+ * from other group and save more power
+ */
+
+ if (gl->nr_running > 0 && gl->nr_running <= group_capacity - 1) {
+ if (!sdl->power_save_leader_group.group)
+ sdl->power_save_leader_group = *gl;
+ else {
+ if (gl->nr_running >
+ sdl->power_save_leader_group.nr_running)
+ sdl->power_save_leader_group = *gl;
+ else if (gl->nr_running ==
+ sdl->power_save_leader_group.nr_running &&
+ first_cpu(gl->group->cpumask) <
+ first_cpu(sdl->power_save_leader_group.group->cpumask))
+ sdl->power_save_leader_group = *gl;
+ }
+ }
+}
+
+static struct sched_group *powersavings_balance_group(struct sd_loads *sdl,
+ struct group_loads *gl, enum cpu_idle_type idle,
+ unsigned long *imbalance)
+{
+ *imbalance = 0;
+ if (idle == CPU_NOT_IDLE || !(sdl->sd->flags & SD_POWERSAVINGS_BALANCE))
+ return NULL;
+
+ if (sdl->local.group == sdl->power_save_leader_group.group &&
+ sdl->power_save_leader_group.group !=
+ sdl->min_load_group.group) {
+ *imbalance = sdl->min_load_group.avg_load_per_task;
+ return sdl->min_load_group.group;
+ }
+
+ return NULL;
+}
+#else
+void update_powersavings_group_loads(struct sd_loads *sdl,
+ struct group_loads *gl, enum cpu_idle_type idle)
+{
+ return;
+}
+
+static struct sched_group *powersavings_balance_group(struct sd_loads *sdl,
+ struct group_loads *gl, enum cpu_idle_type idle,
+ unsigned long *imbalance)
+{
+ *imbalance = 0;
+ return NULL;
+}
+#endif
+
/*
* find_busiest_group finds and returns the busiest CPU group within the
* domain. It calculates and returns the amount of weighted load which
Add functions to bump up the imbalance to eventually initiate
a task move to balance across groups.
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
kernel/sched.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 74 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index c99b5bd..cf1aae1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3229,6 +3229,80 @@ void update_sd_loads(struct sd_loads *sdl, struct group_loads *gl)
}
}
+/* Bump up imbalance to one task so that some task movement can happen */
+
+int small_imbalance_one_task(struct sd_loads *sdl, unsigned long *imbalance)
+{
+ unsigned int imbn;
+ imbn = 2;
+ if (sdl->local.nr_running) {
+ if (sdl->busiest.avg_load_per_task >
+ sdl->local.avg_load_per_task)
+ imbn = 1;
+ }
+
+ if (sdl->busiest.load_per_cpu - sdl->local.load_per_cpu +
+ 2*sdl->busiest.avg_load_per_task >=
+ sdl->busiest.avg_load_per_task * imbn) {
+ *imbalance = sdl->busiest.avg_load_per_task;
+ return 1;
+ }
+ return 0;
+}
+
+/*
+ * Adjust imbalance to move task if the result of the move will
+ * yield better use of cpu power
+ */
+
+void small_imbalance_optimize_cpu_power(struct sd_loads *sdl,
+ unsigned long *imbalance)
+{
+ unsigned long tmp, pwr_now, pwr_move;
+ pwr_move = pwr_now = 0;
+
+ /*
+ * OK, we don't have enough imbalance to justify moving tasks,
+ * however we may be able to increase total CPU power used by
+ * moving them.
+ */
+
+ pwr_now += sdl->busiest.group->__cpu_power *
+ min(sdl->busiest.avg_load_per_task,
+ sdl->busiest.load_per_cpu);
+ pwr_now += sdl->local.group->__cpu_power *
+ min(sdl->local.avg_load_per_task,
+ sdl->local.load_per_cpu);
+ pwr_now /= SCHED_LOAD_SCALE;
+
+ /* Amount of load we'd subtract */
+ tmp = sg_div_cpu_power(sdl->busiest.group,
+ sdl->busiest.avg_load_per_task * SCHED_LOAD_SCALE);
+ if (sdl->busiest.load_per_cpu > tmp)
+ pwr_move += sdl->busiest.group->__cpu_power *
+ min(sdl->busiest.avg_load_per_task,
+ sdl->busiest.load_per_cpu - tmp);
+
+ /* Amount of load we'd add */
+ if (sdl->busiest.load_per_cpu * sdl->busiest.group->__cpu_power <
+ sdl->busiest.avg_load_per_task * SCHED_LOAD_SCALE)
+ tmp = sg_div_cpu_power(sdl->local.group,
+ sdl->busiest.load_per_cpu *
+ sdl->busiest.group->__cpu_power);
+ else
+ tmp = sg_div_cpu_power(sdl->local.group,
+ sdl->busiest.avg_load_per_task * SCHED_LOAD_SCALE);
+ pwr_move += sdl->local.group->__cpu_power *
+ min(sdl->local.avg_load_per_task,
+ sdl->local.load_per_cpu + tmp);
+ pwr_move /= SCHED_LOAD_SCALE;
+
+ /* Move if we gain throughput */
+ if (pwr_move > pwr_now)
+ *imbalance = sdl->busiest.avg_load_per_task;
+
+}
+
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
void update_powersavings_group_loads(struct sd_loads *sdl,
struct group_loads *gl,
Make use of the defined helper functions and data structures and
split the find_busiest_group() function into smaller modular functions.
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
kernel/sched.c | 328 +++++++++++---------------------------------------------
1 files changed, 67 insertions(+), 261 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index cf1aae1..004c065 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3398,7 +3398,6 @@ static struct sched_group *powersavings_balance_group(struct sd_loads *sdl,
return NULL;
}
#endif
-
/*
* find_busiest_group finds and returns the busiest CPU group within the
* domain. It calculates and returns the amount of weighted load which
@@ -3409,208 +3408,56 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
unsigned long *imbalance, enum cpu_idle_type idle,
int *sd_idle, const cpumask_t *cpus, int *balance)
{
- struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
- unsigned long max_load, avg_load, total_load, this_load, total_pwr;
+ struct sched_group *group = sd->groups;
unsigned long max_pull;
- unsigned long busiest_load_per_task, busiest_nr_running;
- unsigned long this_load_per_task, this_nr_running;
- int load_idx, group_imb = 0;
-#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
- int power_savings_balance = 1;
- unsigned long leader_nr_running = 0, min_load_per_task = 0;
- unsigned long min_nr_running = ULONG_MAX;
- struct sched_group *group_min = NULL, *group_leader = NULL;
-#endif
+ int load_idx;
+ struct group_loads gl;
+ struct sd_loads sdl;
- max_load = this_load = total_load = total_pwr = 0;
- busiest_load_per_task = busiest_nr_running = 0;
- this_load_per_task = this_nr_running = 0;
+ memset(&sdl, 0, sizeof(sdl));
+ sdl.sd = sd;
- if (idle == CPU_NOT_IDLE)
- load_idx = sd->busy_idx;
- else if (idle == CPU_NEWLY_IDLE)
- load_idx = sd->newidle_idx;
- else
- load_idx = sd->idle_idx;
+ /* Get the load index corresponding to cpu idle state */
+ load_idx = get_load_idx(sd, idle);
do {
- unsigned long load, group_capacity, max_cpu_load, min_cpu_load;
- int local_group;
- int i;
- int __group_imb = 0;
- unsigned int balance_cpu = -1, first_idle_cpu = 0;
- unsigned long sum_nr_running, sum_weighted_load;
- unsigned long sum_avg_load_per_task;
- unsigned long avg_load_per_task;
-
- local_group = cpu_isset(this_cpu, group->cpumask);
-
- if (local_group)
- balance_cpu = first_cpu(group->cpumask);
+ int need_balance;
- /* Tally up the load of all CPUs in the group */
- sum_weighted_load = sum_nr_running = avg_load = 0;
- sum_avg_load_per_task = avg_load_per_task = 0;
-
- max_cpu_load = 0;
- min_cpu_load = ~0UL;
-
- for_each_cpu_mask_nr(i, group->cpumask) {
- struct rq *rq;
-
- if (!cpu_isset(i, *cpus))
- continue;
+ need_balance = get_group_loads(group, this_cpu, cpus, idle,
+ load_idx, &gl);
- rq = cpu_rq(i);
+ if (*sd_idle && gl.nr_running)
+ *sd_idle = 0;
- if (*sd_idle && rq->nr_running)
- *sd_idle = 0;
-
- /* Bias balancing toward cpus of our domain */
- if (local_group) {
- if (idle_cpu(i) && !first_idle_cpu) {
- first_idle_cpu = 1;
- balance_cpu = i;
- }
-
- load = target_load(i, load_idx);
- } else {
- load = source_load(i, load_idx);
- if (load > max_cpu_load)
- max_cpu_load = load;
- if (min_cpu_load > load)
- min_cpu_load = load;
- }
-
- avg_load += load;
- sum_nr_running += rq->nr_running;
- sum_weighted_load += weighted_cpuload(i);
-
- sum_avg_load_per_task += cpu_avg_load_per_task(i);
- }
-
- /*
- * First idle cpu or the first cpu(busiest) in this sched group
- * is eligible for doing load balancing at this and above
- * domains. In the newly idle case, we will allow all the cpu's
- * to do the newly idle load balance.
- */
- if (idle != CPU_NEWLY_IDLE && local_group &&
- balance_cpu != this_cpu && balance) {
+ if (!need_balance && balance) {
*balance = 0;
- goto ret;
+ *imbalance = 0;
+ return NULL;
}
- total_load += avg_load;
- total_pwr += group->__cpu_power;
+ /* Compare groups and find busiest non-local group */
+ update_sd_loads(&sdl, &gl);
+ /* Compare groups and find power saving candidates */
+ update_powersavings_group_loads(&sdl, &gl, idle);
- /* Adjust by relative CPU power of the group */
- avg_load = sg_div_cpu_power(group,
- avg_load * SCHED_LOAD_SCALE);
-
-
- /*
- * Consider the group unbalanced when the imbalance is larger
- * than the average weight of two tasks.
- *
- * APZ: with cgroup the avg task weight can vary wildly and
- * might not be a suitable number - should we keep a
- * normalized nr_running number somewhere that negates
- * the hierarchy?
- */
- avg_load_per_task = sg_div_cpu_power(group,
- sum_avg_load_per_task * SCHED_LOAD_SCALE);
-
- if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
- __group_imb = 1;
-
- group_capacity = group->__cpu_power / SCHED_LOAD_SCALE;
-
- if (local_group) {
- this_load = avg_load;
- this = group;
- this_nr_running = sum_nr_running;
- this_load_per_task = sum_weighted_load;
- } else if (avg_load > max_load &&
- (sum_nr_running > group_capacity || __group_imb)) {
- max_load = avg_load;
- busiest = group;
- busiest_nr_running = sum_nr_running;
- busiest_load_per_task = sum_weighted_load;
- group_imb = __group_imb;
- }
-
-#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
- /*
- * Busy processors will not participate in power savings
- * balance.
- */
- if (idle == CPU_NOT_IDLE ||
- !(sd->flags & SD_POWERSAVINGS_BALANCE))
- goto group_next;
-
- /*
- * If the local group is idle or completely loaded
- * no need to do power savings balance at this domain
- */
- if (local_group && (this_nr_running >= group_capacity ||
- !this_nr_running))
- power_savings_balance = 0;
-
- /*
- * If a group is already running at full capacity or idle,
- * don't include that group in power savings calculations
- */
- if (!power_savings_balance || sum_nr_running >= group_capacity
- || !sum_nr_running)
- goto group_next;
-
- /*
- * Calculate the group which has the least non-idle load.
- * This is the group from where we need to pick up the load
- * for saving power
- */
- if ((sum_nr_running < min_nr_running) ||
- (sum_nr_running == min_nr_running &&
- first_cpu(group->cpumask) <
- first_cpu(group_min->cpumask))) {
- group_min = group;
- min_nr_running = sum_nr_running;
- min_load_per_task = sum_weighted_load /
- sum_nr_running;
- }
-
- /*
- * Calculate the group which is almost near its
- * capacity but still has some space to pick up some load
- * from other group and save more power
- */
- if (sum_nr_running <= group_capacity - 1) {
- if (sum_nr_running > leader_nr_running ||
- (sum_nr_running == leader_nr_running &&
- first_cpu(group->cpumask) >
- first_cpu(group_leader->cpumask))) {
- group_leader = group;
- leader_nr_running = sum_nr_running;
- }
- }
-group_next:
-#endif
group = group->next;
} while (group != sd->groups);
- if (!busiest || this_load >= max_load || busiest_nr_running == 0)
+ if (!sdl.busiest.group ||
+ sdl.local.load_per_cpu >= sdl.max_load ||
+ sdl.busiest.nr_running == 0)
goto out_balanced;
- avg_load = (SCHED_LOAD_SCALE * total_load) / total_pwr;
+ sdl.load_per_cpu = (SCHED_LOAD_SCALE * sdl.load) / sdl.cpu_power;
- if (this_load >= avg_load ||
- 100*max_load <= sd->imbalance_pct*this_load)
+ if (sdl.local.load_per_cpu >= sdl.load_per_cpu ||
+ 100*sdl.busiest.load_per_cpu <=
+ sd->imbalance_pct*sdl.local.load_per_cpu)
goto out_balanced;
- busiest_load_per_task /= busiest_nr_running;
- if (group_imb)
- busiest_load_per_task = min(busiest_load_per_task, avg_load);
+ if (sdl.busiest.group_imbalance)
+ sdl.busiest.avg_load_per_task =
+ min(sdl.busiest.avg_load_per_task, sdl.load_per_cpu);
/*
* We're trying to get all the cpus to the average_load, so we don't
@@ -3623,104 +3470,63 @@ group_next:
* by pulling tasks to us. Be careful of negative numbers as they'll
* appear as very large values with unsigned longs.
*/
- if (max_load <= busiest_load_per_task)
+ if (sdl.busiest.load_per_cpu <= sdl.busiest.avg_load_per_task)
goto out_balanced;
/*
* In the presence of smp nice balancing, certain scenarios can have
* max load less than avg load(as we skip the groups at or below
* its cpu_power, while calculating max_load..)
+ * In this condition attempt to adjust the imbalance parameter
+ * in the small_imbalance functions.
+ *
+ * Now if max_load is more than avg load, balancing is needed,
+ * find the exact number of tasks to be moved.
*/
- if (max_load < avg_load) {
- *imbalance = 0;
- goto small_imbalance;
- }
+ if (sdl.busiest.load_per_cpu >= sdl.load_per_cpu) {
+
+ /* Don't want to pull so many tasks that
+ * a group would go idle
+ */
+ max_pull = min(sdl.busiest.load_per_cpu - sdl.load_per_cpu,
+ sdl.busiest.load_per_cpu -
+ sdl.busiest.avg_load_per_task);
+
+ /* How much load to actually move to equalise the imbalance */
+ *imbalance = min(max_pull * sdl.busiest.group->__cpu_power,
+ (sdl.load_per_cpu - sdl.local.load_per_cpu) *
+ sdl.local.group->__cpu_power) /
+ SCHED_LOAD_SCALE;
- /* Don't want to pull so many tasks that a group would go idle */
- max_pull = min(max_load - avg_load, max_load - busiest_load_per_task);
+ /* If we have adjusted the required imbalance, then return */
+ if (*imbalance >= sdl.busiest.avg_load_per_task)
+ return sdl.busiest.group;
- /* How much load to actually move to equalise the imbalance */
- *imbalance = min(max_pull * busiest->__cpu_power,
- (avg_load - this_load) * this->__cpu_power)
- / SCHED_LOAD_SCALE;
+ }
/*
* if *imbalance is less than the average load per runnable task
- * there is no gaurantee that any tasks will be moved so we'll have
+ * there is no guarantee that any tasks will be moved so we'll have
* a think about bumping its value to force at least one task to be
* moved
*/
- if (*imbalance < busiest_load_per_task) {
- unsigned long tmp, pwr_now, pwr_move;
- unsigned int imbn;
-
-small_imbalance:
- pwr_move = pwr_now = 0;
- imbn = 2;
- if (this_nr_running) {
- this_load_per_task /= this_nr_running;
- if (busiest_load_per_task > this_load_per_task)
- imbn = 1;
- } else
- this_load_per_task = cpu_avg_load_per_task(this_cpu);
-
- if (max_load - this_load + 2*busiest_load_per_task >=
- busiest_load_per_task * imbn) {
- *imbalance = busiest_load_per_task;
- return busiest;
- }
+ *imbalance = 0; /* Will be adjusted below */
- /*
- * OK, we don't have enough imbalance to justify moving tasks,
- * however we may be able to increase total CPU power used by
- * moving them.
- */
+ if (small_imbalance_one_task(&sdl, imbalance))
+ return sdl.busiest.group;
- pwr_now += busiest->__cpu_power *
- min(busiest_load_per_task, max_load);
- pwr_now += this->__cpu_power *
- min(this_load_per_task, this_load);
- pwr_now /= SCHED_LOAD_SCALE;
-
- /* Amount of load we'd subtract */
- tmp = sg_div_cpu_power(busiest,
- busiest_load_per_task * SCHED_LOAD_SCALE);
- if (max_load > tmp)
- pwr_move += busiest->__cpu_power *
- min(busiest_load_per_task, max_load - tmp);
-
- /* Amount of load we'd add */
- if (max_load * busiest->__cpu_power <
- busiest_load_per_task * SCHED_LOAD_SCALE)
- tmp = sg_div_cpu_power(this,
- max_load * busiest->__cpu_power);
- else
- tmp = sg_div_cpu_power(this,
- busiest_load_per_task * SCHED_LOAD_SCALE);
- pwr_move += this->__cpu_power *
- min(this_load_per_task, this_load + tmp);
- pwr_move /= SCHED_LOAD_SCALE;
+ /* Further look for effective cpu power utilisation */
+ small_imbalance_optimize_cpu_power(&sdl, imbalance);
- /* Move if we gain throughput */
- if (pwr_move > pwr_now)
- *imbalance = busiest_load_per_task;
- }
-
- return busiest;
+ /*
+ * Unconditional return, we have tried all possible means to adjust
+ * the imbalance for effective task move
+ */
+ return sdl.busiest.group;
out_balanced:
-#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
- if (idle == CPU_NOT_IDLE || !(sd->flags & SD_POWERSAVINGS_BALANCE))
- goto ret;
-
- if (this == group_leader && group_leader != group_min) {
- *imbalance = min_load_per_task;
- return group_min;
- }
-#endif
-ret:
- *imbalance = 0;
- return NULL;
+ /* Try opportunity for power save balance */
+ return powersavings_balance_group(&sdl, &gl, idle, imbalance);
}
/*
Hi,
I've looked over the last posting, but so far failed to respond :/
While I like the clean-up, I'm atm a bit reluctant to apply, as I'm
trying to rework the sched_domain stuff and the HT balancing and I'm not
quite sure how this will interact with your patches.
I hope to share my proposal (in talk - no patches yet) soonish (this
weekend, early next week) so that we can discuss it.
Thanks for your patience,
Peter
* Peter Zijlstra <[email protected]> [2008-10-09 16:19:28]:
>
> Hi,
>
> I've looked over the last posting, but so far failed to respond :/
>
> While I like the clean-up, I'm atm a bit reluctant to apply, as I'm
> trying to rework the sched_domain stuff and the HT balancing and I'm not
> quite sure how this will interact with your patches.
Hi Peter,
Thanks for reviewing the changes. I just wanted some initial feedback
before I spend too much time on this approach. I am interested in the HT
balancing work and want to study its behaviour and improve the
power saving heuristics there as well.
> I hope to share my proposal (in talk - no patches yet) soonish (this
> weekend, early next week) so that we can discuss it.
I am looking forward to it. We can definitely rework the
implementation to seamlessly incorporate the power saving heuristics.
Thanks,
Vaidy
Hi,
So the basic issue is sched_group::cpu_power should become more dynamic.
There are two driving factors:
- RT time consumption feedback into CFS
- dynamic per-cpu power management like Intel's Dynamic Speed
Technology (formerly known as Turbo Mode).
We currently have sched_group::cpu_power to model SMT. We say that
multiple threads that share a core are not as powerful as two cores.
Therefore, we move a task when doing so results in more of that power
being utilized, resulting in preferring to run tasks on full cores
instead of SMT siblings.
RT time
-------
So the basic issue is that we cannot know how much cpu-time will be
consumed by RT tasks (we used to relate that to the number of running
tasks, but that's utter nonsense).
Therefore the only way is to measure it and assume the near future looks
like the near past.
So why is this an issue? Suppose we have 2 cpus, and 1 cpu is 50%
consumed by RT tasks, while the other is fully available to regular
tasks.
In that case we'd want to load-balance such that the cpu affected by the
RT task(s) gets half the load the other cpu has.
[ I tried modelling this by scaling the load of cpus up, but that fails
to handle certain cases - for instance 100% RT gets real funny, and it
fails to properly skip the RT-loaded cores in the low-load situation ]
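Scaling the group's cpu_power down by the measured RT fraction, rather
than scaling load up, would look roughly like this (illustrative sketch
only, not part of this series; rt_time and period are assumed to come
from some hypothetical measurement infrastructure):

	/*
	 * Illustrative only: scale a cpu's power by the fraction of a
	 * period left over after RT execution time.
	 */
	static unsigned long scale_power_by_rt(unsigned long power,
					       u64 rt_time, u64 period)
	{
		u64 available;

		if (!period)
			return power;

		available = period > rt_time ? period - rt_time : 0;

		/* 100% RT leaves no cpu_power for SCHED_OTHER */
		return (unsigned long)div64_u64(power * available, period);
	}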
Dynamic Speed Technology
------------------------
With cpus actively fiddling with their processing capacity we get into
similar issues. Again we can measure this, but this would require the
addition of a clock that measures work instead of time.
Having that, we can even accurately measure the old SMT case, which has
always been approximated by a static percentage - even though the actual
gain is very workload dependent.
The idea is to introduce sched_work_clock() so that:
work_delta / time_delta gives the power for a cpu. <1 means we
did less work than a dedicated pipeline, >1 means we did more.
So if, for example, our core's canonical freq would be 2.0GHz but we get
boosted to 2.2GHz while the other core gets lowered to 1.8GHz, we
can observe and attribute this asymmetric power balance.
[ This assumes that the total power is preserved for non-idle situations
- is that true? If not, this gets real interesting ]
Also, an SMT thread, when sharing the core with its sibling will get <1,
but together they might be >1.
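A sketch of how such a work clock could feed a per-cpu power value
(sched_work_clock() is only a proposal here; the helper below is purely
illustrative):

	/*
	 * Purely illustrative: turn work and time deltas into a power
	 * value in SCHED_LOAD_SCALE fixed point (1024 == a dedicated,
	 * non-boosted pipeline).
	 */
	static unsigned long observed_power(u64 work_delta, u64 time_delta)
	{
		if (!time_delta)
			return SCHED_LOAD_SCALE;

		/* >1024: boosted core; <1024: SMT sibling or throttled */
		return (unsigned long)div64_u64(work_delta * SCHED_LOAD_SCALE,
						time_delta);
	}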
Funny corner cases
------------------
As mentioned in the RT time note, there is the possibility that a core
has 0 power (left) for SCHED_OTHER. This has a consequence for the
balance cpu. Currently we let the first cpu in the domain do the
balancing; however, if that CPU has 0 power it might not be the best
choice (esp since part of the balancing can be done from softirq context
- which would basically starve that group).
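To make that concrete, the balance cpu selection could skip zero-power
cpus along these lines (illustrative sketch only; the per-cpu power
array is hypothetical):

	/*
	 * Illustrative only: pick the first cpu in the group that still
	 * has non-zero power for SCHED_OTHER, falling back to the first
	 * cpu in the group.
	 */
	static int pick_balance_cpu(const cpumask_t *mask,
				    const unsigned long *cpu_power)
	{
		int cpu;

		for_each_cpu_mask_nr(cpu, *mask) {
			if (cpu_power[cpu])
				return cpu;
		}

		return first_cpu(*mask);
	}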
Sched domains
-------------
There seems to be a use-case where we need both the cache and the
package levels. So I wanted to have both levels in there.
Currently each domain level can only be one of:
SD_LV_NONE = 0,
SD_LV_SIBLING,
SD_LV_MC,
SD_LV_CPU,
SD_LV_NODE,
SD_LV_ALLNODES,
So to avoid a double domain with 'native' multi-core chips where the
cache and package level have the same span, I want to encode this
information in the sched_domain::flags as bits, which means a level can
be both cache and package.
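To illustrate the encoding (the flag names and values below are made up
and are not existing SD_* flags):

	/* Hypothetical level bits -- not existing kernel flags */
	#define SD_LEVEL_CACHE		0x10000	/* spans a shared cache */
	#define SD_LEVEL_PACKAGE	0x20000	/* spans a physical package */

	static inline int sd_is_cache_and_package(struct sched_domain *sd)
	{
		/* a 'native' multi-core chip: one domain with both bits set */
		return (sd->flags & (SD_LEVEL_CACHE | SD_LEVEL_PACKAGE)) ==
			(SD_LEVEL_CACHE | SD_LEVEL_PACKAGE);
	}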
Over balancing
--------------
Lastly, we might need to introduce SD_OVER_BALANCE, which toggles the
over-balance logic. While over-balancing brings better fairness for a
number of cases, it's also hard on power savings.
* Peter Zijlstra <[email protected]> [2008-10-14 14:09:13]:
>
> Hi,
>
> So the basic issue is sched_group::cpu_power should become more dynamic.
Hi Peter,
This is a good idea. Dynamically increasing cpu power for some groups
will automatically help power savings when we want to consolidate load
onto one cpu package while overall system utilisation is very low.
> There are two driving factors:
> - RT time consumption feedback into CFS
> - dynamic per-cpu power manangement like Intel's Dynamic Speed
> Technology (formerly know as Turbo Mode).
>
> We currently have sched_group::cpu_power to model SMT. We say that
> multiple threads that share a core are not as powerful as two cores.
>
> Therefore, we move a task when doing so results in more of that power
> being utilized, resulting in preferring to run tasks on full cores
> instead of SMT siblings.
>
>
> RT time
> -------
>
> So the basic issue is that we cannot know how much cpu-time will be
> consumed by RT tasks (we used to relate that to the number of running
> tasks, but that's utter nonsense).
>
> Therefore the only way is to measure it and assume the near future looks
> like the near past.
>
> So why is this an issue.. suppose we have 2 cpus, and 1 cpu is consumed
> for 50% by RT tasks, while the other is fully available to regular
> tasks.
>
> In that case we'd want to load-balance such that the cpu affected by the
> RT task(s) gets half the load the other cpu has.
>
> [ I tried modelling this by scaling the load of cpus up, but that fails
> to handle certain cases - for instance 100% RT gets real funny, and it
> fails to properly skip the RT-loaded cores in the low-load situation ]
>
> Dynamic Speed Technology
> ------------------------
>
> With cpus actively fiddling with their processing capacity we get into
> similar issues. Again we can measure this, but this would require the
> addition of a clock that measures work instead of time.
>
> Having that, we can even acturately measure the old SMT case, which has
> always been approximated by a static percentage - even though the actual
> gain is very workload dependent.
>
> The idea is to introduce sched_work_clock() so that:
>
> work_delta / time_delta gives the power for a cpu. <1 means we
> did less work than a dedicated pipeline, >1 means we did more.
The challenge here is measurement of 'work'. What will be the
parameter that will be fair for most workloads and easy to measure on
most systems?
* Instructions completion count
* APERF or similar CPU specific counter on x86
* POWER has PURR and SPURR to have a measure of relative work done
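For the x86 APERF/MPERF case above, for example, the relative work done
could be approximated along these lines (hypothetical sketch, not
proposed code; it assumes the standard rdmsrl() accessors):

	/*
	 * Hypothetical x86 sketch: relative work done over an interval,
	 * approximated by delta(APERF)/delta(MPERF), scaled to
	 * SCHED_LOAD_SCALE fixed point.
	 */
	static u64 work_ratio_since(u64 *aperf_prev, u64 *mperf_prev)
	{
		u64 aperf, mperf, da, dm;

		rdmsrl(MSR_IA32_APERF, aperf);
		rdmsrl(MSR_IA32_MPERF, mperf);

		da = aperf - *aperf_prev;
		dm = mperf - *mperf_prev;
		*aperf_prev = aperf;
		*mperf_prev = mperf;

		if (!dm)
			return SCHED_LOAD_SCALE;

		return div64_u64(da * SCHED_LOAD_SCALE, dm);
	}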
> So, if for example our core's canonical freq would be 2.0GHz but we get
> boosted to 2.2GHz while the other core would get lowered to 1.8GHz we
> can observe and attribute this asymetric power balance.
>
> [ This assumes that the total power is preserved for non-idle situations
> - is that true?, if not this gets real interesting ]
I would assume total compute power will be preserved over a long
period of time. But certain workloads can benefit more from acceleration
on the same system, challenging the above assumption.
> Also, an SMT thread, when sharing the core with its sibling will get <1,
> but together they might be >1.
In this case, what is the normalised value of '1'? It is difficult to
estimate the nominal cpu power with threads. If we assume the
normalised value to be the theoretical max, then the sum of both threads
can be less than 1 and will never reach 1 in practice :)
> Funny corner cases
> ------------------
>
> Like mentioned in the RT time note, there is the possiblity that a core
> has 0 power (left) for SCHED_OTHER. This has a consequence for the
> balance cpu. Currently we let the first cpu in the domain do the
> balancing, however if that CPU has 0 power it might not be the best
> choice (esp since part of the balancing can be done from softirq context
> - which would basically starve that group).
Agreed, but relatively easy to solve compared to the other challenges :)
> Sched domains
> -------------
>
> There seems to be a use-case where we need both the cache and the
> package levels. So I wanted to have both levels in there.
>
> Currently each domain level can only be one of:
>
> SD_LV_NONE = 0,
> SD_LV_SIBLING,
> SD_LV_MC,
> SD_LV_CPU,
> SD_LV_NODE,
> SD_LV_ALLNODES,
>
> So to avoid a double domain with 'native' multi-core chips where the
> cache and package level have the same span, I want to encode this
> information in the sched_domain::flags as bits, which means a level can
> be both cache and package.
This will help power savings balance and make the implementation
clean. You have suggested this previously as well.
Similarly, could we collapse the NODE level if it is redundant?
> Over balancing
> --------------
>
> Lastly, we might need to introduce SD_OVER_BALANCE, which toggles the
> over-balance logic. While over-balancing brings better fairness for a
> number of cases, its also hard on power savings.
I did not understand this over-balance. Can you please explain?
Thanks,
Vaidy
On Tue, 2008-10-14 at 18:37 +0530, Vaidyanathan Srinivasan wrote:
> * Peter Zijlstra <[email protected]> [2008-10-14 14:09:13]:
>
> >
> > Hi,
> >
> > So the basic issue is sched_group::cpu_power should become more dynamic.
>
> Hi Peter,
>
> This is a good idea. Dynamically increasing cpu power to some groups
> will automatically help power savings when we want to consolidate
> better to one cpu package when overall system utilisation is very low.
Ah, yes another use case of this ;-)
> > Dynamic Speed Technology
> > ------------------------
> >
> > With cpus actively fiddling with their processing capacity we get into
> > similar issues. Again we can measure this, but this would require the
> > addition of a clock that measures work instead of time.
> >
> > Having that, we can even acturately measure the old SMT case, which has
> > always been approximated by a static percentage - even though the actual
> > gain is very workload dependent.
> >
> > The idea is to introduce sched_work_clock() so that:
> >
> > work_delta / time_delta gives the power for a cpu. <1 means we
> > did less work than a dedicated pipeline, >1 means we did more.
>
> The challenge here is measurement of 'work'. What will be the
> parameter that will be fair for most workloads and easy to measure on
> most systems?
>
> * Instructions completion count
> * APERF or similar CPU specific counter on x86
> * POWER has PURR and SPURR to have a measure of relative work done
Right - I was hoping for some feedback from the arch folks (maybe I
should have CC'ed linux-arch) on this issue.
> > So, if for example our core's canonical freq would be 2.0GHz but we get
> > boosted to 2.2GHz while the other core would get lowered to 1.8GHz we
> > can observe and attribute this asymetric power balance.
> >
> > [ This assumes that the total power is preserved for non-idle situations
> > - is that true?, if not this gets real interesting ]
>
> I would assume total compute power will be preserved over a long
> period of time. But certain workloads can benefit more from acceleration
> on the same system challenging the above assumption.
Yes, trouble is that as soon as it's not a constant, we get into a
generic optimisation problem, which I'd rather not try to solve in the
CPU scheduler.
> > Also, an SMT thread, when sharing the core with its sibling will get <1,
> > but together they might be >1.
>
> In this case what is the normalised value of '1' It is difficult to
> estimate the nominal cpu power with threads. If we can assume
> normalised value to be theoretical max, then sum of both threads can
> be less than 1 and will never achieve 1 in practice :)
Agreed, getting a normalized value is possibly non-trivial. If we'd look
at things like (avg) cpu-speed over a measured time interval it's doable,
but once we start looking at instructions completed and similar things
this might be a little more difficult.
Then again, we could perhaps re-normalize the value such that the avg
value over all cpus ends up being 1 - though SMT might ruin this
scheme.
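The re-normalisation step itself would be simple enough (sketch only;
the per-cpu observed power values are assumed to exist):

	/*
	 * Illustrative only: rescale observed per-cpu power so the
	 * system-wide average comes out at SCHED_LOAD_SCALE (i.e. 1).
	 */
	static void renormalize_power(unsigned long *power, int nr_cpus)
	{
		unsigned long sum = 0;
		int i;

		for (i = 0; i < nr_cpus; i++)
			sum += power[i];

		if (!sum)
			return;

		for (i = 0; i < nr_cpus; i++)
			power[i] = power[i] * nr_cpus * SCHED_LOAD_SCALE / sum;
	}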
> > Funny corner cases
> > ------------------
> >
> > Like mentioned in the RT time note, there is the possiblity that a core
> > has 0 power (left) for SCHED_OTHER. This has a consequence for the
> > balance cpu. Currently we let the first cpu in the domain do the
> > balancing, however if that CPU has 0 power it might not be the best
> > choice (esp since part of the balancing can be done from softirq context
> > - which would basically starve that group).
>
> Agreed, but relative easy to solve compared to other challenges :)
Yes, I just tossed it in to not forget about it ;-)
The thing is, I know I wanted to write about two corner cases, but I've
already forgotten one. I'm still hoping it will occur to me again :-)
> > Sched domains
> > -------------
> >
> > There seems to be a use-case where we need both the cache and the
> > package levels. So I wanted to have both levels in there.
> >
> > Currently each domain level can only be one of:
> >
> > SD_LV_NONE = 0,
> > SD_LV_SIBLING,
> > SD_LV_MC,
> > SD_LV_CPU,
> > SD_LV_NODE,
> > SD_LV_ALLNODES,
> >
> > So to avoid a double domain with 'native' multi-core chips where the
> > cache and package level have the same span, I want to encode this
> > information in the sched_domain::flags as bits, which means a level can
> > be both cache and package.
>
> This will help power savings balance and make the implementation
> clean. You have suggested this previously also.
> Similarly collapse the NODE level if it is redundant?
Exactly, using a bitmask type system allows doing that more easily,
because then a domain can be multiple types at once.
We can even consider adding more information like shared_l1, shared_l2
and shared_l3, so that we have the full cache hierarchy available.
> > Over balancing
> > --------------
> >
> > Lastly, we might need to introduce SD_OVER_BALANCE, which toggles the
> > over-balance logic. While over-balancing brings better fairness for a
> > number of cases, its also hard on power savings.
>
> I did not understand this over balance. Can you please explain.
Ah, let's assume a 2 cpu system.
When presented with 2 tasks of weight 1024 and 1536, this constitutes an
infeasible weight distribution.
There is no work-conserving way to schedule those two tasks, such that
their received cpu-time is proportionally fair.
However, when presented with 3 tasks each of weight 1024, this is
statically-infeasible but dynamically-feasible.
That is, there is no static distribution of tasks such that each task
receives a proportionally fair share of the cpu-time. However, by
rotating the excess task between the 2 cpus, we can ensure fairness.
In the latter case, we always have an imbalance between runqueues which
is smaller than 1 task.
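To spell out the 3-task arithmetic (worked numbers, not from the
original mail):

	2 cpus, 3 tasks of weight 1024 each, total weight 3072:
	  fair share per task        = 2 * 1024/3072 = 2/3 of a cpu
	  static split (2 + 1 tasks) = 1/2, 1/2 and 1 cpu  (unfair)
	  rotating the third task    = 2/3 of a cpu each, over time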
In this case we can do two things: not schedule, in which case we choose
to maintain the static distribution - this is called under-scheduling;
or move 1 task despite the fact that we'll tip the imbalance the
other way around - this is called over-scheduling.
As you can see, over-scheduling allows for fairness in more cases, but
at some expense.
A lot of people prefer the added determinism of the extra fairness, but
not everybody. Hence the tunable.
* Peter Zijlstra <[email protected]> [2008-10-14 15:25:03]:
> On Tue, 2008-10-14 at 18:37 +0530, Vaidyanathan Srinivasan wrote:
> > * Peter Zijlstra <[email protected]> [2008-10-14 14:09:13]:
> >
> > >
> > > Hi,
> > >
> > > So the basic issue is sched_group::cpu_power should become more dynamic.
> >
> > Hi Peter,
> >
> > This is a good idea. Dynamically increasing cpu power to some groups
> > will automatically help power savings when we want to consolidate
> > better to one cpu package when overall system utilisation is very low.
>
> Ah, yes another use case of this ;-)
>
> > > Dynamic Speed Technology
> > > ------------------------
> > >
> > > With cpus actively fiddling with their processing capacity we get into
> > > similar issues. Again we can measure this, but this would require the
> > > addition of a clock that measures work instead of time.
> > >
> > > Having that, we can even acturately measure the old SMT case, which has
> > > always been approximated by a static percentage - even though the actual
> > > gain is very workload dependent.
> > >
> > > The idea is to introduce sched_work_clock() so that:
> > >
> > > work_delta / time_delta gives the power for a cpu. <1 means we
> > > did less work than a dedicated pipeline, >1 means we did more.
> >
> > The challenge here is measurement of 'work'. What will be the
> > parameter that will be fair for most workloads and easy to measure on
> > most systems?
> >
> > * Instructions completion count
> > * APERF or similar CPU specific counter on x86
> > * POWER has PURR and SPURR to have a measure of relative work done
>
> Right - I was hoping for some feedback from the arch folks (maybe I
> should have CC'ed linux-arch) on this issue.
Hi Peter,
Do you want to post this RFD again to get more feedback?
--Vaidy