This patch set implements the rough power aware scheduling proposal:
https://lkml.org/lkml/2012/8/13/139.
The code is also available in this git tree:
https://github.com/alexshi/power-scheduling.git power-scheduling
The patch set defines a new policy, 'powersaving', that tries to pack tasks
at each sched group level. It can then save considerable power when the
number of tasks in the system is no more than the number of logical CPUs
(LCPUs).
As mentioned in the power aware scheduling proposal, power aware
scheduling has 2 assumptions:
1, race to idle is helpful for power saving
2, fewer active sched groups will reduce cpu power consumption
The first assumption makes the performance policy take over scheduling when
any group is busy.
The second assumption makes power aware scheduling try to pack dispersed
tasks into fewer groups.
Compared to the removed power balance code, this power balance has the
following advantages:
1, simpler sysfs interface
   only 2 sysfs files VS 2 files for each LCPU
2, covers all cpu topologies
   effective on all domain levels VS only working on SMT/MC domains
3, less task migration
   mutually exclusive perf/power LB VS balancing power on top of balanced performance
4, system load thrashing considered
   yes VS no
5, transitory tasks considered
   yes VS no
BTW, like sched numa, power aware scheduling is also a kind of CPU
locality oriented scheduling.
Thanks for the comments/suggestions from PeterZ, Linus Torvalds, Andrew Morton,
Ingo, Len Brown, Arjan, Borislav Petkov, PJT, Namhyung Kim, Mike
Galbraith, Greg, Preeti, Morten Rasmussen, Rafael etc.
Since the patch set packs tasks into fewer groups as intended, I just show
some performance/power testing data here:
=========================================
$for ((i = 0; i < x; i++)) ; do while true; do :; done & done
On my SNB laptop with 4 cores * HT (the data is average watts):
powersaving performance
x = 8 72.9482 72.6702
x = 4 61.2737 66.7649
x = 2 44.8491 59.0679
x = 1 43.225 43.0638
On an SNB EP machine with 2 sockets * 8 cores * HT:
powersaving performance
x = 32 393.062 395.134
x = 16 277.438 376.152
x = 8 209.33 272.398
x = 4 199 238.309
x = 2 175.245 210.739
x = 1 174.264 173.603
A benchmark whose task number keeps fluctuating, 'make -j <x> vmlinux',
on my SNB EP 2-socket machine with 8 cores * HT:
powersaving performance
x = 2 189.416 /228 23 193.355 /209 24
x = 4 215.728 /132 35 219.69 /122 37
x = 8 244.31 /75 54 252.709 /68 58
x = 16 299.915 /43 77 259.127 /58 66
x = 32 341.221 /35 83 323.418 /38 81
Data format, e.g. 189.416 /228 23:
189.416: average watts during compilation
228: seconds (compile time)
23: scaled performance/watt = 1000000 / seconds / watts
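For example, taking the powersaving x = 2 row above: 1000000 / 228 / 189.416
is roughly 23.2, which matches the 23 shown in the table.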
The kbuild performance is better with 16/32 threads; that's because the lazy
power balance reduces context switches and the CPU gets more boost
chances under powersaving balance.
Some performance testing results:
---------------------------------
Tested benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
hackbench, fileio-cfq of sysbench, dbench, aiostress, multi-threaded
loopback netperf, on my core2, nhm, wsm, snb platforms.
results:
A, no clear performance change found with the 'performance' policy.
B, specjbb2005 drops 5~7% with the powersaving policy, whether with openjdk
   or jrockit.
C, hackbench drops 40% with the powersaving policy on snb 4-socket platforms.
Others show no clear change.
===
Changelog:
V6 change:
a, remove 'balance' policy.
b, consider RT task effect in balancing
c, use avg_idle as burst wakeup indicator
d, balance on task utilization in fork/exec/wakeup.
e, no power balancing on SMT domain.
V5 change:
a, change sched_policy to sched_balance_policy
b, split fork/exec/wake power balancing into 3 patches and refresh
commit logs
c, other minor cleanups
V4 change:
a, fix a few bugs and clean up code according to Morten Rasmussen, Mike
Galbraith and Namhyung Kim. Thanks!
b, take Morten Rasmussen's suggestion to use different criteria for
different policies in transitory task packing.
c, shorter latency in power aware scheduling.
V3 change:
a, engaged nr_running and utilization in periodic power balancing.
b, try packing small exec/wake tasks on a running cpu instead of an idle cpu.
V2 change:
a, add lazy power scheduling to deal with kbuild-like benchmarks.
-- Thanks Alex
[patch v6 01/21] Revert "sched: Introduce temporary FAIR_GROUP_SCHED
[patch v6 02/21] sched: set initial value of runnable avg for new
[patch v6 03/21] sched: only count runnable avg on cfs_rq's
[patch v6 04/21] sched: add sched balance policies in kernel
[patch v6 05/21] sched: add sysfs interface for sched_balance_policy
[patch v6 06/21] sched: log the cpu utilization at rq
[patch v6 07/21] sched: add new sg/sd_lb_stats fields for incoming
[patch v6 08/21] sched: move sg/sd_lb_stats struct ahead
[patch v6 09/21] sched: scale_rt_power rename and meaning change
[patch v6 10/21] sched: get rq potential maximum utilization
[patch v6 11/21] sched: detect wakeup burst with rq->avg_idle
[patch v6 12/21] sched: add power aware scheduling in fork/exec/wake
[patch v6 13/21] sched: using avg_idle to detect bursty wakeup
[patch v6 14/21] sched: packing transitory tasks in wakeup power
[patch v6 15/21] sched: add power/performance balance allow flag
[patch v6 16/21] sched: pull all tasks from source group
[patch v6 17/21] sched: no balance for prefer_sibling in power
[patch v6 18/21] sched: add new members of sd_lb_stats
[patch v6 19/21] sched: power aware load balance
[patch v6 20/21] sched: lazy power balance
[patch v6 21/21] sched: don't do power balance on share cpu power
Remove the CONFIG_FAIR_GROUP_SCHED dependency that guards the runnable info,
so that we can use the runnable load variables.
Signed-off-by: Alex Shi <[email protected]>
---
include/linux/sched.h | 7 +------
kernel/sched/core.c | 7 +------
kernel/sched/fair.c | 13 ++-----------
kernel/sched/sched.h | 9 +--------
4 files changed, 5 insertions(+), 31 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..5a4cf37 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1160,12 +1160,7 @@ struct sched_entity {
struct cfs_rq *my_q;
#endif
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/* Per-entity load-tracking */
struct sched_avg avg;
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f12624..54eaaa2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1561,12 +1561,7 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..9c2f726 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1109,8 +1109,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/*
* We choose a half-life close to 1 scheduling period.
* Note: The tables below are dependent on this value.
@@ -3394,12 +3393,6 @@ unlock:
}
/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
* Called immediately before a task is migrated to a new cpu; task_cpu(p) and
* cfs_rq_of(p) references at time of call are still valid and identify the
* previous cpu. However, the caller only guarantees p->pi_lock is held; no
@@ -3422,7 +3415,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
}
-#endif
#endif /* CONFIG_SMP */
static unsigned long
@@ -6114,9 +6106,8 @@ const struct sched_class fair_sched_class = {
#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq = migrate_task_rq_fair,
-#endif
+
.rq_online = rq_online_fair,
.rq_offline = rq_offline_fair,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..7f36024f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -227,12 +227,6 @@ struct cfs_rq {
#endif
#ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* CFS Load tracking
* Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -242,8 +236,7 @@ struct cfs_rq {
u64 runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
u32 tg_runnable_contrib;
u64 tg_load_contrib;
--
1.7.12
Power aware fork/exec/wake balancing needs both structs in incoming
patches, so move them ahead of their users.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 99 +++++++++++++++++++++++++++--------------------------
1 file changed, 50 insertions(+), 49 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dda6809..28bff43 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3301,6 +3301,56 @@ done:
}
/*
+ * sd_lb_stats - Structure to store the statistics of a sched_domain
+ * during load balancing.
+ */
+struct sd_lb_stats {
+ struct sched_group *busiest; /* Busiest group in this sd */
+ struct sched_group *this; /* Local group in this sd */
+ unsigned long total_load; /* Total load of all groups in sd */
+ unsigned long total_pwr; /* Total power of all groups in sd */
+ unsigned long avg_load; /* Average load across all groups in sd */
+
+ /** Statistics of this group */
+ unsigned long this_load;
+ unsigned long this_load_per_task;
+ unsigned long this_nr_running;
+ unsigned int this_has_capacity;
+ unsigned int this_idle_cpus;
+
+ /* Statistics of the busiest group */
+ unsigned int busiest_idle_cpus;
+ unsigned long max_load;
+ unsigned long busiest_load_per_task;
+ unsigned long busiest_nr_running;
+ unsigned long busiest_group_capacity;
+ unsigned int busiest_has_capacity;
+ unsigned int busiest_group_weight;
+
+ int group_imb; /* Is there imbalance in this sd */
+
+ /* Varibles of power awaring scheduling */
+ unsigned int sd_util; /* sum utilization of this domain */
+ struct sched_group *group_leader; /* Group which relieves group_min */
+};
+
+/*
+ * sg_lb_stats - stats of a sched_group required for load_balancing
+ */
+struct sg_lb_stats {
+ unsigned long avg_load; /*Avg load across the CPUs of the group */
+ unsigned long group_load; /* Total load over the CPUs of the group */
+ unsigned long sum_nr_running; /* Nr tasks running in the group */
+ unsigned long sum_weighted_load; /* Weighted load of group's tasks */
+ unsigned long group_capacity;
+ unsigned long idle_cpus;
+ unsigned long group_weight;
+ int group_imb; /* Is there an imbalance in the group ? */
+ int group_has_capacity; /* Is there extra capacity in the group? */
+ unsigned int group_util; /* sum utilization of group */
+};
+
+/*
* sched_balance_self: balance the current task (running on cpu) in domains
* that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
* SD_BALANCE_EXEC.
@@ -4175,55 +4225,6 @@ static unsigned long task_h_load(struct task_struct *p)
#endif
/********** Helpers for find_busiest_group ************************/
-/*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- * during load balancing.
- */
-struct sd_lb_stats {
- struct sched_group *busiest; /* Busiest group in this sd */
- struct sched_group *this; /* Local group in this sd */
- unsigned long total_load; /* Total load of all groups in sd */
- unsigned long total_pwr; /* Total power of all groups in sd */
- unsigned long avg_load; /* Average load across all groups in sd */
-
- /** Statistics of this group */
- unsigned long this_load;
- unsigned long this_load_per_task;
- unsigned long this_nr_running;
- unsigned long this_has_capacity;
- unsigned int this_idle_cpus;
-
- /* Statistics of the busiest group */
- unsigned int busiest_idle_cpus;
- unsigned long max_load;
- unsigned long busiest_load_per_task;
- unsigned long busiest_nr_running;
- unsigned long busiest_group_capacity;
- unsigned long busiest_has_capacity;
- unsigned int busiest_group_weight;
-
- int group_imb; /* Is there imbalance in this sd */
-
- /* Varibles of power awaring scheduling */
- unsigned int sd_util; /* sum utilization of this domain */
- struct sched_group *group_leader; /* Group which relieves group_min */
-};
-
-/*
- * sg_lb_stats - stats of a sched_group required for load_balancing
- */
-struct sg_lb_stats {
- unsigned long avg_load; /*Avg load across the CPUs of the group */
- unsigned long group_load; /* Total load over the CPUs of the group */
- unsigned long sum_nr_running; /* Nr tasks running in the group */
- unsigned long sum_weighted_load; /* Weighted load of group's tasks */
- unsigned long group_capacity;
- unsigned long idle_cpus;
- unsigned long group_weight;
- int group_imb; /* Is there an imbalance in the group ? */
- int group_has_capacity; /* Is there extra capacity in the group? */
- unsigned int group_util; /* sum utilization of group */
-};
/**
* get_sd_load_idx - Obtain the load index for a given sched domain.
--
1.7.12
If the woken task is transitory enough, it will have a chance to be
packed onto a cpu which is busy but still has time to take care of it.
For the powersaving policy, only a task with history util < 25% has a chance
to be packed. If there is no cpu eligible to handle it, the idlest cpu in
the leader group is used instead.
Morten Rasmussen caught a typo bug, and PeterZ reminded me to consider
rt_util. Thank you!
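To illustrate the 25% criterion with made-up numbers: FULL_UTIL is 1024 and
the packing test in find_leader_cpu() below is
	vacancy = FULL_UTIL - max_rq_util(i) - (putil << 2) > 0
A cpu already running at max_rq_util(i) = 600 can still take a task whose
putil is 90 (about 9%), since 1024 - 600 - 360 = 64 > 0. A task with
putil >= 256 (25%) can never produce a positive vacancy, so we fall back to
the idlest cpu in the leader group.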
Inspired-by: Vincent Guittot <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 48 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ae07190..0e48e55 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3459,19 +3459,60 @@ static inline int get_sd_sched_balance_policy(struct sched_domain *sd,
}
/*
+ * find_leader_cpu - find the busiest but still has enough free time cpu
+ * among the cpus in group.
+ */
+static int
+find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu,
+ int policy)
+{
+ int vacancy, min_vacancy = INT_MAX;
+ int leader_cpu = -1;
+ int i;
+ /* percentage of the task's util */
+ unsigned putil = (u64)(p->se.avg.runnable_avg_sum << SCHED_POWER_SHIFT)
+ / (p->se.avg.runnable_avg_period + 1);
+
+ /* bias toward local cpu */
+ if (cpumask_test_cpu(this_cpu, tsk_cpus_allowed(p)) &&
+ FULL_UTIL - max_rq_util(this_cpu) - (putil << 2) > 0)
+ return this_cpu;
+
+ /* Traverse only the allowed CPUs */
+ for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
+ if (i == this_cpu)
+ continue;
+
+ /* only light task allowed, putil < 25% */
+ vacancy = FULL_UTIL - max_rq_util(i) - (putil << 2);
+
+ if (vacancy > 0 && vacancy < min_vacancy) {
+ min_vacancy = vacancy;
+ leader_cpu = i;
+ }
+ }
+ return leader_cpu;
+}
+
+/*
* If power policy is eligible for this domain, and it has task allowed cpu.
* we will select CPU from this domain.
*/
static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
- struct task_struct *p, struct sd_lb_stats *sds)
+ struct task_struct *p, struct sd_lb_stats *sds, int wakeup)
{
int policy;
int new_cpu = -1;
policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
- if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
- new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
-
+ if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
+ if (wakeup)
+ new_cpu = find_leader_cpu(sds->group_leader,
+ p, cpu, policy);
+ /* for fork balancing and a little busy task */
+ if (new_cpu == -1)
+ new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+ }
return new_cpu;
}
@@ -3522,14 +3563,15 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
if (tmp->flags & sd_flag) {
sd = tmp;
- new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+ new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
+ sd_flag & SD_BALANCE_WAKE);
if (new_cpu != -1)
goto unlock;
}
}
if (affine_sd) {
- new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+ new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 1);
if (new_cpu != -1)
goto unlock;
--
1.7.12
A sleeping task has no utilization; when such tasks are woken up in a burst,
the zero utilization throws the scheduler out of balance, as in the aim7
benchmark.
rq->avg_idle is 'used to accommodate bursty loads in a dirt simple
dirt cheap manner' -- Mike Galbraith.
With this cheap and smart bursty indicator, we can detect the wakeup
burst and use nr_running as the instant utilization in this scenario.
For other scenarios, we still use the precise CPU utilization to
judge if a domain is eligible for power scheduling.
Thanks for Mike Galbraith's idea!
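For example (an illustrative avg_idle value only): with
sysctl_sched_burst_threshold at its default of 1000000 ns, a rq whose
avg_idle has dropped to 300000 ns marks the wakeup as bursty. In that case
is_sd_full() below adds rq->nr_running for runqueues with more than one
runnable task instead of their scaled utilization, and the domain is judged
full once sds->sd_util >= sd->span_weight.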
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 33 ++++++++++++++++++++++++++-------
1 file changed, 26 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 83b2c39..ae07190 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3371,12 +3371,19 @@ static unsigned int max_rq_util(int cpu)
* Try to collect the task running number and capacity of the group.
*/
static void get_sg_power_stats(struct sched_group *group,
- struct sched_domain *sd, struct sg_lb_stats *sgs)
+ struct sched_domain *sd, struct sg_lb_stats *sgs, int burst)
{
int i;
- for_each_cpu(i, sched_group_cpus(group))
- sgs->group_util += max_rq_util(i);
+ for_each_cpu(i, sched_group_cpus(group)) {
+ struct rq *rq = cpu_rq(i);
+
+ if (burst && rq->nr_running > 1)
+ /* use nr_running as instant utilization */
+ sgs->group_util += rq->nr_running;
+ else
+ sgs->group_util += max_rq_util(i);
+ }
sgs->group_weight = group->group_weight;
}
@@ -3390,6 +3397,8 @@ static int is_sd_full(struct sched_domain *sd,
struct sched_group *group;
struct sg_lb_stats sgs;
long sd_min_delta = LONG_MAX;
+ int cpu = task_cpu(p);
+ int burst = 0;
unsigned int putil;
if (p->se.load.weight == p->se.avg.load_avg_contrib)
@@ -3399,15 +3408,21 @@ static int is_sd_full(struct sched_domain *sd,
putil = (u64)(p->se.avg.runnable_avg_sum << SCHED_POWER_SHIFT)
/ (p->se.avg.runnable_avg_period + 1);
+ if (cpu_rq(cpu)->avg_idle < sysctl_sched_burst_threshold)
+ burst = 1;
+
/* Try to collect the domain's utilization */
group = sd->groups;
do {
long g_delta;
memset(&sgs, 0, sizeof(sgs));
- get_sg_power_stats(group, sd, &sgs);
+ get_sg_power_stats(group, sd, &sgs, burst);
- g_delta = sgs.group_weight * FULL_UTIL - sgs.group_util;
+ if (burst)
+ g_delta = sgs.group_weight - sgs.group_util;
+ else
+ g_delta = sgs.group_weight * FULL_UTIL - sgs.group_util;
if (g_delta > 0 && g_delta < sd_min_delta) {
sd_min_delta = g_delta;
@@ -3417,8 +3432,12 @@ static int is_sd_full(struct sched_domain *sd,
sds->sd_util += sgs.group_util;
} while (group = group->next, group != sd->groups);
- if (sds->sd_util + putil < sd->span_weight * FULL_UTIL)
- return 0;
+ if (burst) {
+ if (sds->sd_util < sd->span_weight)
+ return 0;
+ } else
+ if (sds->sd_util + putil < sd->span_weight * FULL_UTIL)
+ return 0;
/* can not hold one more task in this domain */
return 1;
--
1.7.12
Add 4 new members to sd_lb_stats; they will be used in the incoming
power aware balance.
group_min; //least utilization group in domain
min_load_per_task; //load_per_task in group_min
leader_util; // sum utilization of group_leader
min_util; // sum utilization of group_min
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a53e2a..8605c28 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3333,6 +3333,10 @@ struct sd_lb_stats {
/* Varibles of power awaring scheduling */
unsigned int sd_util; /* sum utilization of this domain */
struct sched_group *group_leader; /* Group which relieves group_min */
+ struct sched_group *group_min; /* Least loaded group in sd */
+ unsigned long min_load_per_task; /* load_per_task in group_min */
+ unsigned int leader_util; /* sum utilizations of group_leader */
+ unsigned int min_util; /* sum utilizations of group_min */
};
/*
--
1.7.12
When the number of active tasks in a sched domain waves around the power
friendly scheduling criteria, scheduling will thrash between the power
friendly balance and the performance balance, bringing unnecessary task
migration. The typical benchmark is 'make -j x'.
To remove this issue, introduce a u64 perf_lb_record variable to record the
performance load balance history. We accept a power friendly LB if there was
no performance LB in the last 32 load balances and no more than 4 performance
LBs in the last 64, or if no balance happened at all for 8 * max_interval ms.
Otherwise, give up this power friendly LB chance and do nothing.
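To illustrate the check in need_perf_balance() below with an assumed history
value: perf_lb_record = 0x0000000300000000 means the last performance
balances happened 33 and 34 balances ago. The low 32 bits (PERF_LB_LH_MASK)
are zero and hweight64() of the high 32 bits (PERF_LB_HH_MASK) is 2 <= 4,
so a power friendly LB is accepted.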
With this patch, the worst case for power scheduling -- kbuild -- gets a
similar performance/watt value among the different policies.
Signed-off-by: Alex Shi <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 68 ++++++++++++++++++++++++++++++++++++++++++---------
2 files changed, 57 insertions(+), 12 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 226a515..4b9b810 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -906,6 +906,7 @@ struct sched_domain {
unsigned long last_balance; /* init to jiffies. units in jiffies */
unsigned int balance_interval; /* initialise to 1. units in ms. */
unsigned int nr_balance_failed; /* initialise to 0 */
+ u64 perf_lb_record; /* performance balance record */
u64 last_update;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8019106..430904b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4505,6 +4505,60 @@ static inline void update_sd_lb_power_stats(struct lb_env *env,
}
}
+#define PERF_LB_HH_MASK 0xffffffff00000000ULL
+#define PERF_LB_LH_MASK 0xffffffffULL
+
+/**
+ * need_perf_balance - Check if the performance load balance needed
+ * in the sched_domain.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics of the sched_domain
+ */
+static int need_perf_balance(struct lb_env *env, struct sd_lb_stats *sds)
+{
+ env->sd->perf_lb_record <<= 1;
+
+ if (env->flags & LBF_PERF_BAL) {
+ env->sd->perf_lb_record |= 0x1;
+ return 1;
+ }
+
+ /*
+ * The situation isn't eligible for performance balance. If this_cpu
+ * is not eligible or the timing is not suitable for lazy powersaving
+ * balance, we will stop both powersaving and performance balance.
+ */
+ if (env->flags & LBF_POWER_BAL && sds->this == sds->group_leader
+ && sds->group_leader != sds->group_min) {
+ int interval;
+
+ /* powersaving balance interval set as 8 * max_interval */
+ interval = msecs_to_jiffies(8 * env->sd->max_interval);
+ if (time_after(jiffies, env->sd->last_balance + interval))
+ env->sd->perf_lb_record = 0;
+
+ /*
+ * A eligible timing is no performance balance in last 32
+ * balance and performance balance is no more than 4 times
+ * in last 64 balance, or no balance in powersaving interval
+ * time.
+ */
+ if ((hweight64(env->sd->perf_lb_record & PERF_LB_HH_MASK) <= 4)
+ && !(env->sd->perf_lb_record & PERF_LB_LH_MASK)) {
+
+ env->imbalance = sds->min_load_per_task;
+ return 0;
+ }
+
+ }
+
+ /* give up this time power balancing, do nothing */
+ env->flags &= ~LBF_POWER_BAL;
+ sds->group_min = NULL;
+ return 0;
+}
+
/**
* get_sd_load_idx - Obtain the load index for a given sched domain.
* @sd: The sched_domain whose load_idx is to be obtained.
@@ -5124,18 +5178,8 @@ find_busiest_group(struct lb_env *env, int *balance)
*/
update_sd_lb_stats(env, balance, &sds);
- if (!(env->flags & LBF_POWER_BAL) && !(env->flags & LBF_PERF_BAL))
- return NULL;
-
- if (env->flags & LBF_POWER_BAL) {
- if (sds.this == sds.group_leader &&
- sds.group_leader != sds.group_min) {
- env->imbalance = sds.min_load_per_task;
- return sds.group_min;
- }
- env->flags &= ~LBF_POWER_BAL;
- return NULL;
- }
+ if (!need_perf_balance(env, &sds))
+ return sds.group_min;
/*
* this_cpu is not the appropriate cpu to perform load balancing at
--
1.7.12
In power aware scheduling, we don't want to balance 'prefer_sibling'
groups just because the local group has capacity.
If the local group has no tasks at the time, that is exactly what the
power balance hopes for.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d25fb3b..0a53e2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4772,8 +4772,12 @@ static inline void update_sd_lb_stats(struct lb_env *env,
* extra check prevents the case where you always pull from the
* heaviest group when it is already under-utilized (possible
* with a large weight task outweighs the tasks on the system).
+ *
+ * In power aware scheduling, we don't care load weight and
+ * want not to pull tasks just because local group has capacity.
*/
- if (prefer_sibling && !local_group && sds->this_has_capacity)
+ if (prefer_sibling && !local_group && sds->this_has_capacity
+ && env->flags & LBF_PERF_BAL)
sgs.group_capacity = min(sgs.group_capacity, 1UL);
if (local_group) {
--
1.7.12
Packing tasks within such a domain cannot save power; it only loses
performance. So do no power balance on it.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 430904b..88c8bd6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3513,7 +3513,7 @@ static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
- if (wakeup)
+ if (wakeup && !(sd->flags & SD_SHARE_CPUPOWER))
new_cpu = find_leader_cpu(sds->group_leader,
p, cpu, policy);
/* for fork balancing and a little busy task */
@@ -4420,8 +4420,9 @@ static unsigned long task_h_load(struct task_struct *p)
static inline void init_sd_lb_power_stats(struct lb_env *env,
struct sd_lb_stats *sds)
{
- if (sched_balance_policy == SCHED_POLICY_PERFORMANCE ||
- env->idle == CPU_NOT_IDLE) {
+ if (sched_balance_policy == SCHED_POLICY_PERFORMANCE
+ || env->sd->flags & SD_SHARE_CPUPOWER
+ || env->idle == CPU_NOT_IDLE) {
env->flags &= ~LBF_POWER_BAL;
env->flags |= LBF_PERF_BAL;
return;
--
1.7.12
This patch enables the power aware consideration in load balance.
As mentioned in the power aware scheduler proposal, power aware
scheduling has 2 assumptions:
1, race to idle is helpful for power saving
2, fewer active sched_groups will reduce power consumption
The first assumption makes the performance policy take over scheduling when
any scheduler group is busy.
The second assumption makes power aware scheduling try to pack dispersed
tasks into fewer groups.
The enabling logic, in summary:
1, Collect power aware scheduler statistics during performance load
balance statistics collection.
2, If the balancing cpu is eligible for power load balance, just do it
and forget performance load balance. If the domain is suitable for
power balance but the cpu is inappropriate (idle or full), stop both
power and performance balance in this domain. If the performance balance
is in use or any group is busy, do the performance balance.
The above logic is mainly implemented in update_sd_lb_power_stats(). It
decides whether a domain is suitable for power aware scheduling. If so,
it fills the dst group and source group accordingly.
This patch reuses some of Suresh's power saving load balance code.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 120 +++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 118 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8605c28..8019106 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -18,6 +18,9 @@
*
* Adaptive scheduling granularity, math enhancements by Peter Zijlstra
* Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <[email protected]>
+ *
+ * Powersaving balance policy added by Alex Shi
+ * Copyright (C) 2013 Intel, Alex Shi <[email protected]>
*/
#include <linux/latencytop.h>
@@ -4408,6 +4411,101 @@ static unsigned long task_h_load(struct task_struct *p)
/********** Helpers for find_busiest_group ************************/
/**
+ * init_sd_lb_power_stats - Initialize power savings statistics for
+ * the given sched_domain, during load balancing.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics for sd.
+ */
+static inline void init_sd_lb_power_stats(struct lb_env *env,
+ struct sd_lb_stats *sds)
+{
+ if (sched_balance_policy == SCHED_POLICY_PERFORMANCE ||
+ env->idle == CPU_NOT_IDLE) {
+ env->flags &= ~LBF_POWER_BAL;
+ env->flags |= LBF_PERF_BAL;
+ return;
+ }
+ env->flags &= ~LBF_PERF_BAL;
+ env->flags |= LBF_POWER_BAL;
+ sds->min_util = UINT_MAX;
+ sds->leader_util = 0;
+}
+
+/**
+ * update_sd_lb_power_stats - Update the power saving stats for a
+ * sched_domain while performing load balancing.
+ *
+ * @env: The load balancing environment.
+ * @group: sched_group belonging to the sched_domain under consideration.
+ * @sds: Variable containing the statistics of the sched_domain
+ * @local_group: Does group contain the CPU for which we're performing
+ * load balancing?
+ * @sgs: Variable containing the statistics of the group.
+ */
+static inline void update_sd_lb_power_stats(struct lb_env *env,
+ struct sched_group *group, struct sd_lb_stats *sds,
+ int local_group, struct sg_lb_stats *sgs)
+{
+ unsigned long threshold_util;
+
+ if (env->flags & LBF_PERF_BAL)
+ return;
+
+ threshold_util = sgs->group_weight * FULL_UTIL;
+
+ /*
+ * If the local group is idle or full loaded
+ * no need to do power savings balance at this domain
+ */
+ if (local_group && (!sgs->sum_nr_running ||
+ sgs->group_util + FULL_UTIL > threshold_util))
+ env->flags &= ~LBF_POWER_BAL;
+
+ /* Do performance load balance if any group overload */
+ if (sgs->group_util > threshold_util) {
+ env->flags |= LBF_PERF_BAL;
+ env->flags &= ~LBF_POWER_BAL;
+ }
+
+ /*
+ * If a group is idle,
+ * don't include that group in power savings calculations
+ */
+ if (!(env->flags & LBF_POWER_BAL) || !sgs->sum_nr_running)
+ return;
+
+ /*
+ * Calculate the group which has the least non-idle load.
+ * This is the group from where we need to pick up the load
+ * for saving power
+ */
+ if ((sgs->group_util < sds->min_util) ||
+ (sgs->group_util == sds->min_util &&
+ group_first_cpu(group) > group_first_cpu(sds->group_min))) {
+ sds->group_min = group;
+ sds->min_util = sgs->group_util;
+ sds->min_load_per_task = sgs->sum_weighted_load /
+ sgs->sum_nr_running;
+ }
+
+ /*
+ * Calculate the group which is almost near its
+ * capacity but still has some space to pick up some load
+ * from other group and save more power
+ */
+ if (sgs->group_util + FULL_UTIL > threshold_util)
+ return;
+
+ if (sgs->group_util > sds->leader_util ||
+ (sgs->group_util == sds->leader_util && sds->group_leader &&
+ group_first_cpu(group) < group_first_cpu(sds->group_leader))) {
+ sds->group_leader = group;
+ sds->leader_util = sgs->group_util;
+ }
+}
+
+/**
* get_sd_load_idx - Obtain the load index for a given sched domain.
* @sd: The sched_domain whose load_idx is to be obtained.
* @idle: The Idle status of the CPU for whose sd load_icx is obtained.
@@ -4644,6 +4742,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_load += load;
sgs->sum_nr_running += nr_running;
sgs->sum_weighted_load += weighted_cpuload(i);
+
+ /* add scaled rt utilization */
+ sgs->group_util += max_rq_util(i);
+
if (idle_cpu(i))
sgs->idle_cpus++;
}
@@ -4752,6 +4854,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
if (child && child->flags & SD_PREFER_SIBLING)
prefer_sibling = 1;
+ init_sd_lb_power_stats(env, sds);
load_idx = get_sd_load_idx(env->sd, env->idle);
do {
@@ -4803,6 +4906,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
sds->group_imb = sgs.group_imb;
}
+ update_sd_lb_power_stats(env, sg, sds, local_group, &sgs);
sg = sg->next;
} while (sg != env->sd->groups);
}
@@ -5020,6 +5124,19 @@ find_busiest_group(struct lb_env *env, int *balance)
*/
update_sd_lb_stats(env, balance, &sds);
+ if (!(env->flags & LBF_POWER_BAL) && !(env->flags & LBF_PERF_BAL))
+ return NULL;
+
+ if (env->flags & LBF_POWER_BAL) {
+ if (sds.this == sds.group_leader &&
+ sds.group_leader != sds.group_min) {
+ env->imbalance = sds.min_load_per_task;
+ return sds.group_min;
+ }
+ env->flags &= ~LBF_POWER_BAL;
+ return NULL;
+ }
+
/*
* this_cpu is not the appropriate cpu to perform load balancing at
* this level.
@@ -5197,7 +5314,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.idle = idle,
.loop_break = sched_nr_migrate_break,
.cpus = cpus,
- .flags = LBF_PERF_BAL,
+ .flags = LBF_POWER_BAL,
};
cpumask_copy(cpus, cpu_active_mask);
@@ -6275,7 +6392,6 @@ void unregister_fair_sched_group(struct task_group *tg, int cpu) { }
#endif /* CONFIG_FAIR_GROUP_SCHED */
-
static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task)
{
struct sched_entity *se = &task->se;
--
1.7.12
If a sched domain is idle enough for regular power balance, LBF_POWER_BAL
will be set and LBF_PERF_BAL will be cleared. If a sched domain is busy,
the flags are set the other way around.
If the domain is suitable for power balance but the balance should not
be done by this cpu (this cpu is already idle or full), both flags
are cleared to wait for a suitable cpu to do the power balance.
That means no balance at all, neither power balance nor performance
balance, will be done on this cpu.
The above logic will be implemented by incoming patches.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0e48e55..adccdb8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4023,6 +4023,8 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
#define LBF_ALL_PINNED 0x01
#define LBF_NEED_BREAK 0x02
#define LBF_SOME_PINNED 0x04
+#define LBF_POWER_BAL 0x08 /* if power balance allowed */
+#define LBF_PERF_BAL 0x10 /* if performance balance allowed */
struct lb_env {
struct sched_domain *sd;
@@ -5185,6 +5187,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.idle = idle,
.loop_break = sched_nr_migrate_break,
.cpus = cpus,
+ .flags = LBF_PERF_BAL,
};
cpumask_copy(cpus, cpu_active_mask);
--
1.7.12
In power balance, we hope some sched groups become completely empty so that
their CPUs can save power. So we want to move all tasks away from them.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index adccdb8..d25fb3b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5115,7 +5115,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
* When comparing with imbalance, use weighted_cpuload()
* which is not scaled with the cpu power.
*/
- if (capacity && rq->nr_running == 1 && wl > env->imbalance)
+ if (rq->nr_running == 0 ||
+ (!(env->flags & LBF_POWER_BAL) && capacity &&
+ rq->nr_running == 1 && wl > env->imbalance))
continue;
/*
@@ -5218,7 +5220,8 @@ redo:
ld_moved = 0;
lb_iterations = 1;
- if (busiest->nr_running > 1) {
+ if (busiest->nr_running > 1 ||
+ (busiest->nr_running == 1 && env.flags & LBF_POWER_BAL)) {
/*
* Attempt to move tasks. If find_busiest_group has found
* an imbalance but busiest->nr_running <= 1, the group is
--
1.7.12
For power aware balancing, we care about the sched domain's/group's
utilization. So add sd_lb_stats.sd_util and sg_lb_stats.group_util.
We also want to know which group is the busiest while still having the
capability to handle more tasks, so add sd_lb_stats.group_leader.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a165f52..dda6809 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4203,6 +4203,10 @@ struct sd_lb_stats {
unsigned int busiest_group_weight;
int group_imb; /* Is there imbalance in this sd */
+
+ /* Varibles of power awaring scheduling */
+ unsigned int sd_util; /* sum utilization of this domain */
+ struct sched_group *group_leader; /* Group which relieves group_min */
};
/*
@@ -4218,6 +4222,7 @@ struct sg_lb_stats {
unsigned long group_weight;
int group_imb; /* Is there an imbalance in the group ? */
int group_has_capacity; /* Is there extra capacity in the group? */
+ unsigned int group_util; /* sum utilization of group */
};
/**
--
1.7.12
This patch adds power aware scheduling in fork/exec/wake. It tries to
select a cpu from the busiest group that still has spare utilization. That
saves power since it leaves more groups idle in the system.
The trade off is adding power aware statistics collection in the group
seeking. But since the collection only happens when the power scheduling
eligibility condition holds, the worst case of hackbench testing just drops
about 2% with the powersaving policy. No clear change for the performance
policy.
The main function in this patch is get_cpu_for_power_policy(), which
tries to get the idlest cpu from the busiest group that still has spare
utilization, if the system is using a power aware policy and such a
group exists.
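The call flow added below is roughly:
	select_task_rq_fair()
	  get_cpu_for_power_policy()
	    get_sd_sched_balance_policy()	/* falls back to PERFORMANCE if the
						   policy is performance or the
						   domain is already full */
	      is_sd_full()	/* sums max_rq_util() of each group, records the
				   fullest-but-not-full group as group_leader */
	    find_idlest_cpu(sds->group_leader, p, cpu)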
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 109 +++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 103 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8ac021f..83b2c39 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3368,25 +3368,113 @@ static unsigned int max_rq_util(int cpu)
}
/*
- * sched_balance_self: balance the current task (running on cpu) in domains
+ * Try to collect the task running number and capacity of the group.
+ */
+static void get_sg_power_stats(struct sched_group *group,
+ struct sched_domain *sd, struct sg_lb_stats *sgs)
+{
+ int i;
+
+ for_each_cpu(i, sched_group_cpus(group))
+ sgs->group_util += max_rq_util(i);
+
+ sgs->group_weight = group->group_weight;
+}
+
+/*
+ * Is this domain full of utilization with the task?
+ */
+static int is_sd_full(struct sched_domain *sd,
+ struct task_struct *p, struct sd_lb_stats *sds)
+{
+ struct sched_group *group;
+ struct sg_lb_stats sgs;
+ long sd_min_delta = LONG_MAX;
+ unsigned int putil;
+
+ if (p->se.load.weight == p->se.avg.load_avg_contrib)
+ /* p maybe a new forked task */
+ putil = FULL_UTIL;
+ else
+ putil = (u64)(p->se.avg.runnable_avg_sum << SCHED_POWER_SHIFT)
+ / (p->se.avg.runnable_avg_period + 1);
+
+ /* Try to collect the domain's utilization */
+ group = sd->groups;
+ do {
+ long g_delta;
+
+ memset(&sgs, 0, sizeof(sgs));
+ get_sg_power_stats(group, sd, &sgs);
+
+ g_delta = sgs.group_weight * FULL_UTIL - sgs.group_util;
+
+ if (g_delta > 0 && g_delta < sd_min_delta) {
+ sd_min_delta = g_delta;
+ sds->group_leader = group;
+ }
+
+ sds->sd_util += sgs.group_util;
+ } while (group = group->next, group != sd->groups);
+
+ if (sds->sd_util + putil < sd->span_weight * FULL_UTIL)
+ return 0;
+
+ /* can not hold one more task in this domain */
+ return 1;
+}
+
+/*
+ * Execute power policy if this domain is not full.
+ */
+static inline int get_sd_sched_balance_policy(struct sched_domain *sd,
+ int cpu, struct task_struct *p, struct sd_lb_stats *sds)
+{
+ if (sched_balance_policy == SCHED_POLICY_PERFORMANCE)
+ return SCHED_POLICY_PERFORMANCE;
+
+ memset(sds, 0, sizeof(*sds));
+ if (is_sd_full(sd, p, sds))
+ return SCHED_POLICY_PERFORMANCE;
+ return sched_balance_policy;
+}
+
+/*
+ * If power policy is eligible for this domain, and it has task allowed cpu.
+ * we will select CPU from this domain.
+ */
+static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
+ struct task_struct *p, struct sd_lb_stats *sds)
+{
+ int policy;
+ int new_cpu = -1;
+
+ policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
+ if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
+ new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+
+ return new_cpu;
+}
+
+/*
+ * select_task_rq_fair: balance the current task (running on cpu) in domains
* that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
* SD_BALANCE_EXEC.
*
- * Balance, ie. select the least loaded group.
- *
* Returns the target CPU number, or the same CPU if no balancing is needed.
*
* preempt must be disabled.
*/
static int
-select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
int prev_cpu = task_cpu(p);
int new_cpu = cpu;
int want_affine = 0;
- int sync = wake_flags & WF_SYNC;
+ int sync = flags & WF_SYNC;
+ struct sd_lb_stats sds;
if (p->nr_cpus_allowed == 1)
return prev_cpu;
@@ -3412,11 +3500,20 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
break;
}
- if (tmp->flags & sd_flag)
+ if (tmp->flags & sd_flag) {
sd = tmp;
+
+ new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+ if (new_cpu != -1)
+ goto unlock;
+ }
}
if (affine_sd) {
+ new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+ if (new_cpu != -1)
+ goto unlock;
+
if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
prev_cpu = cpu;
--
1.7.12
The cpu's utilization measures how busy the cpu is:
util = cpu_rq(cpu)->avg.runnable_avg_sum * SCHED_POWER_SCALE
/ cpu_rq(cpu)->avg.runnable_avg_period;
Since the raw ratio is no more than 1, we scale its value by 1024, the same
as SCHED_POWER_SCALE, and set FULL_UTIL to 1024.
The later power aware scheduling is sensitive to how busy the cpu is, since
power consumption is tightly related to cpu busy time.
BTW, rq->util can be used for any purpose if needed, not only power
scheduling.
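As a quick worked example with made-up sample values: if runnable_avg_sum is
23 and runnable_avg_period is 46, i.e. the cpu was runnable half of the
tracked time, then util = 23 * 1024 / 46 = 512, which is 50% of
FULL_UTIL (1024).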
Signed-off-by: Alex Shi <[email protected]>
---
include/linux/sched.h | 2 +-
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 5 +++++
kernel/sched/sched.h | 4 ++++
4 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5a4cf37..226a515 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -793,7 +793,7 @@ enum cpu_idle_type {
#define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT)
/*
- * Increase resolution of cpu_power calculations
+ * Increase resolution of cpu_power and rq->util calculations
*/
#define SCHED_POWER_SHIFT 10
#define SCHED_POWER_SCALE (1L << SCHED_POWER_SHIFT)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 75024a6..f5db759 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -311,6 +311,7 @@ do { \
P(ttwu_count);
P(ttwu_local);
+ P(util);
#undef P
#undef P64
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 90106ad..a165f52 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1495,8 +1495,13 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
{
+ u32 period;
__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
__update_tg_runnable_avg(&rq->avg, &rq->cfs);
+
+ period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
+ rq->util = (u64)(rq->avg.runnable_avg_sum << SCHED_POWER_SHIFT)
+ / period;
}
/* Add the load generated by se into cfs_rq's child load-average */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 804ee41..8682110 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -351,6 +351,9 @@ extern struct root_domain def_root_domain;
#endif /* CONFIG_SMP */
+/* full cpu utilization */
+#define FULL_UTIL SCHED_POWER_SCALE
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -482,6 +485,7 @@ struct rq {
#endif
struct sched_avg avg;
+ unsigned int util;
};
static inline int cpu_of(struct rq *rq)
--
1.7.12
rq->avg_idle is 'used to accommodate bursty loads in a dirt simple
dirt cheap manner' -- Mike Galbraith.
With this cheap and smart bursty indicator, we can detect the wakeup
burst and then just use nr_running as the instant utilization.
The 'sysctl_sched_burst_threshold' is used for wakeup burst detection; set
it to double the sysctl_sched_migration_cost.
Signed-off-by: Alex Shi <[email protected]>
---
include/linux/sched/sysctl.h | 1 +
kernel/sched/fair.c | 1 +
kernel/sysctl.c | 7 +++++++
3 files changed, 9 insertions(+)
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..a3c3d43 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -53,6 +53,7 @@ extern unsigned int sysctl_numa_balancing_settle_count;
#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_migration_cost;
+extern unsigned int sysctl_sched_burst_threshold;
extern unsigned int sysctl_sched_nr_migrate;
extern unsigned int sysctl_sched_time_avg;
extern unsigned int sysctl_timer_migration;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0feeaee..8ac021f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -91,6 +91,7 @@ unsigned int sysctl_sched_wakeup_granularity = 1000000UL;
unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;
const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
+const_debug unsigned int sysctl_sched_burst_threshold = 1000000UL;
/*
* The exponential sliding window over which load is averaged for shares
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afc1dc6..1f23457 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -327,6 +327,13 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
{
+ .procname = "sched_burst_threshold_ns",
+ .data = &sysctl_sched_burst_threshold,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "sched_nr_migrate",
.data = &sysctl_sched_nr_migrate,
.maxlen = sizeof(unsigned int),
--
1.7.12
The current scheduler behaviour only considers system performance, so it
tries to spread tasks onto more cpu sockets and cpu cores.
To add the consideration of power awareness, this patch set adds a
powersaving scheduler policy. It will use the runnable load utilization in
scheduler balancing. The current scheduling is taken as the performance
policy.
performance: the current scheduling behaviour, try to spread tasks
onto more CPU sockets or cores; performance oriented.
powersaving: pack tasks into fewer sched groups until all LCPUs in the
group are full; power oriented.
The incoming patches will enable powersaving scheduling in CFS.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 3 +++
kernel/sched/sched.h | 5 +++++
2 files changed, 8 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 026e959..61ee178 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6093,6 +6093,9 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
return rr_interval;
}
+/* The default scheduler policy is 'performance'. */
+int __read_mostly sched_balance_policy = SCHED_POLICY_PERFORMANCE;
+
/*
* All the scheduling class methods:
*/
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7f36024f..804ee41 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -10,6 +10,11 @@
extern __read_mostly int scheduler_running;
+#define SCHED_POLICY_PERFORMANCE (0x1)
+#define SCHED_POLICY_POWERSAVING (0x2)
+
+extern int __read_mostly sched_balance_policy;
+
/*
* Convert user-nice values [ -20 ... 0 ... 19 ]
* to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
--
1.7.12
The old function counted the runnable avg against the rq's nr_running even
when there are only rt tasks in the rq. That is incorrect, so correct it to
use the cfs_rq's nr_running.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2881d42..026e959 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2829,7 +2829,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
}
if (!se) {
- update_rq_runnable_avg(rq, rq->nr_running);
+ update_rq_runnable_avg(rq, rq->cfs.nr_running);
inc_nr_running(rq);
}
hrtick_update(rq);
--
1.7.12
Since rt task priority is higher than that of fair tasks, the cfs_rq
utilization is just what is left after the rt utilization.
When there are cfs tasks in the queue, more utilization may potentially be
produced, so multiply the cfs utilization by the cfs task number to get the
maximum potential utilization of cfs. The rq utilization is then the sum of
the rt util and the cfs util.
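A worked example with made-up numbers: with rt_util = 200, rq->util = 300 and
rq->nr_running = 2, max_rq_util() below takes
	cfs_util = min(FULL_UTIL - rt_util, rq->util) = min(824, 300) = 300
and returns 200 + 300 * 2 = 800 as the potential maximum utilization.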
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ae87dab..0feeaee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3350,6 +3350,22 @@ struct sg_lb_stats {
unsigned int group_util; /* sum utilization of group */
};
+static unsigned long scale_rt_util(int cpu);
+
+static unsigned int max_rq_util(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ unsigned int rt_util = scale_rt_util(cpu);
+ unsigned int cfs_util;
+ unsigned int nr_running;
+
+ cfs_util = (FULL_UTIL - rt_util) > rq->util ? rq->util
+ : (FULL_UTIL - rt_util);
+ nr_running = rq->nr_running ? rq->nr_running : 1;
+
+ return rt_util + cfs_util * nr_running;
+}
+
/*
* sched_balance_self: balance the current task (running on cpu) in domains
* that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
--
1.7.12
We need to initialize se.avg.{decay_count, load_avg_contrib} for a
new forked task.
Otherwise the random values in these variables cause a mess when the new
task is enqueued:
	enqueue_task_fair
		enqueue_entity
			enqueue_entity_load_avg
and make fork balancing imbalanced due to the incorrect load_avg_contrib.
Set avg.decay_count = 0 and avg.load_avg_contrib = se->load.weight to
resolve such issues.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 6 ++++++
kernel/sched/fair.c | 4 ++++
2 files changed, 10 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 54eaaa2..8843cd3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1564,6 +1564,7 @@ static void __sched_fork(struct task_struct *p)
#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
+ p->se.avg.decay_count = 0;
#endif
#ifdef CONFIG_SCHEDSTATS
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
@@ -1651,6 +1652,11 @@ void sched_fork(struct task_struct *p)
p->sched_reset_on_fork = 0;
}
+ /* New forked task assumed with full utilization */
+#if defined(CONFIG_SMP)
+ p->se.avg.load_avg_contrib = p->se.load.weight;
+#endif
+
if (!rt_prio(p->prio))
p->sched_class = &fair_sched_class;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c2f726..2881d42 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1508,6 +1508,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
* We track migrations using entity decay_count <= 0, on a wake-up
* migration we use a negative decay count to track the remote decays
* accumulated while sleeping.
+ *
+ * When enqueue a new forked task, the se->avg.decay_count == 0, so
+ * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
+ * value: se->load.weight.
*/
if (unlikely(se->avg.decay_count <= 0)) {
se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
--
1.7.12
scale_rt_power() used to represent the CPU utilization left after
subtracting the rt utilization, so the name scale_rt_power is a bit
inappropriate.
Since we need the rt utilization in some incoming patches, just change the
return value of this function to the rt utilization and rename it
scale_rt_util(). Its usage in update_cpu_power() is changed accordingly.
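For example (assumed value): if scale_rt_util(cpu) returns 256, i.e. rt work
consumes about 25% of the cpu, update_cpu_power() below now scales cpu power
by SCHED_POWER_SCALE - 256 = 768, which matches what the old
scale_rt_power() returned as available/total for the same rt load.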
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 28bff43..ae87dab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4277,10 +4277,10 @@ unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu)
return default_scale_smt_power(sd, cpu);
}
-unsigned long scale_rt_power(int cpu)
+unsigned long scale_rt_util(int cpu)
{
struct rq *rq = cpu_rq(cpu);
- u64 total, available, age_stamp, avg;
+ u64 total, age_stamp, avg;
/*
* Since we're reading these variables without serialization make sure
@@ -4292,10 +4292,8 @@ unsigned long scale_rt_power(int cpu)
total = sched_avg_period() + (rq->clock - age_stamp);
if (unlikely(total < avg)) {
- /* Ensures that power won't end up being negative */
- available = 0;
- } else {
- available = total - avg;
+ /* Ensures rt utilization won't beyond full scaled value */
+ return SCHED_POWER_SCALE;
}
if (unlikely((s64)total < SCHED_POWER_SCALE))
@@ -4303,7 +4301,7 @@ unsigned long scale_rt_power(int cpu)
total >>= SCHED_POWER_SHIFT;
- return div_u64(available, total);
+ return div_u64(avg, total);
}
static void update_cpu_power(struct sched_domain *sd, int cpu)
@@ -4330,7 +4328,7 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
power >>= SCHED_POWER_SHIFT;
- power *= scale_rt_power(cpu);
+ power *= SCHED_POWER_SCALE - scale_rt_util(cpu);
power >>= SCHED_POWER_SHIFT;
if (!power)
--
1.7.12
This patch adds the power aware scheduler knob into sysfs:
$cat /sys/devices/system/cpu/sched_balance_policy/available_sched_balance_policy
performance powersaving
$cat /sys/devices/system/cpu/sched_balance_policy/current_sched_balance_policy
powersaving
This means the sched balance policy currently in use is 'powersaving'.
Users can change the policy with the 'echo' command:
echo performance > /sys/devices/system/cpu/sched_balance_policy/current_sched_balance_policy
Signed-off-by: Alex Shi <[email protected]>
---
Documentation/ABI/testing/sysfs-devices-system-cpu | 22 +++++++
kernel/sched/fair.c | 69 ++++++++++++++++++++++
2 files changed, 91 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 9c978dc..b602882 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -53,6 +53,28 @@ Description: Dynamic addition and removal of CPU's. This is not hotplug
the system. Information writtento the file to remove CPU's
is architecture specific.
+What: /sys/devices/system/cpu/sched_balance_policy/current_sched_balance_policy
+ /sys/devices/system/cpu/sched_balance_policy/available_sched_balance_policy
+Date: Oct 2012
+Contact: Linux kernel mailing list <[email protected]>
+Description: CFS balance policy show and set interface.
+
+ available_sched_balance_policy: shows there are 2 kinds of
+ policies:
+ performance powersaving.
+ current_sched_balance_policy: shows current scheduler policy.
+ User can change the policy by writing it.
+
+ Policy decides the CFS scheduler how to balance tasks onto
+ different CPU unit.
+
+ performance: try to spread tasks onto more CPU sockets,
+ more CPU cores. performance oriented.
+
+ powersaving: try to pack tasks onto same core or same CPU
+ until every LCPUs are busy in the core or CPU socket.
+ powersaving oriented.
+
What: /sys/devices/system/cpu/cpu#/node
Date: October 2009
Contact: Linux memory management mailing list <[email protected]>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61ee178..90106ad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6096,6 +6096,75 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
/* The default scheduler policy is 'performance'. */
int __read_mostly sched_balance_policy = SCHED_POLICY_PERFORMANCE;
+#ifdef CONFIG_SYSFS
+static ssize_t show_available_sched_balance_policy(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "performance powersaving\n");
+}
+
+static ssize_t show_current_sched_balance_policy(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ if (sched_balance_policy == SCHED_POLICY_PERFORMANCE)
+ return sprintf(buf, "performance\n");
+ else if (sched_balance_policy == SCHED_POLICY_POWERSAVING)
+ return sprintf(buf, "powersaving\n");
+ return 0;
+}
+
+static ssize_t set_sched_balance_policy(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t count)
+{
+ int ret;
+ char str_policy[16];
+
+ ret = sscanf(buf, "%15s", str_policy);
+ if (ret != 1)
+ return -EINVAL;
+
+ if (!strcmp(str_policy, "performance"))
+ sched_balance_policy = SCHED_POLICY_PERFORMANCE;
+ else if (!strcmp(str_policy, "powersaving"))
+ sched_balance_policy = SCHED_POLICY_POWERSAVING;
+ else
+ return -EINVAL;
+
+ return count;
+}
+
+/*
+ * Sysfs setup bits:
+ */
+static DEVICE_ATTR(current_sched_balance_policy, 0644,
+ show_current_sched_balance_policy, set_sched_balance_policy);
+
+static DEVICE_ATTR(available_sched_balance_policy, 0444,
+ show_available_sched_balance_policy, NULL);
+
+static struct attribute *sched_balance_policy_default_attrs[] = {
+ &dev_attr_current_sched_balance_policy.attr,
+ &dev_attr_available_sched_balance_policy.attr,
+ NULL
+};
+static struct attribute_group sched_balance_policy_attr_group = {
+ .attrs = sched_balance_policy_default_attrs,
+ .name = "sched_balance_policy",
+};
+
+int __init create_sysfs_sched_balance_policy_group(struct device *dev)
+{
+ return sysfs_create_group(&dev->kobj, &sched_balance_policy_attr_group);
+}
+
+static int __init sched_balance_policy_sysfs_init(void)
+{
+ return create_sysfs_sched_balance_policy_group(cpu_subsys.dev_root);
+}
+
+core_initcall(sched_balance_policy_sysfs_init);
+#endif /* CONFIG_SYSFS */
+
/*
* All the scheduling class methods:
*/
--
1.7.12
On 03/30/2013 10:34 PM, Alex Shi wrote:
[snip]
>
> Some performance testing results:
> ---------------------------------
>
> Tested benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
> hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
> loopback netperf. on my core2, nhm, wsm, snb, platforms.
Hi, Alex
I've tested the patch on my 12 cpu X86 box with 3.9.0-rc2; here are the
results of pgbench (a rough test with little fluctuation):
base performance powersaving
| db_size | clients | tps | | tps | | tps |
+---------+---------+-------+ +-------+ +-------+
| 22 MB | 1 | 10662 | | 10497 | | 10124 |
| 22 MB | 2 | 21483 | | 21398 | | 17400 |
| 22 MB | 4 | 42046 | | 41974 | | 33473 |
| 22 MB | 8 | 55807 | | 53504 | | 45320 |
| 22 MB | 12 | 50768 | | 49657 | | 47469 |
| 22 MB | 16 | 49880 | | 49189 | | 48328 |
| 22 MB | 24 | 45904 | | 45870 | | 44756 |
| 22 MB | 32 | 43420 | | 44183 | | 43552 |
| 7484 MB | 1 | 7965 | | 9045 | | 8221 |
| 7484 MB | 2 | 19354 | | 19593 | | 14525 |
| 7484 MB | 4 | 37552 | | 37459 | | 28348 |
| 7484 MB | 8 | 48655 | | 46974 | | 42360 |
| 7484 MB | 12 | 45778 | | 45410 | | 43800 |
| 7484 MB | 16 | 45659 | | 44303 | | 42265 |
| 7484 MB | 24 | 42192 | | 40571 | | 39197 |
| 7484 MB | 32 | 36385 | | 36535 | | 36066 |
| 15 GB | 1 | 7677 | | 7362 | | 8075 |
| 15 GB | 2 | 19227 | | 19033 | | 14796 |
| 15 GB | 4 | 37335 | | 37186 | | 28923 |
| 15 GB | 8 | 48130 | | 50232 | | 42281 |
| 15 GB | 12 | 45393 | | 44266 | | 42763 |
| 15 GB | 16 | 45110 | | 43973 | | 42647 |
| 15 GB | 24 | 41415 | | 39389 | | 38844 |
| 15 GB | 32 | 35988 | | 36175 | | 35247 |
For the performance policy, it's a bit of a win here and a bit of a loss
there; given the small variance, I think there is at least no regression.
But the powersaving policy suffered some regression at the low end. Is that
the sacrifice we are supposed to make for power saving?
Regards,
Michael Wang
On 04/01/2013 01:05 PM, Michael Wang wrote:
>
> For the performance policy, it's a bit of a win here and a bit of a loss
> there; given the small variance, I think there is at least no regression.
>
> But the powersaving policy suffered some regression at the low end. Is
> that the sacrifice we are supposed to make for power saving?
Thanks for testing!
Yes, depending on the benchmark, the powersaving policy may lose some
performance.
--
Thanks Alex
BTW, the patch set passed fengguang's 0day kbuild testing.
Thanks fengguang!
Alex
Hi Alex,
The patch below has an issue in select_task_rq_fair(), which I have
addressed in the fix below.
On 03/30/2013 08:04 PM, Alex Shi wrote:
> This patch adds power aware scheduling in fork/exec/wake. It tries to
> select a cpu from the busiest group that still has spare utilization.
> That will save power since it leaves more groups idle in the system.
>
> The trade off is adding power aware statistics collection to the group
> seeking. But since the collection only happens when power scheduling is
> eligible, the worst case of hackbench testing only drops about 2% with
> the powersaving policy. No clear change for the performance policy.
>
> The main function in this patch is get_cpu_for_power_policy(), which
> tries to get the idlest cpu from the busiest group that still has spare
> utilization, if the system is using a power aware policy and such a
> group exists.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/fair.c | 109 +++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 103 insertions(+), 6 deletions(-)
> +/*
> + * select_task_rq_fair: balance the current task (running on cpu) in domains
> * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
> * SD_BALANCE_EXEC.
> *
> - * Balance, ie. select the least loaded group.
> - *
> * Returns the target CPU number, or the same CPU if no balancing is needed.
> *
> * preempt must be disabled.
> */
> static int
> -select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> +select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
> {
> struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
> int cpu = smp_processor_id();
> int prev_cpu = task_cpu(p);
> int new_cpu = cpu;
> int want_affine = 0;
> - int sync = wake_flags & WF_SYNC;
> + int sync = flags & WF_SYNC;
> + struct sd_lb_stats sds;
>
> if (p->nr_cpus_allowed == 1)
> return prev_cpu;
> @@ -3412,11 +3500,20 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> break;
> }
>
> - if (tmp->flags & sd_flag)
> + if (tmp->flags & sd_flag) {
> sd = tmp;
> +
> + new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
> + if (new_cpu != -1)
> + goto unlock;
> + }
> }
>
> if (affine_sd) {
> + new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
> + if (new_cpu != -1)
> + goto unlock;
> +
> if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> prev_cpu = cpu;
>
sched: Fix power aware scheduling in fork/wake/exec
From: Preeti U Murthy <[email protected]>
Problem:
select_task_rq_fair() returns a target CPU, or the waking CPU if no balancing
is required. However, with the current power aware scheduling in this path,
an invalid CPU might be returned.
If get_cpu_for_power_policy() fails to find a new_cpu for the forked task,
new_cpu can remain -1 until the end of select_task_rq_fair(), if the search
for a new cpu later in this function also fails. Since this scenario is
unexpected by the callers of select_task_rq_fair(), it needs to be fixed.
Fix:
Do not intermix the variables meant to reflect the target CPU of power save
and performance policies. If the target CPU of powersave is successful in being
found, return it. Else allow the performance policy to take a call on the
target CPU.
The above scenario was caught when a kernel crash with a bad data access
interrupt occurred during a kernbench run on a 2 socket, 16 core machine,
with each core having SMT-4.
---
kernel/sched/fair.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56b96a7..54d4400 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3532,6 +3532,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
int cpu = smp_processor_id();
int prev_cpu = task_cpu(p);
int new_cpu = cpu;
+ int new_cpu_for_power = -1;
int want_affine = 0;
int sync = flags & WF_SYNC;
struct sd_lb_stats sds;
@@ -3563,16 +3564,16 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
if (tmp->flags & sd_flag) {
sd = tmp;
- new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
+ new_cpu_for_power = get_cpu_for_power_policy(sd, cpu, p, &sds,
flags & SD_BALANCE_FORK);
- if (new_cpu != -1)
+ if (new_cpu_for_power != -1)
goto unlock;
}
}
if (affine_sd) {
- new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 0);
- if (new_cpu != -1)
+ new_cpu_for_power = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 0);
+ if (new_cpu_for_power != -1)
goto unlock;
if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
@@ -3622,8 +3623,10 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
}
unlock:
rcu_read_unlock();
+ if (new_cpu_for_power == -1)
+ return new_cpu;
- return new_cpu;
+ return new_cpu_for_power;
}
/*
Regards
Preeti U Murthy
On 04/01/2013 05:50 PM, Preeti U Murthy wrote:
> sched: Fix power aware scheduling in fork/wake/exec
>
> From: Preeti U Murthy <[email protected]>
Good catch! Thanks Preeti!
Acked-by: Alex Shi <[email protected]>
--
Thanks
Alex
Hi Alex,
On Sat, 30 Mar 2013 22:34:57 +0800, Alex Shi wrote:
> Since the rt task priority is higher than fair tasks, cfs_rq utilization
> is just the left of rt utilization.
>
> When there are some cfs tasks in queue, the potential utilization may
> be yielded, so mulitiplying cfs task number to get max potential
> utilization of cfs. Then the rq utilization is sum of rt util and cfs
> util.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/fair.c | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ae87dab..0feeaee 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3350,6 +3350,22 @@ struct sg_lb_stats {
> unsigned int group_util; /* sum utilization of group */
> };
>
> +static unsigned long scale_rt_util(int cpu);
> +
> +static unsigned int max_rq_util(int cpu)
> +{
> + struct rq *rq = cpu_rq(cpu);
> + unsigned int rt_util = scale_rt_util(cpu);
> + unsigned int cfs_util;
> + unsigned int nr_running;
> +
> + cfs_util = (FULL_UTIL - rt_util) > rq->util ? rq->util
> + : (FULL_UTIL - rt_util);
> + nr_running = rq->nr_running ? rq->nr_running : 1;
This can be cleaned up with proper min/max().
> +
> + return rt_util + cfs_util * nr_running;
Should this nr_running consider tasks in cfs_rq only? Also it seems
there's no upper bound so that it can possibly exceed FULL_UTIL.
Thanks,
Namhyung
> +}
> +
> /*
> * sched_balance_self: balance the current task (running on cpu) in domains
> * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
>> + nr_running = rq->nr_running ? rq->nr_running : 1;
>
> This can be cleaned up with proper min/max().
yes, thanks
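Roughly, the cleanup would be (just a sketch, using the kernel's min()/max()
macros and the FULL_UTIL scale from this series):

	cfs_util = min(rq->util, (unsigned int)(FULL_UTIL - rt_util));
	nr_running = max(rq->nr_running, 1U);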
>
>> +
>> + return rt_util + cfs_util * nr_running;
>
> Should this nr_running consider tasks in cfs_rq only? Also it seems
> there's no upper bound so that it can possibly exceed FULL_UTIL.
Yes.
We need to consider the task number (nr_running), otherwise the cpu
utilization can never go beyond FULL_UTIL and we cannot tell whether the
cpu is overloaded. So we need the possible maximum utilization to be able
to exceed FULL_UTIL.
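As a quick worked example, taking FULL_UTIL as the full-scale value on the
SCHED_POWER_SCALE scale (1024): a cpu with rt_util = 0, rq->util = 1024 and
3 runnable cfs tasks gives 0 + 1024 * 3 = 3072, i.e. 300%, so the cpu is
clearly seen as overloaded, while a single task using half the cpu gives
0 + 512 * 1 = 512 and the cpu still shows headroom.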
>
--
Thanks
Alex
On 30 March 2013 15:34, Alex Shi <[email protected]> wrote:
> Old function count the runnable avg on rq's nr_running even there is
> only rt task in rq. That is incorrect, so correct it to cfs_rq's
> nr_running.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/fair.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2881d42..026e959 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2829,7 +2829,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> }
>
> if (!se) {
> - update_rq_runnable_avg(rq, rq->nr_running);
> + update_rq_runnable_avg(rq, rq->cfs.nr_running);
An RT task that preempts your CFS task will be accounted in the
runnable_avg fields. So whatever you do, RT tasks will impact your
runnable_avg statistics. Instead of trying to count only CFS tasks, you
should take into account all task activity in the rq.
Vincent
> inc_nr_running(rq);
> }
> hrtick_update(rq);
> --
> 1.7.12
>
On 30 March 2013 15:34, Alex Shi <[email protected]> wrote:
> Since the rt task priority is higher than fair tasks, cfs_rq utilization
> is just the left of rt utilization.
>
> When there are some cfs tasks in queue, the potential utilization may
> be yielded, so mulitiplying cfs task number to get max potential
> utilization of cfs. Then the rq utilization is sum of rt util and cfs
> util.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/fair.c | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ae87dab..0feeaee 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3350,6 +3350,22 @@ struct sg_lb_stats {
> unsigned int group_util; /* sum utilization of group */
> };
>
> +static unsigned long scale_rt_util(int cpu);
> +
> +static unsigned int max_rq_util(int cpu)
> +{
> + struct rq *rq = cpu_rq(cpu);
> + unsigned int rt_util = scale_rt_util(cpu);
> + unsigned int cfs_util;
> + unsigned int nr_running;
> +
> + cfs_util = (FULL_UTIL - rt_util) > rq->util ? rq->util
> + : (FULL_UTIL - rt_util);
rt_util and rq->util don't use the same computation algorithm, so the
results are hardly comparable or addable. In addition, some RT tasks
may have impacted rq->util, so they will be accounted on both
sides.
Vincent
> + nr_running = rq->nr_running ? rq->nr_running : 1;
> +
> + return rt_util + cfs_util * nr_running;
> +}
> +
> /*
> * sched_balance_self: balance the current task (running on cpu) in domains
> * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
> --
> 1.7.12
>
On 04/02/2013 10:30 PM, Vincent Guittot wrote:
> On 30 March 2013 15:34, Alex Shi <[email protected]> wrote:
>> Old function count the runnable avg on rq's nr_running even there is
>> only rt task in rq. That is incorrect, so correct it to cfs_rq's
>> nr_running.
>>
>> Signed-off-by: Alex Shi <[email protected]>
>> ---
>> kernel/sched/fair.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 2881d42..026e959 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2829,7 +2829,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> }
>>
>> if (!se) {
>> - update_rq_runnable_avg(rq, rq->nr_running);
>> + update_rq_runnable_avg(rq, rq->cfs.nr_running);
>
> A RT task that preempts your CFS task will be accounted in the
> runnable_avg fields. So whatever you do, RT task will impact your
> runnable_avg statistics. Instead of trying to get only CFS tasks, you
> should take into account all tasks activity in the rq.
Thanks for comments, Vincent!
Yes, I know some rt task time is counted into cfs, but for now we have no
good way to remove it cleanly. So I just want a slightly more precise
cfs runnable load here.
On the other side, the periodic load balance works on the combined cfs/rt
load, but removes the RT utilisation from cpu_power.
So, PJT, Peter, what's your opinion on this point?
>
> Vincent
>> inc_nr_running(rq);
>> }
>> hrtick_update(rq);
>> --
>> 1.7.12
>>
--
Thanks Alex
On 04/02/2013 10:38 PM, Vincent Guittot wrote:
>> +static unsigned int max_rq_util(int cpu)
>> > +{
>> > + struct rq *rq = cpu_rq(cpu);
>> > + unsigned int rt_util = scale_rt_util(cpu);
>> > + unsigned int cfs_util;
>> > + unsigned int nr_running;
>> > +
>> > + cfs_util = (FULL_UTIL - rt_util) > rq->util ? rq->util
>> > + : (FULL_UTIL - rt_util);
> rt_util and rq->util don't use the same computation algorithm so the
> results are hardly comparable or addable. In addition, some RT tasks
> can have impacted the rq->util, so they will be accounted in both
> side.
Thanks Vincent!
Yes, rt_util is calculated in a different way, but it has a very similar
meaning to rq->util, so comparing them makes sense.
And yes, rq->util and rt_util have some overlap, so we need to remove the
overlapping part, otherwise the total utilization of this cpu would go
beyond 100%, which makes no sense. Since RT tasks always have higher
priority than cfs tasks, here I keep the RT utilization and yield the cfs
utilization.
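For example, if rt tasks have taken about 30% of the cpu (rt_util around
0.3 * FULL_UTIL) while rq->util reports 90%, cfs_util is clamped to the
remaining 70%, so rt_util + cfs_util stays within FULL_UTIL for a single
task; only the nr_running multiplier in max_rq_util() can push the estimate
past 100%, and that is intentional.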
>
> Vincent
>
--
Thanks Alex
Nack:
Vincent is correct, rq->avg is supposed to be the average time that an
rq is runnable; this includes (for example) SCHED_RT.
It's intended to be more useful as a hint towards something like a
power governor which wants to know how busy the CPU is in general.
> On the other side, periodic LB balance on combined the cfs/rt load, but
> removed the RT utilisation in cpu_power.
This I don't quite understand; these inputs are already time scaled (by decay).
Stated alternatively, what you want is:
"average load" / "available power", which is:
(rq->cfs.runnable_load_avg + rq->cfs.blocked_load_avg) / (cpu power
scaled for rt)
Where do you propose mixing rq->avg into that?
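In code terms, that ratio is roughly the following (an illustrative sketch
only; the helper name is made up, and it assumes the per-entity load
tracking fields named above plus the rt-scaled power computed by
update_cpu_power()):

	static u64 avg_load_per_power(struct rq *rq, unsigned long rt_scaled_power)
	{
		/* "average load": runnable plus blocked load averages of the cfs_rq */
		u64 load = rq->cfs.runnable_load_avg + rq->cfs.blocked_load_avg;

		/* scale up before dividing so small loads keep some resolution */
		return div_u64(load << SCHED_POWER_SHIFT, rt_scaled_power);
	}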
On Tue, Apr 2, 2013 at 6:02 PM, Alex Shi <[email protected]> wrote:
> On 04/02/2013 10:30 PM, Vincent Guittot wrote:
>> On 30 March 2013 15:34, Alex Shi <[email protected]> wrote:
>>> Old function count the runnable avg on rq's nr_running even there is
>>> only rt task in rq. That is incorrect, so correct it to cfs_rq's
>>> nr_running.
>>>
>>> Signed-off-by: Alex Shi <[email protected]>
>>> ---
>>> kernel/sched/fair.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 2881d42..026e959 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -2829,7 +2829,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>> }
>>>
>>> if (!se) {
>>> - update_rq_runnable_avg(rq, rq->nr_running);
>>> + update_rq_runnable_avg(rq, rq->cfs.nr_running);
>>
>> A RT task that preempts your CFS task will be accounted in the
>> runnable_avg fields. So whatever you do, RT task will impact your
>> runnable_avg statistics. Instead of trying to get only CFS tasks, you
>> should take into account all tasks activity in the rq.
>
> Thanks for comments, Vincent!
>
> Yes, I know some rt task time was counted into cfs, but now we have no
> good idea to remove them clearly. So I just want to a bit more precise
> cfs runnable load here.
> On the other side, periodic LB balance on combined the cfs/rt load, but
> removed the RT utilisation in cpu_power.
>
> So, PJT, Peter, what's your idea of this point?
>>
>> Vincent
>>> inc_nr_running(rq);
>>> }
>>> hrtick_update(rq);
>>> --
>>> 1.7.12
>>>
>
>
> --
> Thanks Alex
On 04/03/2013 09:23 AM, Paul Turner wrote:
> Nack:
> Vincent is correct, rq->avg is supposed to be the average time that an
> rq is runnable; this includes (for example) SCHED_RT.
>
> It's intended to be more useful as a hint towards something like a
> power governor which wants to know how busy the CPU is in general.
Thanks PJT & Vincent, I agree with your point.
>
>> On the other side, periodic LB balance on combined the cfs/rt load, but
>> removed the RT utilisation in cpu_power.
>
> This I don't quite understand; these inputs are already time scaled (by decay).
>
> Stated alternatively, what you want is:
> "average load" / "available power", which is:
> (rq->cfs.runnable_load_avg + rq->cfs.blocked_load_avg) / (cpu power
> scaled for rt)
Right, understood.
--
Thanks Alex
On 04/02/2013 05:02 PM, Namhyung Kim wrote:
>> > + cfs_util = (FULL_UTIL - rt_util) > rq->util ? rq->util
>> > + : (FULL_UTIL - rt_util);
>> > + nr_running = rq->nr_running ? rq->nr_running : 1;
> This can be cleaned up with proper min/max().
>
>> > +
>> > + return rt_util + cfs_util * nr_running;
> Should this nr_running consider tasks in cfs_rq only?
Using the cfs_rq's nr_running seems better, but when sched autogroup is
used, cfs->nr_running is only the number of active groups, not the total
number of active tasks. :(
> Also it seems
> there's no upper bound so that it can possibly exceed FULL_UTIL.
--
Thanks Alex
On Tue, Apr 2, 2013 at 7:15 PM, Alex Shi <[email protected]> wrote:
> On 04/02/2013 05:02 PM, Namhyung Kim wrote:
>>> > + cfs_util = (FULL_UTIL - rt_util) > rq->util ? rq->util
>>> > + : (FULL_UTIL - rt_util);
>>> > + nr_running = rq->nr_running ? rq->nr_running : 1;
>> This can be cleaned up with proper min/max().
>>
>>> > +
>>> > + return rt_util + cfs_util * nr_running;
>> Should this nr_running consider tasks in cfs_rq only?
>
> use nr_running of cfs_rq seems better, but when use sched autogroup,
> only cfs->nr_running just the active group number, not the total active
> task number. :(
Why not just use cfs_rq->h_nr_running? This is always the total number of
*tasks* in the hierarchy parented by that cfs_rq. (This also has the nice
property of not including group_entities.)
>
> Also it seems
>> there's no upper bound so that it can possibly exceed FULL_UTIL.
>
>
> --
> Thanks Alex
On 04/03/2013 10:22 AM, Paul Turner wrote:
> On Tue, Apr 2, 2013 at 7:15 PM, Alex Shi <[email protected]> wrote:
>> On 04/02/2013 05:02 PM, Namhyung Kim wrote:
>>>>> + cfs_util = (FULL_UTIL - rt_util) > rq->util ? rq->util
>>>>> + : (FULL_UTIL - rt_util);
>>>>> + nr_running = rq->nr_running ? rq->nr_running : 1;
>>> This can be cleaned up with proper min/max().
>>>
>>>>> +
>>>>> + return rt_util + cfs_util * nr_running;
>>> Should this nr_running consider tasks in cfs_rq only?
>>
>> use nr_running of cfs_rq seems better, but when use sched autogroup,
>> only cfs->nr_running just the active group number, not the total active
>> task number. :(
>
> Why not just use cfs_rq->h_nr_running? This is always the total
> *tasks* in he hierarchy parented that cfs_rq. (This also has the nice property
> of not including group_entities.)
oh, yes, Thanks for the reminder! :)
>
>>
>> Also it seems
>>> there's no upper bound so that it can possibly exceed FULL_UTIL.
>>
>>
>> --
>> Thanks Alex
--
Thanks Alex
Hi Alex,
On Sat, 30 Mar 2013 22:35:00 +0800, Alex Shi wrote:
> Sleeping task has no utiliation, when they were bursty waked up, the
> zero utilization make scheduler out of balance, like aim7 benchmark.
>
> rq->avg_idle is 'to used to accommodate bursty loads in a dirt simple
> dirt cheap manner' -- Mike Galbraith.
>
> With this cheap and smart bursty indicator, we can find the wake up
> burst, and use nr_running as instant utilization in this scenario.
>
> For other scenarios, we still use the precise CPU utilization to
> judage if a domain is eligible for power scheduling.
>
> Thanks for Mike Galbraith's idea!
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/fair.c | 33 ++++++++++++++++++++++++++-------
> 1 file changed, 26 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 83b2c39..ae07190 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3371,12 +3371,19 @@ static unsigned int max_rq_util(int cpu)
> * Try to collect the task running number and capacity of the group.
> */
> static void get_sg_power_stats(struct sched_group *group,
> - struct sched_domain *sd, struct sg_lb_stats *sgs)
> + struct sched_domain *sd, struct sg_lb_stats *sgs, int burst)
> {
> int i;
>
> - for_each_cpu(i, sched_group_cpus(group))
> - sgs->group_util += max_rq_util(i);
> + for_each_cpu(i, sched_group_cpus(group)) {
> + struct rq *rq = cpu_rq(i);
> +
> + if (burst && rq->nr_running > 1)
> + /* use nr_running as instant utilization */
> + sgs->group_util += rq->nr_running;
I guess multiplying rq->nr_running by FULL_UTIL here would remove the
special-casing of the burst in is_sd_full(). Also, moving this logic to
max_rq_util() looks better IMHO.
> + else
> + sgs->group_util += max_rq_util(i);
> + }
>
> sgs->group_weight = group->group_weight;
> }
> @@ -3390,6 +3397,8 @@ static int is_sd_full(struct sched_domain *sd,
> struct sched_group *group;
> struct sg_lb_stats sgs;
> long sd_min_delta = LONG_MAX;
> + int cpu = task_cpu(p);
> + int burst = 0;
> unsigned int putil;
>
> if (p->se.load.weight == p->se.avg.load_avg_contrib)
> @@ -3399,15 +3408,21 @@ static int is_sd_full(struct sched_domain *sd,
> putil = (u64)(p->se.avg.runnable_avg_sum << SCHED_POWER_SHIFT)
> / (p->se.avg.runnable_avg_period + 1);
>
> + if (cpu_rq(cpu)->avg_idle < sysctl_sched_burst_threshold)
> + burst = 1;
Sorry, I don't understand this.
Given that sysctl_sched_burst_threshold is twice
sysctl_sched_migration_cost, i.e. the max value of rq->avg_idle, the
avg_idle will almost always be less than the threshold, right?
So how does it detect the burst case? I thought it's the case where a
cpu is idle for a while and then wakes a number of tasks at once. If
so, shouldn't it check whether the avg_idle is *longer* than a certain
threshold? What am I missing?
Thanks,
Namhyung
> +
> /* Try to collect the domain's utilization */
> group = sd->groups;
> do {
> long g_delta;
>
> memset(&sgs, 0, sizeof(sgs));
> - get_sg_power_stats(group, sd, &sgs);
> + get_sg_power_stats(group, sd, &sgs, burst);
>
> - g_delta = sgs.group_weight * FULL_UTIL - sgs.group_util;
> + if (burst)
> + g_delta = sgs.group_weight - sgs.group_util;
> + else
> + g_delta = sgs.group_weight * FULL_UTIL - sgs.group_util;
>
> if (g_delta > 0 && g_delta < sd_min_delta) {
> sd_min_delta = g_delta;
> @@ -3417,8 +3432,12 @@ static int is_sd_full(struct sched_domain *sd,
> sds->sd_util += sgs.group_util;
> } while (group = group->next, group != sd->groups);
>
> - if (sds->sd_util + putil < sd->span_weight * FULL_UTIL)
> - return 0;
> + if (burst) {
> + if (sds->sd_util < sd->span_weight)
> + return 0;
> + } else
> + if (sds->sd_util + putil < sd->span_weight * FULL_UTIL)
> + return 0;
>
> /* can not hold one more task in this domain */
> return 1;
>> + struct rq *rq = cpu_rq(i);
>> +
>> + if (burst && rq->nr_running > 1)
>> + /* use nr_running as instant utilization */
>> + sgs->group_util += rq->nr_running;
>
> I guess multiplying FULL_UTIL to rq->nr_running here will remove
> special-casing the burst in is_sd_full(). Also moving this logic to
> max_rq_util() looks better IMHO.
Yes, right! Thanks a lot!
>
>
>> + if (cpu_rq(cpu)->avg_idle < sysctl_sched_burst_threshold)
>> + burst = 1;
>
> Sorry, I don't understand this.
>
> Given that sysctl_sched_burst_threshold is twice of
> sysctl_sched_migration_cost which is max value of rq->avg_idle, the
> avg_idle will be almost always less than the threshold, right?
In fact, much of the time avg_idle sits at the max value, so we won't
always see a burst.
>
> So how does it find out the burst case? I thought it's the case of a
> cpu is in idle for a while and then wakes number of tasks at once.
Yes.
> If so, shouldn't it check whether the avg_idle is *longer* than certain
> threshold? What am I missing?
avg_idle gets smaller when many wakeups happen; update_avg() does not
always increase it.
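For reference, rq->avg_idle in kernels of that time is maintained in the
wakeup path roughly as below (paraphrased from kernel/sched/core.c, so treat
the details as approximate):

	/* in ttwu_do_wakeup(), when a wakeup lands on a cpu that had been idle */
	if (rq->idle_stamp) {
		u64 delta = rq->clock - rq->idle_stamp;
		u64 max = 2 * sysctl_sched_migration_cost;

		if (delta > max)
			rq->avg_idle = max;
		else
			update_avg(&rq->avg_idle, delta); /* avg += (delta - avg) >> 3 */
		rq->idle_stamp = 0;
	}

So a stream of wakeups after short idle periods quickly drags avg_idle below
sched_burst_threshold_ns, while a mostly idle cpu keeps it pinned at the
2 * sched_migration_cost ceiling.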
>
> Thanks,
> Namhyung
>
--
Thanks Alex
On 04/03/2013 10:22 AM, Paul Turner wrote:
> On Tue, Apr 2, 2013 at 7:15 PM, Alex Shi <[email protected]> wrote:
>> On 04/02/2013 05:02 PM, Namhyung Kim wrote:
>>>>> + cfs_util = (FULL_UTIL - rt_util) > rq->util ? rq->util
>>>>> + : (FULL_UTIL - rt_util);
>>>>> + nr_running = rq->nr_running ? rq->nr_running : 1;
>>> This can be cleaned up with proper min/max().
>>>
>>>>> +
>>>>> + return rt_util + cfs_util * nr_running;
>>> Should this nr_running consider tasks in cfs_rq only?
>>
>> use nr_running of cfs_rq seems better, but when use sched autogroup,
>> only cfs->nr_running just the active group number, not the total active
>> task number. :(
>
> Why not just use cfs_rq->h_nr_running? This is always the total
> *tasks* in he hierarchy parented that cfs_rq. (This also has the nice property
> of not including group_entities.)
>
Thanks for Namhyung and PJT's suggestions!
patch updated!
From 5f6fc3129784db5fb96b8bb7014fe41ee7e059c5 Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Sun, 24 Mar 2013 21:47:59 +0800
Subject: [PATCH 09/21] sched: get rq potential maximum utilization
Since rt tasks have higher priority than fair tasks, the cfs_rq utilization
is just what is left after the rt utilization.
When there are cfs tasks in the queue, the potential utilization can be
higher than the current one, so multiply by the cfs task number to get the
maximum potential utilization of cfs. The rq utilization is then the sum of
the rt util and the cfs util.
Thanks for Paul Turner and Namhyung's reminder!
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c47933f..70a99c9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3350,6 +3350,27 @@ struct sg_lb_stats {
unsigned int group_util; /* sum utilization of group */
};
+static unsigned long scale_rt_util(int cpu);
+
+/*
+ * max_rq_util - get the possible maximum cpu utilization
+ */
+static unsigned int max_rq_util(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ unsigned int rt_util = scale_rt_util(cpu);
+ unsigned int cfs_util;
+ unsigned int nr_running;
+
+ /* yield cfs utilization to rt's, if total utilization > 100% */
+ cfs_util = min(rq->util, (unsigned int)(FULL_UTIL - rt_util));
+
+ /* count transitory task utilization */
+ nr_running = max(rq->cfs.h_nr_running, (unsigned int)1);
+
+ return rt_util + cfs_util * nr_running;
+}
+
/*
* sched_balance_self: balance the current task (running on cpu) in domains
* that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
--
1.7.12
--
Thanks Alex
On 04/03/2013 01:08 PM, Namhyung Kim wrote:
>> > - for_each_cpu(i, sched_group_cpus(group))
>> > - sgs->group_util += max_rq_util(i);
>> > + for_each_cpu(i, sched_group_cpus(group)) {
>> > + struct rq *rq = cpu_rq(i);
>> > +
>> > + if (burst && rq->nr_running > 1)
>> > + /* use nr_running as instant utilization */
>> > + sgs->group_util += rq->nr_running;
> I guess multiplying FULL_UTIL to rq->nr_running here will remove
> special-casing the burst in is_sd_full(). Also moving this logic to
> max_rq_util() looks better IMHO.
>
>
Thanks Namhyung! patch updated.
From b6384f4e3294e19103f706195a95265faf1ea7ef Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Mon, 25 Mar 2013 16:07:39 +0800
Subject: [PATCH 12/21] sched: using avg_idle to detect bursty wakeup
A sleeping task has no utilization; when such tasks are woken up in a burst,
the zero utilization throws the scheduler out of balance, as seen in the
aim7 benchmark.
rq->avg_idle is 'used to accommodate bursty loads in a dirt simple
dirt cheap manner' -- Mike Galbraith.
With this cheap and smart bursty indicator, we can detect the wakeup
burst and use nr_running as the instant utilization in this scenario.
For other scenarios, we still use the precise CPU utilization to
judge if a domain is eligible for power scheduling.
Thanks for Mike Galbraith's idea!
Thanks to Namhyung for the suggestion to fold the burst handling into
max_rq_util()!
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f610313..a729939 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3363,6 +3363,10 @@ static unsigned int max_rq_util(int cpu)
unsigned int cfs_util;
unsigned int nr_running;
+ /* use nr_running as instant utilization for burst cpu */
+ if (cpu_rq(cpu)->avg_idle < sysctl_sched_burst_threshold)
+ return rq->nr_running * FULL_UTIL;
+
/* yield cfs utilization to rt's, if total utilization > 100% */
cfs_util = min(rq->util, (unsigned int)(FULL_UTIL - rt_util));
--
1.7.12
--
Thanks Alex
On 03/30/2013 10:34 PM, Alex Shi wrote:
The 'sysctl_sched_burst_threshold' is used for wakeup burst detection; since
we check every cpu's avg_idle, the value can be more precise, so set it to
0.1ms.
patch updated.
========
From 67192d87f9656c3e72bc108ef4648de55d470f4a Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Thu, 21 Mar 2013 14:23:00 +0800
Subject: [PATCH 11/21] sched: add sched_burst_threshold_ns as wakeup burst
indicator
rq->avg_idle is 'used to accommodate bursty loads in a dirt simple
dirt cheap manner' -- Mike Galbraith.
With this cheap and smart bursty indicator, we can detect the wakeup
burst and just use nr_running as the instant utilization.
'sysctl_sched_burst_threshold' is used for wakeup burst detection; since we
check every cpu's avg_idle, the value can be more precise, so set it to 0.1ms.
Signed-off-by: Alex Shi <[email protected]>
---
include/linux/sched/sysctl.h | 1 +
kernel/sched/fair.c | 1 +
kernel/sysctl.c | 7 +++++++
3 files changed, 9 insertions(+)
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..a3c3d43 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -53,6 +53,7 @@ extern unsigned int sysctl_numa_balancing_settle_count;
#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_migration_cost;
+extern unsigned int sysctl_sched_burst_threshold;
extern unsigned int sysctl_sched_nr_migrate;
extern unsigned int sysctl_sched_time_avg;
extern unsigned int sysctl_timer_migration;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c08773..f610313 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -91,6 +91,7 @@ unsigned int sysctl_sched_wakeup_granularity = 1000000UL;
unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;
const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
+const_debug unsigned int sysctl_sched_burst_threshold = 100000UL;
/*
* The exponential sliding window over which load is averaged for shares
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afc1dc6..1f23457 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -327,6 +327,13 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
{
+ .procname = "sched_burst_threshold_ns",
+ .data = &sysctl_sched_burst_threshold,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "sched_nr_migrate",
.data = &sysctl_sched_nr_migrate,
.maxlen = sizeof(unsigned int),
--
1.7.12
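With this patch applied, the threshold is exposed as
/proc/sys/kernel/sched_burst_threshold_ns alongside the other scheduler
sysctls, so the 0.1ms default can be checked or tuned at runtime:
$cat /proc/sys/kernel/sched_burst_threshold_ns
100000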
--
Thanks Alex
On 03/30/2013 10:34 PM, Alex Shi wrote:
> This patch set implement/consummate the rough power aware scheduling
> proposal: https://lkml.org/lkml/2012/8/13/139.
>
> The code also on this git tree:
> https://github.com/alexshi/power-scheduling.git power-scheduling
Many thanks to PJT, Vincent, Preeti, Namhyung and Michael Wang for the
testing and suggestions!
The patch set has been updated in the above git tree.
A few quick tests show the new code has slightly better performance on
aim7/hackbench with the powersaving policy while keeping the task packing
behaviour.
Thanks all!
>
On 03/30/2013 10:34 PM, Alex Shi wrote:
> This patch set implement/consummate the rough power aware scheduling
> proposal: https://lkml.org/lkml/2012/8/13/139.
BTW, this task packing feature triggers more cpu freq boost because some
cores are idle. And since cpu freq boost is more power efficient, it helps
performance/watt a lot, as the 16/32 thread kbuild results show:
> powersaving performance
> x = 2 189.416 /228 23 193.355 /209 24
> x = 4 215.728 /132 35 219.69 /122 37
> x = 8 244.31 /75 54 252.709 /68 58
> x = 16 299.915 /43 77 259.127 /58 66
> x = 32 341.221 /35 83 323.418 /38 81
>
> data explains: 189.416 /228 23
> 189.416: average Watts during compilation
> 228: seconds(compile time)
> 23: scaled performance/watts = 1000000 / seconds / watts
>
[snip]
--
Thanks
Alex