Many thanks to Namhyung, PJT, Vincent and Preeti for their comments and suggestions!
This version includes the following changes:
a, remove the old 3rd patch to restore runnable load avg recording on rt
b, check avg_idle on every cpu during a wakeup burst, not only on the waking CPU
c, fix the select_task_rq_fair() return -1 bug, from Preeti
--------------
This patch set implements and fleshes out the rough power aware scheduling
proposal: https://lkml.org/lkml/2012/8/13/139.
The code is also available on this git tree:
https://github.com/alexshi/power-scheduling.git power-scheduling
The patch set defines a new policy, 'powersaving', that tries to pack tasks
at each sched group level. This can save considerable power when the number
of tasks in the system is no more than the number of LCPUs.
As mentioned in the power aware scheduling proposal, power aware
scheduling has 2 assumptions:
1, race to idle is helpful for power saving
2, fewer active sched groups reduce cpu power consumption
The first assumption makes the performance policy take over scheduling when
any group is busy.
The second assumption makes power aware scheduling try to pack dispersed
tasks into fewer groups.
This feature leaves more cpu cores idle, which gives the active cores more
chances for cpu freq boost. CPU freq boost gives better performance and
better power efficiency. The following kbuild test results show this point.
Compared to the removed power balance, this power balance has the following
advantages:
1, simpler sysfs interface
only 2 sysfs files VS 2 files for each LCPU
2, covers all cpu topologies
effective at all domain levels VS only working on SMT/MC domains
3, less task migration
mutually exclusive perf/power LB VS balancing power on top of balanced performance
4, system load thrashing considered
yes VS no
5, transitory tasks considered
yes VS no
BTW, like sched numa, power aware scheduling is also a kind of cpu
locality oriented scheduling.
Thanks for the comments/suggestions from PeterZ, Linus Torvalds, Andrew Morton,
Ingo, Len Brown, Arjan, Borislav Petkov, PJT, Namhyung Kim, Mike
Galbraith, Greg, Preeti, Morten Rasmussen, Rafael etc.
Since the patch set can perfectly pack tasks into fewer groups, I just show
some performance/power testing data here:
=========================================
$for ((i = 0; i < x; i++)) ; do while true; do :; done & done
On my SNB laptop with 4 cores * HT; the data is average Watts:
powersaving performance
x = 8 72.9482 72.6702
x = 4 61.2737 66.7649
x = 2 44.8491 59.0679
x = 1 43.225 43.0638
on SNB EP machine with 2 sockets * 8 cores * HT:
powersaving performance
x = 32 393.062 395.134
x = 16 277.438 376.152
x = 8 209.33 272.398
x = 4 199 238.309
x = 2 175.245 210.739
x = 1 174.264 173.603
A benchmark whose task number keeps waving, 'make -j <x> vmlinux',
on my SNB EP 2-socket machine with 8 cores * HT:
powersaving performance
x = 2 189.416 /228 23 193.355 /209 24
x = 4 215.728 /132 35 219.69 /122 37
x = 8 244.31 /75 54 252.709 /68 58
x = 16 299.915 /43 77 259.127 /58 66
x = 32 341.221 /35 83 323.418 /38 81
How to read the data, e.g. 189.416 /228 23:
189.416: average Watts during compilation
228: seconds (compile time)
23: scaled performance/watts = 1000000 / seconds / watts
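For example, a quick sanity check of the first powersaving row:
1000000 / 228 / 189.416 ~= 23.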
The kbuild performance is better at 16/32 threads because the lazy power
balance reduces context switches and the CPU gets more boost chances under
powersaving balance.
Some performance testing results:
---------------------------------
Tested benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
hackbench, fileio-cfq of sysbench, dbench, aiostress, multi-threaded
loopback netperf, on my core2, nhm, wsm, snb platforms.
Results:
A, no clear performance change found with the 'performance' policy.
B, specjbb2005 drops 5~7% with the powersaving policy, whether running
on openjdk or jrockit.
C, hackbench drops 40% with the powersaving policy on snb 4-socket platforms.
Others show no clear change.
===
Changelog:
V7 change:
a, remove the old 3rd patch to restore runnable load avg recording on rt
b, check avg_idle on every cpu during a wakeup burst, not only on the waking CPU
c, fix the select_task_rq_fair() return -1 bug, from Preeti
V6 change:
a, remove 'balance' policy.
b, consider RT task effect in balancing
c, use avg_idle as burst wakeup indicator
d, balance on task utilization in fork/exec/wakeup.
e, no power balancing on SMT domain.
V5 change:
a, change sched_policy to sched_balance_policy
b, split fork/exec/wake power balancing into 3 patches and refresh
commit logs
c, other minor cleanups
V4 change:
a, fix a few bugs and clean up code according to comments from Morten
Rasmussen, Mike Galbraith and Namhyung Kim. Thanks!
b, take Morten Rasmussen's suggestion to use different criteria for
different policy in transitory task packing.
c, shorter latency in power aware scheduling.
V3 change:
a, use nr_running and utilization in periodic power balancing.
b, try packing small exec/wake tasks on a running cpu instead of an idle cpu.
V2 change:
a, add lazy power scheduling to deal with kbuild like benchmark.
-- Thanks Alex
[patch v7 01/21] Revert "sched: Introduce temporary FAIR_GROUP_SCHED
[patch v7 02/21] sched: set initial value of runnable avg for new
[patch v7 03/21] sched: add sched balance policies in kernel
[patch v7 04/21] sched: add sysfs interface for sched_balance_policy
[patch v7 05/21] sched: log the cpu utilization at rq
[patch v7 06/21] sched: add new sg/sd_lb_stats fields for incoming
[patch v7 07/21] sched: move sg/sd_lb_stats struct ahead
[patch v7 08/21] sched: scale_rt_power rename and meaning change
[patch v7 09/21] sched: get rq potential maximum utilization
[patch v7 10/21] sched: add power aware scheduling in fork/exec/wake
[patch v7 11/21] sched: add sched_burst_threshold_ns as wakeup burst
[patch v7 12/21] sched: using avg_idle to detect bursty wakeup
[patch v7 13/21] sched: packing transitory tasks in wakeup power
[patch v7 14/21] sched: add power/performance balance allow flag
[patch v7 15/21] sched: pull all tasks from source group
[patch v7 16/21] sched: no balance for prefer_sibling in power
[patch v7 17/21] sched: add new members of sd_lb_stats
[patch v7 18/21] sched: power aware load balance
[patch v7 19/21] sched: lazy power balance
[patch v7 20/21] sched: don't do power balance on share cpu power
[patch v7 21/21] sched: make sure select_task_rq_fair gets a cpu
Remove the CONFIG_FAIR_GROUP_SCHED guard that covers the runnable info, so
we can use the runnable load variables whenever SMP is enabled.
Signed-off-by: Alex Shi <[email protected]>
---
include/linux/sched.h | 7 +------
kernel/sched/core.c | 7 +------
kernel/sched/fair.c | 13 ++-----------
kernel/sched/sched.h | 9 +--------
4 files changed, 5 insertions(+), 31 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..5a4cf37 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1160,12 +1160,7 @@ struct sched_entity {
struct cfs_rq *my_q;
#endif
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/* Per-entity load-tracking */
struct sched_avg avg;
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f12624..54eaaa2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1561,12 +1561,7 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..9c2f726 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1109,8 +1109,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/*
* We choose a half-life close to 1 scheduling period.
* Note: The tables below are dependent on this value.
@@ -3394,12 +3393,6 @@ unlock:
}
/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
* Called immediately before a task is migrated to a new cpu; task_cpu(p) and
* cfs_rq_of(p) references at time of call are still valid and identify the
* previous cpu. However, the caller only guarantees p->pi_lock is held; no
@@ -3422,7 +3415,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
}
-#endif
#endif /* CONFIG_SMP */
static unsigned long
@@ -6114,9 +6106,8 @@ const struct sched_class fair_sched_class = {
#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq = migrate_task_rq_fair,
-#endif
+
.rq_online = rq_online_fair,
.rq_offline = rq_offline_fair,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..7f36024f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -227,12 +227,6 @@ struct cfs_rq {
#endif
#ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* CFS Load tracking
* Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -242,8 +236,7 @@ struct cfs_rq {
u64 runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
u32 tg_runnable_contrib;
u64 tg_load_contrib;
--
1.7.12
The current scheduler behavior only considers maximum system performance,
so it tries to spread tasks over more cpu sockets and cpu cores.
To add power awareness, the patch set introduces a powersaving scheduler
policy that uses runnable load utilization in scheduler balancing. The
current scheduling behavior is kept as the performance policy.
performance: the current scheduling behaviour, try to spread tasks
onto more CPU sockets or cores; performance oriented.
powersaving: pack tasks into fewer sched groups until every LCPU in the
group is full; power oriented.
The incoming patches will enable powersaving scheduling in CFS.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 3 +++
kernel/sched/sched.h | 5 +++++
2 files changed, 8 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2881d42..04dd319 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6093,6 +6093,9 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
return rr_interval;
}
+/* The default scheduler policy is 'performance'. */
+int __read_mostly sched_balance_policy = SCHED_POLICY_PERFORMANCE;
+
/*
* All the scheduling class methods:
*/
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7f36024f..804ee41 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -10,6 +10,11 @@
extern __read_mostly int scheduler_running;
+#define SCHED_POLICY_PERFORMANCE (0x1)
+#define SCHED_POLICY_POWERSAVING (0x2)
+
+extern int __read_mostly sched_balance_policy;
+
/*
* Convert user-nice values [ -20 ... 0 ... 19 ]
* to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
--
1.7.12
The cpu's utilization measures how busy the cpu is:
util = cpu_rq(cpu)->avg.runnable_avg_sum * SCHED_POWER_SCALE
/ cpu_rq(cpu)->avg.runnable_avg_period;
Since the raw utilization is no more than 1, we scale its value by 1024,
the same as SCHED_POWER_SCALE, and define FULL_UTIL as 1024.
The later power aware scheduling is sensitive to how busy the cpu is,
since power consumption is tightly related to cpu busy time.
BTW, rq->util can be used for any purpose if needed, not only power
scheduling.
Signed-off-by: Alex Shi <[email protected]>
---
include/linux/sched.h | 2 +-
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 5 +++++
kernel/sched/sched.h | 4 ++++
4 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5a4cf37..226a515 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -793,7 +793,7 @@ enum cpu_idle_type {
#define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT)
/*
- * Increase resolution of cpu_power calculations
+ * Increase resolution of cpu_power and rq->util calculations
*/
#define SCHED_POWER_SHIFT 10
#define SCHED_POWER_SCALE (1L << SCHED_POWER_SHIFT)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 75024a6..f5db759 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -311,6 +311,7 @@ do { \
P(ttwu_count);
P(ttwu_local);
+ P(util);
#undef P
#undef P64
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2e49c3f..7124244 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1495,8 +1495,13 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
{
+ u32 period;
__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
__update_tg_runnable_avg(&rq->avg, &rq->cfs);
+
+ period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
+ rq->util = (u64)(rq->avg.runnable_avg_sum << SCHED_POWER_SHIFT)
+ / period;
}
/* Add the load generated by se into cfs_rq's child load-average */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 804ee41..8682110 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -351,6 +351,9 @@ extern struct root_domain def_root_domain;
#endif /* CONFIG_SMP */
+/* full cpu utilization */
+#define FULL_UTIL SCHED_POWER_SCALE
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -482,6 +485,7 @@ struct rq {
#endif
struct sched_avg avg;
+ unsigned int util;
};
static inline int cpu_of(struct rq *rq)
--
1.7.12
This patch adds the power aware scheduler knob to sysfs:
$cat /sys/devices/system/cpu/sched_balance_policy/available_sched_balance_policy
performance powersaving
$cat /sys/devices/system/cpu/sched_balance_policy/current_sched_balance_policy
powersaving
This means the sched balance policy currently in use is 'powersaving'.
The user can change the policy with the 'echo' command:
echo performance > /sys/devices/system/cpu/sched_balance_policy/current_sched_balance_policy
Signed-off-by: Alex Shi <[email protected]>
---
Documentation/ABI/testing/sysfs-devices-system-cpu | 22 +++++++
kernel/sched/fair.c | 69 ++++++++++++++++++++++
2 files changed, 91 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 9c978dc..b602882 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -53,6 +53,28 @@ Description: Dynamic addition and removal of CPU's. This is not hotplug
the system. Information writtento the file to remove CPU's
is architecture specific.
+What: /sys/devices/system/cpu/sched_balance_policy/current_sched_balance_policy
+ /sys/devices/system/cpu/sched_balance_policy/available_sched_balance_policy
+Date: Oct 2012
+Contact: Linux kernel mailing list <[email protected]>
+Description: CFS balance policy show and set interface.
+
+ available_sched_balance_policy: shows there are 2 kinds of
+ policies:
+ performance powersaving.
+ current_sched_balance_policy: shows current scheduler policy.
+ User can change the policy by writing it.
+
+ Policy decides the CFS scheduler how to balance tasks onto
+ different CPU unit.
+
+ performance: try to spread tasks onto more CPU sockets,
+ more CPU cores. performance oriented.
+
+ powersaving: try to pack tasks onto same core or same CPU
+ until every LCPUs are busy in the core or CPU socket.
+ powersaving oriented.
+
What: /sys/devices/system/cpu/cpu#/node
Date: October 2009
Contact: Linux memory management mailing list <[email protected]>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04dd319..2e49c3f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6096,6 +6096,75 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
/* The default scheduler policy is 'performance'. */
int __read_mostly sched_balance_policy = SCHED_POLICY_PERFORMANCE;
+#ifdef CONFIG_SYSFS
+static ssize_t show_available_sched_balance_policy(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "performance powersaving\n");
+}
+
+static ssize_t show_current_sched_balance_policy(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ if (sched_balance_policy == SCHED_POLICY_PERFORMANCE)
+ return sprintf(buf, "performance\n");
+ else if (sched_balance_policy == SCHED_POLICY_POWERSAVING)
+ return sprintf(buf, "powersaving\n");
+ return 0;
+}
+
+static ssize_t set_sched_balance_policy(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t count)
+{
+ unsigned int ret = -EINVAL;
+ char str_policy[16];
+
+ ret = sscanf(buf, "%15s", str_policy);
+ if (ret != 1)
+ return -EINVAL;
+
+ if (!strcmp(str_policy, "performance"))
+ sched_balance_policy = SCHED_POLICY_PERFORMANCE;
+ else if (!strcmp(str_policy, "powersaving"))
+ sched_balance_policy = SCHED_POLICY_POWERSAVING;
+ else
+ return -EINVAL;
+
+ return count;
+}
+
+/*
+ * * Sysfs setup bits:
+ * */
+static DEVICE_ATTR(current_sched_balance_policy, 0644,
+ show_current_sched_balance_policy, set_sched_balance_policy);
+
+static DEVICE_ATTR(available_sched_balance_policy, 0444,
+ show_available_sched_balance_policy, NULL);
+
+static struct attribute *sched_balance_policy_default_attrs[] = {
+ &dev_attr_current_sched_balance_policy.attr,
+ &dev_attr_available_sched_balance_policy.attr,
+ NULL
+};
+static struct attribute_group sched_balance_policy_attr_group = {
+ .attrs = sched_balance_policy_default_attrs,
+ .name = "sched_balance_policy",
+};
+
+int __init create_sysfs_sched_balance_policy_group(struct device *dev)
+{
+ return sysfs_create_group(&dev->kobj, &sched_balance_policy_attr_group);
+}
+
+static int __init sched_balance_policy_sysfs_init(void)
+{
+ return create_sysfs_sched_balance_policy_group(cpu_subsys.dev_root);
+}
+
+core_initcall(sched_balance_policy_sysfs_init);
+#endif /* CONFIG_SYSFS */
+
/*
* All the scheduling class methods:
*/
--
1.7.12
The power aware fork/exec/wake balancing in the incoming patches needs both
of these structs, so move them ahead of that code.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 99 +++++++++++++++++++++++++++--------------------------
1 file changed, 50 insertions(+), 49 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b7917b..a0bd2f3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3301,6 +3301,56 @@ done:
}
/*
+ * sd_lb_stats - Structure to store the statistics of a sched_domain
+ * during load balancing.
+ */
+struct sd_lb_stats {
+ struct sched_group *busiest; /* Busiest group in this sd */
+ struct sched_group *this; /* Local group in this sd */
+ unsigned long total_load; /* Total load of all groups in sd */
+ unsigned long total_pwr; /* Total power of all groups in sd */
+ unsigned long avg_load; /* Average load across all groups in sd */
+
+ /** Statistics of this group */
+ unsigned long this_load;
+ unsigned long this_load_per_task;
+ unsigned long this_nr_running;
+ unsigned int this_has_capacity;
+ unsigned int this_idle_cpus;
+
+ /* Statistics of the busiest group */
+ unsigned int busiest_idle_cpus;
+ unsigned long max_load;
+ unsigned long busiest_load_per_task;
+ unsigned long busiest_nr_running;
+ unsigned long busiest_group_capacity;
+ unsigned int busiest_has_capacity;
+ unsigned int busiest_group_weight;
+
+ int group_imb; /* Is there imbalance in this sd */
+
+ /* Varibles of power awaring scheduling */
+ unsigned int sd_util; /* sum utilization of this domain */
+ struct sched_group *group_leader; /* Group which relieves group_min */
+};
+
+/*
+ * sg_lb_stats - stats of a sched_group required for load_balancing
+ */
+struct sg_lb_stats {
+ unsigned long avg_load; /*Avg load across the CPUs of the group */
+ unsigned long group_load; /* Total load over the CPUs of the group */
+ unsigned long sum_nr_running; /* Nr tasks running in the group */
+ unsigned long sum_weighted_load; /* Weighted load of group's tasks */
+ unsigned long group_capacity;
+ unsigned long idle_cpus;
+ unsigned long group_weight;
+ int group_imb; /* Is there an imbalance in the group ? */
+ int group_has_capacity; /* Is there extra capacity in the group? */
+ unsigned int group_util; /* sum utilization of group */
+};
+
+/*
* sched_balance_self: balance the current task (running on cpu) in domains
* that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
* SD_BALANCE_EXEC.
@@ -4175,55 +4225,6 @@ static unsigned long task_h_load(struct task_struct *p)
#endif
/********** Helpers for find_busiest_group ************************/
-/*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- * during load balancing.
- */
-struct sd_lb_stats {
- struct sched_group *busiest; /* Busiest group in this sd */
- struct sched_group *this; /* Local group in this sd */
- unsigned long total_load; /* Total load of all groups in sd */
- unsigned long total_pwr; /* Total power of all groups in sd */
- unsigned long avg_load; /* Average load across all groups in sd */
-
- /** Statistics of this group */
- unsigned long this_load;
- unsigned long this_load_per_task;
- unsigned long this_nr_running;
- unsigned long this_has_capacity;
- unsigned int this_idle_cpus;
-
- /* Statistics of the busiest group */
- unsigned int busiest_idle_cpus;
- unsigned long max_load;
- unsigned long busiest_load_per_task;
- unsigned long busiest_nr_running;
- unsigned long busiest_group_capacity;
- unsigned long busiest_has_capacity;
- unsigned int busiest_group_weight;
-
- int group_imb; /* Is there imbalance in this sd */
-
- /* Varibles of power awaring scheduling */
- unsigned int sd_util; /* sum utilization of this domain */
- struct sched_group *group_leader; /* Group which relieves group_min */
-};
-
-/*
- * sg_lb_stats - stats of a sched_group required for load_balancing
- */
-struct sg_lb_stats {
- unsigned long avg_load; /*Avg load across the CPUs of the group */
- unsigned long group_load; /* Total load over the CPUs of the group */
- unsigned long sum_nr_running; /* Nr tasks running in the group */
- unsigned long sum_weighted_load; /* Weighted load of group's tasks */
- unsigned long group_capacity;
- unsigned long idle_cpus;
- unsigned long group_weight;
- int group_imb; /* Is there an imbalance in the group ? */
- int group_has_capacity; /* Is there extra capacity in the group? */
- unsigned int group_util; /* sum utilization of group */
-};
/**
* get_sd_load_idx - Obtain the load index for a given sched domain.
--
1.7.12
This patch adds power aware scheduling in fork/exec/wake. It tries to
select a cpu from the group that is busiest while still having spare
utilization. That saves power since it leaves more groups idle in the
system.
The trade off is the extra power aware statistics collection during group
seeking. But since the collection only happens when power scheduling is
eligible, the worst case, hackbench, only drops about 2% with the
powersaving policy. There is no clear change for the performance policy.
The main function in this patch is get_cpu_for_power_policy(), which tries
to get the idlest cpu from the busiest group that still has spare
utilization, if the system is using a power aware policy and such a group
exists.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 109 +++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 103 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 70a99c9..0c08773 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3372,25 +3372,113 @@ static unsigned int max_rq_util(int cpu)
}
/*
- * sched_balance_self: balance the current task (running on cpu) in domains
+ * Try to collect the task running number and capacity of the group.
+ */
+static void get_sg_power_stats(struct sched_group *group,
+ struct sched_domain *sd, struct sg_lb_stats *sgs)
+{
+ int i;
+
+ for_each_cpu(i, sched_group_cpus(group))
+ sgs->group_util += max_rq_util(i);
+
+ sgs->group_weight = group->group_weight;
+}
+
+/*
+ * Is this domain full of utilization with the task?
+ */
+static int is_sd_full(struct sched_domain *sd,
+ struct task_struct *p, struct sd_lb_stats *sds)
+{
+ struct sched_group *group;
+ struct sg_lb_stats sgs;
+ long sd_min_delta = LONG_MAX;
+ unsigned int putil;
+
+ if (p->se.load.weight == p->se.avg.load_avg_contrib)
+ /* p maybe a new forked task */
+ putil = FULL_UTIL;
+ else
+ putil = (u64)(p->se.avg.runnable_avg_sum << SCHED_POWER_SHIFT)
+ / (p->se.avg.runnable_avg_period + 1);
+
+ /* Try to collect the domain's utilization */
+ group = sd->groups;
+ do {
+ long g_delta;
+
+ memset(&sgs, 0, sizeof(sgs));
+ get_sg_power_stats(group, sd, &sgs);
+
+ g_delta = sgs.group_weight * FULL_UTIL - sgs.group_util;
+
+ if (g_delta > 0 && g_delta < sd_min_delta) {
+ sd_min_delta = g_delta;
+ sds->group_leader = group;
+ }
+
+ sds->sd_util += sgs.group_util;
+ } while (group = group->next, group != sd->groups);
+
+ if (sds->sd_util + putil < sd->span_weight * FULL_UTIL)
+ return 0;
+
+ /* can not hold one more task in this domain */
+ return 1;
+}
+
+/*
+ * Execute power policy if this domain is not full.
+ */
+static inline int get_sd_sched_balance_policy(struct sched_domain *sd,
+ int cpu, struct task_struct *p, struct sd_lb_stats *sds)
+{
+ if (sched_balance_policy == SCHED_POLICY_PERFORMANCE)
+ return SCHED_POLICY_PERFORMANCE;
+
+ memset(sds, 0, sizeof(*sds));
+ if (is_sd_full(sd, p, sds))
+ return SCHED_POLICY_PERFORMANCE;
+ return sched_balance_policy;
+}
+
+/*
+ * If power policy is eligible for this domain, and it has task allowed cpu.
+ * we will select CPU from this domain.
+ */
+static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
+ struct task_struct *p, struct sd_lb_stats *sds)
+{
+ int policy;
+ int new_cpu = -1;
+
+ policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
+ if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
+ new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+
+ return new_cpu;
+}
+
+/*
+ * select_task_rq_fair: balance the current task (running on cpu) in domains
* that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
* SD_BALANCE_EXEC.
*
- * Balance, ie. select the least loaded group.
- *
* Returns the target CPU number, or the same CPU if no balancing is needed.
*
* preempt must be disabled.
*/
static int
-select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
int prev_cpu = task_cpu(p);
int new_cpu = cpu;
int want_affine = 0;
- int sync = wake_flags & WF_SYNC;
+ int sync = flags & WF_SYNC;
+ struct sd_lb_stats sds;
if (p->nr_cpus_allowed == 1)
return prev_cpu;
@@ -3416,11 +3504,20 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
break;
}
- if (tmp->flags & sd_flag)
+ if (tmp->flags & sd_flag) {
sd = tmp;
+
+ new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+ if (new_cpu != -1)
+ goto unlock;
+ }
}
if (affine_sd) {
+ new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+ if (new_cpu != -1)
+ goto unlock;
+
if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
prev_cpu = cpu;
--
1.7.12
In power aware scheduling, we don't want to balance 'prefer_sibling'
groups just because the local group has capacity.
If the local group has no tasks at the time, that is exactly what power
balance hopes for.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0dd29f4..86221e7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4762,8 +4762,12 @@ static inline void update_sd_lb_stats(struct lb_env *env,
* extra check prevents the case where you always pull from the
* heaviest group when it is already under-utilized (possible
* with a large weight task outweighs the tasks on the system).
+ *
+ * In power aware scheduling, we don't care load weight and
+ * want not to pull tasks just because local group has capacity.
*/
- if (prefer_sibling && !local_group && sds->this_has_capacity)
+ if (prefer_sibling && !local_group && sds->this_has_capacity
+ && env->flags & LBF_PERF_BAL)
sgs.group_capacity = min(sgs.group_capacity, 1UL);
if (local_group) {
--
1.7.12
In power balance, we hope some sched groups become fully empty to save
their CPU power. So, we want to move any tasks away from them.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f861427..0dd29f4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5105,7 +5105,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
* When comparing with imbalance, use weighted_cpuload()
* which is not scaled with the cpu power.
*/
- if (capacity && rq->nr_running == 1 && wl > env->imbalance)
+ if (rq->nr_running == 0 ||
+ (!(env->flags & LBF_POWER_BAL) && capacity &&
+ rq->nr_running == 1 && wl > env->imbalance))
continue;
/*
@@ -5208,7 +5210,8 @@ redo:
ld_moved = 0;
lb_iterations = 1;
- if (busiest->nr_running > 1) {
+ if (busiest->nr_running > 1 ||
+ (busiest->nr_running == 1 && env.flags & LBF_POWER_BAL)) {
/*
* Attempt to move tasks. If find_busiest_group has found
* an imbalance but busiest->nr_running <= 1, the group is
--
1.7.12
This patch enables the power aware consideration in load balance.
As mentioned in the power aware scheduler proposal, power aware
scheduling has 2 assumptions:
1, race to idle is helpful for power saving
2, fewer active sched_groups reduce power consumption
The first assumption makes the performance policy take over scheduling when
any scheduler group is busy.
The second assumption makes power aware scheduling try to pack dispersed
tasks into fewer groups.
A summary of the enabling logic:
1, Collect the power aware scheduler statistics during the performance load
balance statistics collection.
2, If the balance cpu is eligible for power load balance, just do it and
skip performance load balance. If the domain is suitable for power
balance but the cpu is inappropriate (idle or full), stop both
power and performance balance in this domain.
3, If the performance balance policy is in use, or power balance is in use
but any group is busy, do performance balance.
The above logic is mainly implemented in update_sd_lb_power_stats(). It
decides whether a domain is suitable for power aware scheduling; if so,
it fills in the destination group and source group accordingly.
This patch reuses some of Suresh's power saving load balance code.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 120 +++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 118 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9326c22..0682472 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -18,6 +18,9 @@
*
* Adaptive scheduling granularity, math enhancements by Peter Zijlstra
* Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <[email protected]>
+ *
+ * Powersaving balance policy added by Alex Shi
+ * Copyright (C) 2013 Intel, Alex Shi <[email protected]>
*/
#include <linux/latencytop.h>
@@ -4398,6 +4401,101 @@ static unsigned long task_h_load(struct task_struct *p)
/********** Helpers for find_busiest_group ************************/
/**
+ * init_sd_lb_power_stats - Initialize power savings statistics for
+ * the given sched_domain, during load balancing.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics for sd.
+ */
+static inline void init_sd_lb_power_stats(struct lb_env *env,
+ struct sd_lb_stats *sds)
+{
+ if (sched_balance_policy == SCHED_POLICY_PERFORMANCE ||
+ env->idle == CPU_NOT_IDLE) {
+ env->flags &= ~LBF_POWER_BAL;
+ env->flags |= LBF_PERF_BAL;
+ return;
+ }
+ env->flags &= ~LBF_PERF_BAL;
+ env->flags |= LBF_POWER_BAL;
+ sds->min_util = UINT_MAX;
+ sds->leader_util = 0;
+}
+
+/**
+ * update_sd_lb_power_stats - Update the power saving stats for a
+ * sched_domain while performing load balancing.
+ *
+ * @env: The load balancing environment.
+ * @group: sched_group belonging to the sched_domain under consideration.
+ * @sds: Variable containing the statistics of the sched_domain
+ * @local_group: Does group contain the CPU for which we're performing
+ * load balancing?
+ * @sgs: Variable containing the statistics of the group.
+ */
+static inline void update_sd_lb_power_stats(struct lb_env *env,
+ struct sched_group *group, struct sd_lb_stats *sds,
+ int local_group, struct sg_lb_stats *sgs)
+{
+ unsigned long threshold_util;
+
+ if (env->flags & LBF_PERF_BAL)
+ return;
+
+ threshold_util = sgs->group_weight * FULL_UTIL;
+
+ /*
+ * If the local group is idle or full loaded
+ * no need to do power savings balance at this domain
+ */
+ if (local_group && (!sgs->sum_nr_running ||
+ sgs->group_util + FULL_UTIL > threshold_util))
+ env->flags &= ~LBF_POWER_BAL;
+
+ /* Do performance load balance if any group overload */
+ if (sgs->group_util > threshold_util) {
+ env->flags |= LBF_PERF_BAL;
+ env->flags &= ~LBF_POWER_BAL;
+ }
+
+ /*
+ * If a group is idle,
+ * don't include that group in power savings calculations
+ */
+ if (!(env->flags & LBF_POWER_BAL) || !sgs->sum_nr_running)
+ return;
+
+ /*
+ * Calculate the group which has the least non-idle load.
+ * This is the group from where we need to pick up the load
+ * for saving power
+ */
+ if ((sgs->group_util < sds->min_util) ||
+ (sgs->group_util == sds->min_util &&
+ group_first_cpu(group) > group_first_cpu(sds->group_min))) {
+ sds->group_min = group;
+ sds->min_util = sgs->group_util;
+ sds->min_load_per_task = sgs->sum_weighted_load /
+ sgs->sum_nr_running;
+ }
+
+ /*
+ * Calculate the group which is almost near its
+ * capacity but still has some space to pick up some load
+ * from other group and save more power
+ */
+ if (sgs->group_util + FULL_UTIL > threshold_util)
+ return;
+
+ if (sgs->group_util > sds->leader_util ||
+ (sgs->group_util == sds->leader_util && sds->group_leader &&
+ group_first_cpu(group) < group_first_cpu(sds->group_leader))) {
+ sds->group_leader = group;
+ sds->leader_util = sgs->group_util;
+ }
+}
+
+/**
* get_sd_load_idx - Obtain the load index for a given sched domain.
* @sd: The sched_domain whose load_idx is to be obtained.
* @idle: The Idle status of the CPU for whose sd load_icx is obtained.
@@ -4634,6 +4732,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_load += load;
sgs->sum_nr_running += nr_running;
sgs->sum_weighted_load += weighted_cpuload(i);
+
+ /* add scaled rq utilization */
+ sgs->group_util += max_rq_util(i);
+
if (idle_cpu(i))
sgs->idle_cpus++;
}
@@ -4742,6 +4844,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
if (child && child->flags & SD_PREFER_SIBLING)
prefer_sibling = 1;
+ init_sd_lb_power_stats(env, sds);
load_idx = get_sd_load_idx(env->sd, env->idle);
do {
@@ -4793,6 +4896,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
sds->group_imb = sgs.group_imb;
}
+ update_sd_lb_power_stats(env, sg, sds, local_group, &sgs);
sg = sg->next;
} while (sg != env->sd->groups);
}
@@ -5010,6 +5114,19 @@ find_busiest_group(struct lb_env *env, int *balance)
*/
update_sd_lb_stats(env, balance, &sds);
+ if (!(env->flags & LBF_POWER_BAL) && !(env->flags & LBF_PERF_BAL))
+ return NULL;
+
+ if (env->flags & LBF_POWER_BAL) {
+ if (sds.this == sds.group_leader &&
+ sds.group_leader != sds.group_min) {
+ env->imbalance = sds.min_load_per_task;
+ return sds.group_min;
+ }
+ env->flags &= ~LBF_POWER_BAL;
+ return NULL;
+ }
+
/*
* this_cpu is not the appropriate cpu to perform load balancing at
* this level.
@@ -5187,7 +5304,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.idle = idle,
.loop_break = sched_nr_migrate_break,
.cpus = cpus,
- .flags = LBF_PERF_BAL,
+ .flags = LBF_POWER_BAL,
};
cpumask_copy(cpus, cpu_active_mask);
@@ -6265,7 +6382,6 @@ void unregister_fair_sched_group(struct task_group *tg, int cpu) { }
#endif /* CONFIG_FAIR_GROUP_SCHED */
-
static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task)
{
struct sched_entity *se = &task->se;
--
1.7.12
Packing tasks into a domain whose cpus share cpu power can't save power;
it only costs performance. So don't do power balance on such domains.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 047a1b3..3a0284b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3503,7 +3503,7 @@ static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
- if (wakeup)
+ if (wakeup && !(sd->flags & SD_SHARE_CPUPOWER))
new_cpu = find_leader_cpu(sds->group_leader,
p, cpu, policy);
/* for fork balancing and a little busy task */
@@ -4410,8 +4410,9 @@ static unsigned long task_h_load(struct task_struct *p)
static inline void init_sd_lb_power_stats(struct lb_env *env,
struct sd_lb_stats *sds)
{
- if (sched_balance_policy == SCHED_POLICY_PERFORMANCE ||
- env->idle == CPU_NOT_IDLE) {
+ if (sched_balance_policy == SCHED_POLICY_PERFORMANCE
+ || env->sd->flags & SD_SHARE_CPUPOWER
+ || env->idle == CPU_NOT_IDLE) {
env->flags &= ~LBF_POWER_BAL;
env->flags |= LBF_PERF_BAL;
return;
--
1.7.12
When the number of active tasks in a sched domain waves around the power
friendly scheduling criteria, scheduling will thrash between the power
friendly balance and the performance balance, bringing unnecessary task
migration. The typical benchmark is 'make -j x'.
To remove this issue, introduce a u64 perf_lb_record variable to record the
performance load balance history. If there was no performance LB in the
last 32 consecutive load balances, or no LB at all for 8 * max_interval ms,
or no more than 4 performance LBs in the last 64 load balances, then we
accept a power friendly LB. Otherwise, give up this power friendly LB
chance and do nothing.
With this patch, the worst case for power scheduling -- kbuild -- gets
similar performance/watts values among the different policies.
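For illustration (my reading of the PERF_LB_*_MASK checks in
need_perf_balance() below): perf_lb_record is shifted left once per balance
attempt and bit 0 is set whenever a performance LB happens, so the low 32
bits cover the most recent 32 balances and the high 32 bits the 32 before
that. A power friendly LB is only accepted when the low half is all zero
and the high half has at most 4 bits set, or after the record was cleared
because no balance ran within 8 * max_interval.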
Signed-off-by: Alex Shi <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 68 ++++++++++++++++++++++++++++++++++++++++++---------
2 files changed, 57 insertions(+), 12 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 226a515..4b9b810 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -906,6 +906,7 @@ struct sched_domain {
unsigned long last_balance; /* init to jiffies. units in jiffies */
unsigned int balance_interval; /* initialise to 1. units in ms. */
unsigned int nr_balance_failed; /* initialise to 0 */
+ u64 perf_lb_record; /* performance balance record */
u64 last_update;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0682472..047a1b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4495,6 +4495,60 @@ static inline void update_sd_lb_power_stats(struct lb_env *env,
}
}
+#define PERF_LB_HH_MASK 0xffffffff00000000ULL
+#define PERF_LB_LH_MASK 0xffffffffULL
+
+/**
+ * need_perf_balance - Check if the performance load balance needed
+ * in the sched_domain.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics of the sched_domain
+ */
+static int need_perf_balance(struct lb_env *env, struct sd_lb_stats *sds)
+{
+ env->sd->perf_lb_record <<= 1;
+
+ if (env->flags & LBF_PERF_BAL) {
+ env->sd->perf_lb_record |= 0x1;
+ return 1;
+ }
+
+ /*
+ * The situation isn't eligible for performance balance. If this_cpu
+ * is not eligible or the timing is not suitable for lazy powersaving
+ * balance, we will stop both powersaving and performance balance.
+ */
+ if (env->flags & LBF_POWER_BAL && sds->this == sds->group_leader
+ && sds->group_leader != sds->group_min) {
+ int interval;
+
+ /* powersaving balance interval set as 8 * max_interval */
+ interval = msecs_to_jiffies(8 * env->sd->max_interval);
+ if (time_after(jiffies, env->sd->last_balance + interval))
+ env->sd->perf_lb_record = 0;
+
+ /*
+ * A eligible timing is no performance balance in last 32
+ * balance and performance balance is no more than 4 times
+ * in last 64 balance, or no balance in powersaving interval
+ * time.
+ */
+ if ((hweight64(env->sd->perf_lb_record & PERF_LB_HH_MASK) <= 4)
+ && !(env->sd->perf_lb_record & PERF_LB_LH_MASK)) {
+
+ env->imbalance = sds->min_load_per_task;
+ return 0;
+ }
+
+ }
+
+ /* give up this time power balancing, do nothing */
+ env->flags &= ~LBF_POWER_BAL;
+ sds->group_min = NULL;
+ return 0;
+}
+
/**
* get_sd_load_idx - Obtain the load index for a given sched domain.
* @sd: The sched_domain whose load_idx is to be obtained.
@@ -5114,18 +5168,8 @@ find_busiest_group(struct lb_env *env, int *balance)
*/
update_sd_lb_stats(env, balance, &sds);
- if (!(env->flags & LBF_POWER_BAL) && !(env->flags & LBF_PERF_BAL))
- return NULL;
-
- if (env->flags & LBF_POWER_BAL) {
- if (sds.this == sds.group_leader &&
- sds.group_leader != sds.group_min) {
- env->imbalance = sds.min_load_per_task;
- return sds.group_min;
- }
- env->flags &= ~LBF_POWER_BAL;
- return NULL;
- }
+ if (!need_perf_balance(env, &sds))
+ return sds.group_min;
/*
* this_cpu is not the appropriate cpu to perform load balancing at
--
1.7.12
From: Preeti U Murthy <[email protected]>
Problem:
select_task_rq_fair() returns a target CPU/waking CPU if no balancing is
required. However, with the current power aware scheduling in this path, an
invalid CPU might be returned.
If get_cpu_for_power_policy() fails to find a new_cpu for the forked task,
then there is a possibility that new_cpu could remain -1 until the end of
select_task_rq_fair(), if the search for a new cpu further on in this
function also fails. Since this scenario is unexpected by the callers of
select_task_rq_fair(), it needs to be fixed.
Fix:
Do not intermix the variables meant to reflect the target CPU of power save
and performance policies. If the target CPU of powersave is successful in being
found, return it. Else allow the performance policy to take a call on the
target CPU.
The above scenario was caught when a kernel crash resulted in a bad data
access interrupt during a kernbench run on a 2-socket, 16-core machine with
SMT-4 on each core.
Signed-off-by: Preeti U Murthy <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3a0284b..142c1ee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3529,6 +3529,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
int cpu = smp_processor_id();
int prev_cpu = task_cpu(p);
int new_cpu = cpu;
+ int power_cpu = -1;
int want_affine = 0;
int sync = flags & WF_SYNC;
struct sd_lb_stats sds;
@@ -3560,16 +3561,16 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
if (tmp->flags & sd_flag) {
sd = tmp;
- new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
+ power_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
sd_flag & SD_BALANCE_WAKE);
- if (new_cpu != -1)
+ if (power_cpu != -1)
goto unlock;
}
}
if (affine_sd) {
- new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 1);
- if (new_cpu != -1)
+ power_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 1);
+ if (power_cpu != -1)
goto unlock;
if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
@@ -3619,8 +3620,10 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
}
unlock:
rcu_read_unlock();
+ if (power_cpu == -1)
+ return new_cpu;
- return new_cpu;
+ return power_cpu;
}
/*
--
1.7.12
Add 4 new members to sd_lb_stats that will be used in the incoming
power aware balance:
group_min; // least utilization group in domain
min_load_per_task; // load_per_task in group_min
leader_util; // sum utilization of group_leader
min_util; // sum utilization of group_min
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 86221e7..9326c22 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3333,6 +3333,10 @@ struct sd_lb_stats {
/* Varibles of power awaring scheduling */
unsigned int sd_util; /* sum utilization of this domain */
struct sched_group *group_leader; /* Group which relieves group_min */
+ struct sched_group *group_min; /* Least loaded group in sd */
+ unsigned long min_load_per_task; /* load_per_task in group_min */
+ unsigned int leader_util; /* sum utilizations of group_leader */
+ unsigned int min_util; /* sum utilizations of group_min */
};
/*
--
1.7.12
If the woken task is transitory enough, it gets a chance to be packed onto
a cpu which is busy but still has time to take care of it.
For the powersaving policy, only a task with history util < 25% has a
chance to be packed. If no cpu is eligible to handle it, the idlest cpu in
the leader group will be used.
Morten Rasmussen caught a typo bug, and PeterZ reminded me to consider
rt_util. Thank you!
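A rough worked example of the 25% criterion (based on the vacancy check in
find_leader_cpu() below): a task whose runnable_avg_sum is 20% of its period
has putil ~= 204; shifted left by 2 that claims ~816 of capacity, so it can
only be packed onto a cpu whose max_rq_util() is below roughly 208. A task
at 25% or more can never be packed this way, since putil << 2 already
reaches FULL_UTIL.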
Inspired-by: Vincent Guittot <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 48 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a729939..6145ed2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3449,19 +3449,60 @@ static inline int get_sd_sched_balance_policy(struct sched_domain *sd,
}
/*
+ * find_leader_cpu - find the busiest but still has enough free time cpu
+ * among the cpus in group.
+ */
+static int
+find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu,
+ int policy)
+{
+ int vacancy, min_vacancy = INT_MAX;
+ int leader_cpu = -1;
+ int i;
+ /* percentage of the task's util */
+ unsigned putil = (u64)(p->se.avg.runnable_avg_sum << SCHED_POWER_SHIFT)
+ / (p->se.avg.runnable_avg_period + 1);
+
+ /* bias toward local cpu */
+ if (cpumask_test_cpu(this_cpu, tsk_cpus_allowed(p)) &&
+ FULL_UTIL - max_rq_util(this_cpu) - (putil << 2) > 0)
+ return this_cpu;
+
+ /* Traverse only the allowed CPUs */
+ for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
+ if (i == this_cpu)
+ continue;
+
+ /* only light task allowed, putil < 25% */
+ vacancy = FULL_UTIL - max_rq_util(i) - (putil << 2);
+
+ if (vacancy > 0 && vacancy < min_vacancy) {
+ min_vacancy = vacancy;
+ leader_cpu = i;
+ }
+ }
+ return leader_cpu;
+}
+
+/*
* If power policy is eligible for this domain, and it has task allowed cpu.
* we will select CPU from this domain.
*/
static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
- struct task_struct *p, struct sd_lb_stats *sds)
+ struct task_struct *p, struct sd_lb_stats *sds, int wakeup)
{
int policy;
int new_cpu = -1;
policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
- if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
- new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
-
+ if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
+ if (wakeup)
+ new_cpu = find_leader_cpu(sds->group_leader,
+ p, cpu, policy);
+ /* for fork balancing and a little busy task */
+ if (new_cpu == -1)
+ new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+ }
return new_cpu;
}
@@ -3512,14 +3553,15 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
if (tmp->flags & sd_flag) {
sd = tmp;
- new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+ new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
+ sd_flag & SD_BALANCE_WAKE);
if (new_cpu != -1)
goto unlock;
}
}
if (affine_sd) {
- new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+ new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 1);
if (new_cpu != -1)
goto unlock;
--
1.7.12
If a sched domain is idle enough for regular power balance, LBF_POWER_BAL
will be set and LBF_PERF_BAL will be cleared. If a sched domain is busy,
the flags are set the opposite way.
If the domain is suitable for power balance but the balance should not be
done by this cpu (this cpu is already idle or full), both flags are cleared
so we wait for a suitable cpu to do the power balance.
That means no balance at all, neither power balance nor performance
balance, will be done on this cpu.
The above logic will be implemented by the incoming patches.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6145ed2..f861427 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4013,6 +4013,8 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
#define LBF_ALL_PINNED 0x01
#define LBF_NEED_BREAK 0x02
#define LBF_SOME_PINNED 0x04
+#define LBF_POWER_BAL 0x08 /* if power balance allowed */
+#define LBF_PERF_BAL 0x10 /* if performance balance allowed */
struct lb_env {
struct sched_domain *sd;
@@ -5175,6 +5177,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.idle = idle,
.loop_break = sched_nr_migrate_break,
.cpus = cpus,
+ .flags = LBF_PERF_BAL,
};
cpumask_copy(cpus, cpu_active_mask);
--
1.7.12
Sleeping tasks have no utilization, so when they are woken up in a burst,
the zero utilization puts the scheduler out of balance, as in the aim7
benchmark.
rq->avg_idle is 'used to accommodate bursty loads in a dirt simple
dirt cheap manner' -- Mike Galbraith.
With this cheap and smart bursty indicator, we can detect the wakeup
burst and use nr_running as the instant utilization in this scenario.
For other scenarios, we still use the precise CPU utilization to
judge whether a domain is eligible for power scheduling.
Thanks to Mike Galbraith for the idea!
Thanks to Namhyung for the suggestion to compact the burst handling into
max_rq_util()!
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f610313..a729939 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3363,6 +3363,10 @@ static unsigned int max_rq_util(int cpu)
unsigned int cfs_util;
unsigned int nr_running;
+ /* use nr_running as instant utilization for burst cpu */
+ if (cpu_rq(cpu)->avg_idle < sysctl_sched_burst_threshold)
+ return rq->nr_running * FULL_UTIL;
+
/* yield cfs utilization to rt's, if total utilization > 100% */
cfs_util = min(rq->util, (unsigned int)(FULL_UTIL - rt_util));
--
1.7.12
rq->avg_idle is 'used to accommodate bursty loads in a dirt simple
dirt cheap manner' -- Mike Galbraith.
With this cheap and smart bursty indicator, we can detect the wakeup burst
and then just use nr_running as the instant utilization.
'sysctl_sched_burst_threshold' is used to detect a wakeup burst. Since we
check every cpu's avg_idle, the value can be more precise, so set it to
0.1ms.
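A usage sketch, assuming the usual /proc/sys/kernel mapping for kern_table
entries (the exact path is my assumption, not part of this patch):
$cat /proc/sys/kernel/sched_burst_threshold_ns
100000
$echo 200000 > /proc/sys/kernel/sched_burst_threshold_ns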
Signed-off-by: Alex Shi <[email protected]>
---
include/linux/sched/sysctl.h | 1 +
kernel/sched/fair.c | 1 +
kernel/sysctl.c | 7 +++++++
3 files changed, 9 insertions(+)
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..a3c3d43 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -53,6 +53,7 @@ extern unsigned int sysctl_numa_balancing_settle_count;
#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_migration_cost;
+extern unsigned int sysctl_sched_burst_threshold;
extern unsigned int sysctl_sched_nr_migrate;
extern unsigned int sysctl_sched_time_avg;
extern unsigned int sysctl_timer_migration;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c08773..f610313 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -91,6 +91,7 @@ unsigned int sysctl_sched_wakeup_granularity = 1000000UL;
unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;
const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
+const_debug unsigned int sysctl_sched_burst_threshold = 100000UL;
/*
* The exponential sliding window over which load is averaged for shares
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afc1dc6..1f23457 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -327,6 +327,13 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
{
+ .procname = "sched_burst_threshold_ns",
+ .data = &sysctl_sched_burst_threshold,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "sched_nr_migrate",
.data = &sysctl_sched_nr_migrate,
.maxlen = sizeof(unsigned int),
--
1.7.12
Since rt task priority is higher than fair tasks, cfs_rq utilization is
just what is left after the rt utilization.
When there are some cfs tasks in the queue, the potential utilization may
be held back, so multiply by the cfs task number to get the maximum
potential utilization of cfs. Then the rq utilization is the sum of the rt
util and the cfs util.
Thanks to Paul Turner and Namhyung for the reminder!
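A quick numeric example of the formula (as implemented in max_rq_util()
below): with rt_util = 256, rq->util = 512 and 2 runnable cfs tasks, the
potential maximum utilization is 256 + min(512, 1024 - 256) * 2 = 1280.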
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c47933f..70a99c9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3350,6 +3350,27 @@ struct sg_lb_stats {
unsigned int group_util; /* sum utilization of group */
};
+static unsigned long scale_rt_util(int cpu);
+
+/*
+ * max_rq_util - get the possible maximum cpu utilization
+ */
+static unsigned int max_rq_util(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ unsigned int rt_util = scale_rt_util(cpu);
+ unsigned int cfs_util;
+ unsigned int nr_running;
+
+ /* yield cfs utilization to rt's, if total utilization > 100% */
+ cfs_util = min(rq->util, (unsigned int)(FULL_UTIL - rt_util));
+
+ /* count transitory task utilization */
+ nr_running = max(rq->cfs.h_nr_running, (unsigned int)1);
+
+ return rt_util + cfs_util * nr_running;
+}
+
/*
* sched_balance_self: balance the current task (running on cpu) in domains
* that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
--
1.7.12
For power aware balancing, we care about the sched domain/group's
utilization, so add sd_lb_stats.sd_util and sg_lb_stats.group_util.
We also want to know which group is the busiest while still having the
capability to handle more tasks, so add sd_lb_stats.group_leader.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7124244..6b7917b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4203,6 +4203,10 @@ struct sd_lb_stats {
unsigned int busiest_group_weight;
int group_imb; /* Is there imbalance in this sd */
+
+ /* Varibles of power awaring scheduling */
+ unsigned int sd_util; /* sum utilization of this domain */
+ struct sched_group *group_leader; /* Group which relieves group_min */
};
/*
@@ -4218,6 +4222,7 @@ struct sg_lb_stats {
unsigned long group_weight;
int group_imb; /* Is there an imbalance in the group ? */
int group_has_capacity; /* Is there extra capacity in the group? */
+ unsigned int group_util; /* sum utilization of group */
};
/**
--
1.7.12
scale_rt_power() used to represent the CPU utilization left after
subtracting the rt utilization, so the name scale_rt_power was a bit
inappropriate.
Since we need the rt utilization in some incoming patches, change the
return value of this function to the rt utilization and rename it
scale_rt_util(). Its usage in update_cpu_power() is changed accordingly.
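A small numeric check of the equivalence (based on the hunks below): if rt
consumed a quarter of the averaging period, scale_rt_util() now returns
~256, and update_cpu_power() multiplies by SCHED_POWER_SCALE - 256 = 768,
the same factor the old scale_rt_power() returned as available/total.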
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a0bd2f3..c47933f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4277,10 +4277,10 @@ unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu)
return default_scale_smt_power(sd, cpu);
}
-unsigned long scale_rt_power(int cpu)
+unsigned long scale_rt_util(int cpu)
{
struct rq *rq = cpu_rq(cpu);
- u64 total, available, age_stamp, avg;
+ u64 total, age_stamp, avg;
/*
* Since we're reading these variables without serialization make sure
@@ -4292,10 +4292,8 @@ unsigned long scale_rt_power(int cpu)
total = sched_avg_period() + (rq->clock - age_stamp);
if (unlikely(total < avg)) {
- /* Ensures that power won't end up being negative */
- available = 0;
- } else {
- available = total - avg;
+ /* Ensures rt utilization won't go beyond the full scaled value */
+ return SCHED_POWER_SCALE;
}
if (unlikely((s64)total < SCHED_POWER_SCALE))
@@ -4303,7 +4301,7 @@ unsigned long scale_rt_power(int cpu)
total >>= SCHED_POWER_SHIFT;
- return div_u64(available, total);
+ return div_u64(avg, total);
}
static void update_cpu_power(struct sched_domain *sd, int cpu)
@@ -4330,7 +4328,7 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
power >>= SCHED_POWER_SHIFT;
- power *= scale_rt_power(cpu);
+ power *= SCHED_POWER_SCALE - scale_rt_util(cpu);
power >>= SCHED_POWER_SHIFT;
if (!power)
--
1.7.12
We need to initialize se.avg.{decay_count, load_avg_contrib} for a
newly forked task.
Otherwise the random values of these variables cause a mess when the new
task is enqueued:
enqueue_task_fair
enqueue_entity
enqueue_entity_load_avg
and make fork balancing imbalanced due to the incorrect load_avg_contrib.
Set avg.decay_count = 0 and avg.load_avg_contrib = se->load.weight to
resolve such issues.
Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 6 ++++++
kernel/sched/fair.c | 4 ++++
2 files changed, 10 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 54eaaa2..8843cd3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1564,6 +1564,7 @@ static void __sched_fork(struct task_struct *p)
#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
+ p->se.avg.decay_count = 0;
#endif
#ifdef CONFIG_SCHEDSTATS
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
@@ -1651,6 +1652,11 @@ void sched_fork(struct task_struct *p)
p->sched_reset_on_fork = 0;
}
+ /* New forked task assumed with full utilization */
+#if defined(CONFIG_SMP)
+ p->se.avg.load_avg_contrib = p->se.load.weight;
+#endif
+
if (!rt_prio(p->prio))
p->sched_class = &fair_sched_class;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c2f726..2881d42 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1508,6 +1508,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
* We track migrations using entity decay_count <= 0, on a wake-up
* migration we use a negative decay count to track the remote decays
* accumulated while sleeping.
+ *
+ * When enqueue a new forked task, the se->avg.decay_count == 0, so
+ * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
+ * value: se->load.weight.
*/
if (unlikely(se->avg.decay_count <= 0)) {
se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
--
1.7.12
Hi Alex,
On Thu, 4 Apr 2013 10:00:56 +0800, Alex Shi wrote:
> In power balance, we hope some sched groups are fully empty to save
> CPU power of them. So, we want to move any tasks from them.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/fair.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f861427..0dd29f4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5105,7 +5105,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> * When comparing with imbalance, use weighted_cpuload()
> * which is not scaled with the cpu power.
> */
> - if (capacity && rq->nr_running == 1 && wl > env->imbalance)
> + if (rq->nr_running == 0 ||
> + (!(env->flags & LBF_POWER_BAL) && capacity &&
> + rq->nr_running == 1 && wl > env->imbalance))
Just out of curious.
In load_balance(), we only move normal tasks, right?
Then shouldn't it check rq->cfs.h_nr_running rather than rq->nr_running?
Thanks,
Namhyung
> continue;
>
> /*
> @@ -5208,7 +5210,8 @@ redo:
>
> ld_moved = 0;
> lb_iterations = 1;
> - if (busiest->nr_running > 1) {
> + if (busiest->nr_running > 1 ||
> + (busiest->nr_running == 1 && env.flags & LBF_POWER_BAL)) {
> /*
> * Attempt to move tasks. If find_busiest_group has found
> * an imbalance but busiest->nr_running <= 1, the group is
On 04/04/2013 01:59 PM, Namhyung Kim wrote:
>> > - if (capacity && rq->nr_running == 1 && wl > env->imbalance)
>> > + if (rq->nr_running == 0 ||
>> > + (!(env->flags & LBF_POWER_BAL) && capacity &&
>> > + rq->nr_running == 1 && wl > env->imbalance))
> Just out of curious.
>
> In load_balance(), we only move normal tasks, right?
>
> Then shouldn't it check rq->cfs.h_nr_running rather than rq->nr_running?
Yes, it seems so.
What's your opinion of this, Peter?
--
Thanks
Alex
Hi Alex,
On 04/04/2013 07:31 AM, Alex Shi wrote:
> Packing tasks among such domain can't save power, just performance
> losing. So no power balance on them.
As far as my understanding goes, the powersave policy is the one that tries
to pack tasks onto a SIBLING domain (a domain where SD_SHARE_CPUPOWER is
set). The balance policy does not do that, meaning it does not pack on the
domain that shares CPU power, but packs across all other domains. So the
change you are making below results in nothing but the default behaviour
of the balance policy.
Correct me if I am wrong, but my point is that it looks to me as if the
powersave policy is introduced in this patchset, and with the below patch its
characteristic behaviour of packing onto domains sharing cpu power is
removed, thus making it default to the balance policy. Now there are two
policies which behave the same way: balance and powersave.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/fair.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 047a1b3..3a0284b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3503,7 +3503,7 @@ static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
>
> policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
> if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
> - if (wakeup)
> + if (wakeup && !(sd->flags & SD_SHARE_CPUPOWER))
> new_cpu = find_leader_cpu(sds->group_leader,
> p, cpu, policy);
> /* for fork balancing and a little busy task */
> @@ -4410,8 +4410,9 @@ static unsigned long task_h_load(struct task_struct *p)
> static inline void init_sd_lb_power_stats(struct lb_env *env,
> struct sd_lb_stats *sds)
> {
> - if (sched_balance_policy == SCHED_POLICY_PERFORMANCE ||
> - env->idle == CPU_NOT_IDLE) {
> + if (sched_balance_policy == SCHED_POLICY_PERFORMANCE
> + || env->sd->flags & SD_SHARE_CPUPOWER
> + || env->idle == CPU_NOT_IDLE) {
> env->flags &= ~LBF_POWER_BAL;
> env->flags |= LBF_PERF_BAL;
> return;
>
Regards
Preeti U Murthy
Hi Alex,
I am sorry, I overlooked the changes you have made to the power
scheduling policies. Now you have just two: performance and powersave.
Hence you can ignore my below comments. But if you use group->capacity
instead of group->weight for the threshold, like you did for the balance
policy in version 5 of this patchset, don't you think the below patch can be
avoided? group->capacity being the threshold will automatically ensure
that you don't pack onto domains that share cpu power.
Regards
Preeti U Murthy
On 04/08/2013 08:47 AM, Preeti U Murthy wrote:
> Hi Alex,
>
> On 04/04/2013 07:31 AM, Alex Shi wrote:
>> Packing tasks among such domain can't save power, just performance
>> losing. So no power balance on them.
>
> As far as my understanding goes, powersave policy is the one that tries
> to pack tasks onto a SIBLING domain( domain where SD_SHARE_CPUPOWER is
> set).balance policy does not do that,meaning it does not pack on the
> domain that shares CPU power,but packs across all other domains.So the
> change you are making below results in nothing but the default behaviour
> of balance policy.
>
> Correct me if I am wrong but my point is,looks to me,that the powersave
> policy is introduced in this patchset,and with the below patch its
> characteristic behaviour of packing onto domains sharing cpu power is
> removed,thus making it default to balance policy.Now there are two
> policies which behave the same way:balance and powersave.
>
>>
>> Signed-off-by: Alex Shi <[email protected]>
>> ---
>> kernel/sched/fair.c | 7 ++++---
>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 047a1b3..3a0284b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3503,7 +3503,7 @@ static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
>>
>> policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
>> if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
>> - if (wakeup)
>> + if (wakeup && !(sd->flags & SD_SHARE_CPUPOWER))
>> new_cpu = find_leader_cpu(sds->group_leader,
>> p, cpu, policy);
>> /* for fork balancing and a little busy task */
>> @@ -4410,8 +4410,9 @@ static unsigned long task_h_load(struct task_struct *p)
>> static inline void init_sd_lb_power_stats(struct lb_env *env,
>> struct sd_lb_stats *sds)
>> {
>> - if (sched_balance_policy == SCHED_POLICY_PERFORMANCE ||
>> - env->idle == CPU_NOT_IDLE) {
>> + if (sched_balance_policy == SCHED_POLICY_PERFORMANCE
>> + || env->sd->flags & SD_SHARE_CPUPOWER
>> + || env->idle == CPU_NOT_IDLE) {
>> env->flags &= ~LBF_POWER_BAL;
>> env->flags |= LBF_PERF_BAL;
>> return;
>>
>
> Regards
> Preeti U Murthy
>
On 04/08/2013 11:25 AM, Preeti U Murthy wrote:
> Hi Alex,
>
> I am sorry I overlooked the changes you have made to the power
> scheduling policies.Now you have just two : performance and powersave.
>
> Hence you can ignore my below comments.But if you use group->capacity
> instead of group->weight for threshold,like you did for balance policy
> in your version5 of this patchset, dont you think the below patch can be
> avoided? group->capacity being the threshold will automatically ensure
> that you dont pack onto domains that share cpu power.
This patch is different from the balance policy: powersave still tries to
move 2 busy tasks onto one cpu core on Intel cpus. It just doesn't keep
packing within the cpu core. For example, if there are 2 half-busy tasks on
one cpu core, with this patch each SMT thread gets one half-busy task;
without this patch, the 2 half-busy tasks are packed onto one thread.
The removed balance policy just packed one busy task per cpu core. Yes,
the 'balance' policy has its meaning, but that is different.
--
Thanks Alex
On 04/03/2013 10:00 PM, Alex Shi wrote:
> As mentioned in the power aware scheduling proposal, Power aware
> scheduling has 2 assumptions:
> 1, race to idle is helpful for power saving
> 2, less active sched groups will reduce cpu power consumption
[email protected] should be cc:
on Linux proposals that affect power.
> Since the patch can perfect pack tasks into fewer groups, I just show
> some performance/power testing data here:
> =========================================
> $for ((i = 0; i < x; i++)) ; do while true; do :; done & done
>
> On my SNB laptop with 4 core* HT: the data is avg Watts
> powersaving performance
> x = 8 72.9482 72.6702
> x = 4 61.2737 66.7649
> x = 2 44.8491 59.0679
> x = 1 43.225 43.0638
> on SNB EP machine with 2 sockets * 8 cores * HT:
> powersaving performance
> x = 32 393.062 395.134
> x = 16 277.438 376.152
> x = 8 209.33 272.398
> x = 4 199 238.309
> x = 2 175.245 210.739
> x = 1 174.264 173.603
The numbers above say nothing about performance,
and thus don't tell us much.
In particular, they don't tell us if reducing power
by hacking the scheduler is more or less efficient
than using the existing techniques that are already shipping,
such as controlling P-states.
> tasks number keep waving benchmark, 'make -j <x> vmlinux'
> on my SNB EP 2 sockets machine with 8 cores * HT:
> powersaving performance
> x = 2 189.416 /228 23 193.355 /209 24
Energy = Power * Time
189.416*228 = 43186.848 Joules for powersaving to retire the workload
193.355*209 = 40411.195 Joules for performance to retire the workload.
So the net effect of the 'powersaving' mode here is:
1. 228/209 = 9% performance degradation
2. 43186.848/40411.195 = 6.9 % more energy to retire the workload.
These numbers suggest that this patch series simultaneously
has a negative impact on performance and energy required
to retire the workload. Why do it?
> x = 4 215.728 /132 35 219.69 /122 37
ditto here.
8% increase in time.
6% increase in energy.
> x = 8 244.31 /75 54 252.709 /68 58
ditto here
10% increase in time.
6% increase in energy.
> x = 16 299.915 /43 77 259.127 /58 66
Are you sure that powersave mode ran in 43 seconds
when performance mode ran in 58 seconds?
If that is true, than somewhere in this patch series
you have a _significant_ performance benefit
on this workload under these conditions!
Interestingly, powersave mode also ran at
15% higher power than performance mode.
maybe "powersave" isn't quite the right name for it:-)
> x = 32 341.221 /35 83 323.418 /38 81
Why does this patch series have a performance impact (8%)
at x=32. All the processors are always busy, no?
> data explains: 189.416 /228 23
> 189.416: average Watts during compilation
> 228: seconds(compile time)
> 23: scaled performance/watts = 1000000 / seconds / watts
> The performance value of kbuild is better on threads 16/32, that's due
> to lazy power balance reduced the context switch and CPU has more boost
> chance on powersaving balance.
25% is a huge difference in performance.
Can you get a performance benefit in that scenario
without having a negative performance impact
in the other scenarios? In particular,
an 8% hit to the fully utilized case is a deal killer.
The x=16 performance change here suggest there is value
someplace in this patch series to increase performance.
However, the case that these scheduling changes are
a benefit from an energy efficiency point of view
is yet to be made.
thanks,
-Len Brown
Intel Open Source Technology Center
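For reference, a minimal standalone sketch (illustrative only, not part of
the patch set) that re-derives the two figures used above from an
(average watts, seconds) pair, taking the x = 2 kbuild row as input;
Energy = Power * Time, and the thread's scaled perf/watt =
1000000 / seconds / watts:

#include <stdio.h>

struct sample { const char *policy; double watts; double seconds; };

int main(void)
{
	/* the x = 2 kbuild row quoted above */
	struct sample rows[] = {
		{ "powersaving", 189.416, 228.0 },
		{ "performance", 193.355, 209.0 },
	};
	unsigned int i;

	for (i = 0; i < sizeof(rows) / sizeof(rows[0]); i++) {
		/* Energy = Power * Time; scaled perf/watt = 1000000 / seconds / watts */
		double joules = rows[i].watts * rows[i].seconds;
		double perf_per_watt = 1000000.0 / rows[i].seconds / rows[i].watts;

		printf("%-12s %10.3f J  %5.1f scaled perf/W\n",
		       rows[i].policy, joules, perf_per_watt);
	}
	return 0;
}

This reproduces the ~43187 J vs ~40411 J comparison and, after truncation,
the 23 vs 24 perf/watt values quoted above.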
On 04/12/2013 05:02 AM, Len Brown wrote:
>> > x = 16 299.915 /43 77 259.127 /58 66
> Are you sure that powersave mode ran in 43 seconds
> when performance mode ran in 58 seconds?
Thanks a lot for comments, Len!
Will do more testing with your tool fspin. :)
powersaving uses less time when threads = 16 or 32.
The main contribution comes from CPU freq boost. I disabled the boost in
cpufreq and found the compile time becomes similar between powersaving and
performance at 32 threads, while powersaving is slower at 16 threads.
Fewer context switches, thanks to the lazy power balance, should also
help some.
>
> If that is true, than somewhere in this patch series
> you have a _significant_ performance benefit
> on this workload under these conditions!
>
> Interestingly, powersave mode also ran at
> 15% higher power than performance mode.
> maybe "powersave" isn't quite the right name for it:-)
What other name would you suggest? :)
>
>> > x = 32 341.221 /35 83 323.418 /38 81
> Why does this patch series have a performance impact (8%)
> at x=32. All the processors are always busy, no?
No, not all processors are always busy in 'make -j vmlinux'.
So the compile time also benefits from boost and fewer context switches. The
performance policy doesn't introduce any impact; there is nothing added
to the performance policy.
>
>> > data explains: 189.416 /228 23
>> > 189.416: average Watts during compilation
>> > 228: seconds(compile time)
>> > 23: scaled performance/watts = 1000000 / seconds / watts
>> > The performance value of kbuild is better on threads 16/32, that's due
>> > to lazy power balance reduced the context switch and CPU has more boost
>> > chance on powersaving balance.
> 25% is a huge difference in performance.
> Can you get a performance benefit in that scenario
> without having a negative performance impact
> in the other scenarios? In particular,
Will try packing tasks by cpu capacity instead of cpu weight.
> an 8% hit to the fully utilized case is a deal killer.
That is an 8% gain for powersaving, not an 8% loss for the performance policy. :)
>
> The x=16 performance change here suggest there is value
> someplace in this patch series to increase performance.
> However, the case that these scheduling changes are
> a benefit from an energy efficiency point of view
> is yet to be made.
--
Thanks Alex
On Fri, Apr 12, 2013 at 04:46:50PM +0800, Alex Shi wrote:
> Thanks a lot for comments, Len!
AFAICT, you kinda forgot to answer his most important question:
> These numbers suggest that this patch series simultaneously
> has a negative impact on performance and energy required
> to retire the workload. Why do it?
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On Fri, 2013-04-12 at 18:23 +0200, Borislav Petkov wrote:
> On Fri, Apr 12, 2013 at 04:46:50PM +0800, Alex Shi wrote:
> > Thanks a lot for comments, Len!
>
> AFAICT, you kinda forgot to answer his most important question:
>
> > These numbers suggest that this patch series simultaneously
> > has a negative impact on performance and energy required
> > to retire the workload. Why do it?
Hm. When I tested AIM7 compute on a NUMA box, there was a marked
throughput increase at the low to moderate load end of the test spectrum
IIRC. Fully repeatable. There were also other benefits unrelated to
power, ie mitigation of the evil face of select_idle_sibling(). I
rather liked what I saw during ~big box test-drive.
(just saying there are other aspects besides joules in there)
-Mike
On Fri, Apr 12, 2013 at 06:48:31PM +0200, Mike Galbraith wrote:
> (just saying there are other aspects besides joules in there)
Yeah, but we don't allow any regressions in sched*, do we? Can we pick
only the good cherries? :-)
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On 04/13/2013 12:23 AM, Borislav Petkov wrote:
> On Fri, Apr 12, 2013 at 04:46:50PM +0800, Alex Shi wrote:
>> > Thanks a lot for comments, Len!
> AFAICT, you kinda forgot to answer his most important question:
>
>> > These numbers suggest that this patch series simultaneously
>> > has a negative impact on performance and energy required
>> > to retire the workload. Why do it?
Even if in some scenarios the total energy cost is higher, at least the avg
watts dropped in those scenarios. Len said a low p-state can work there,
but that is different. I had sent some data to another email list to show
the difference:
The following are 2 runs of the kbuild test under 3 conditions on an
SNB EP box. The middle column is the lowest p-state result; we can see it
has the lowest power consumption, but also the lowest performance/watts
value.
At least for the kbuild benchmark, the powersaving policy offers the best
compromise between power saving and power efficiency. Furthermore, due to
the cpu boost feature, it has better performance in some scenarios.
powersaving + ondemand userspace + fixed 1.2GHz performance+ondemand
x = 8 231.318 /75 57 165.063 /166 36 253.552 /63 62
x = 16 280.357 /49 72 174.408 /106 54 296.776 /41 82
x = 32 325.206 /34 90 178.675 /90 62 314.153 /37 86
x = 8 233.623 /74 57 164.507 /168 36 254.775 /65 60
x = 16 272.54 /38 96 174.364 /106 54 297.731 /42 79
x = 32 320.758 /34 91 177.917 /91 61 317.875 /35 89
x = 64 326.837 /33 92 179.037 /90 62 320.615 /36 86
--
Thanks
Alex
On 04/13/2013 01:12 AM, Borislav Petkov wrote:
> On Fri, Apr 12, 2013 at 06:48:31PM +0200, Mike Galbraith wrote:
>> (just saying there are other aspects besides joules in there)
>
> Yeah, but we don't allow any regressions in sched*, do we? Can we pick
> only the good cherries? :-)
>
Thanks for all the discussion on this thread. :)
I think we can bear a little loss of power efficiency when we want power saving.
For the second question, the performance increase comes from the cpu boost
feature as the hardware defines it: if some cores in a cpu socket are idle,
the other cores have more chance to boost to a higher frequency. The task
packing tries to pack tasks so that more cores are left idle.
The difficulty in merging this feature into the current performance policy
is that the current balance policy tries to give each task as much cpu
resource as possible, which conflicts with the cpu boost condition.
--
Thanks
Alex
On 04/14/2013 09:28 AM, Alex Shi wrote:
>>>> >> > These numbers suggest that this patch series simultaneously
>>>> >> > has a negative impact on performance and energy required
>>>> >> > to retire the workload. Why do it?
> Even some scenario the total energy cost more, at least the avg watts
> dropped in that scenarios. Len said he has low p-state which can work
> there. but that's is different. I had sent some data in another email
> list to show the difference:
>
> The following is 2 times kbuild testing result for 3 kinds condiation on
> SNB EP box, the middle column is the lowest p-state testing result, we
> can see, it has the lowest power consumption, also has the lowest
> performance/watts value.
> At least for kbuild benchmark, powersaving policy has the best
> compromise on powersaving and power efficient. Further more, due to cpu
> boost feature, it has better performance in some scenarios.
BTW, another benefit of powersaving is that the powersaving policy is very
flexible with respect to system load: when the task number in a sched domain
goes beyond the LCPU number, it switches to performance oriented balance.
That gives similar performance when the system is busy.
--
Thanks
Alex
On Sun, Apr 14, 2013 at 09:28:50AM +0800, Alex Shi wrote:
> Even some scenario the total energy cost more, at least the avg watts
> dropped in that scenarios.
Ok, what's wrong with x = 32 then? So basically if you're looking at
avg watts, you don't want to have more than 16 threads, otherwise
powersaving sucks on that particular uarch and platform. Can you say
that for all platforms out there?
Also, I've added in the columns below the Energy = Power * Time thing.
And the funny thing is, exactly there where avg watts is better in
powersaving, energy for workload retire is worse. And the other way
around. Basically, avg watts vs retire energy is reciprocal. Great :-\.
> Len said he has low p-state which can work there. but that's is
> different. I had sent some data in another email list to show the
> difference:
>
> The following is 2 times kbuild testing result for 3 kinds condiation on
> SNB EP box, the middle column is the lowest p-state testing result, we
> can see, it has the lowest power consumption, also has the lowest
> performance/watts value.
> At least for kbuild benchmark, powersaving policy has the best
> compromise on powersaving and power efficient. Further more, due to cpu
> boost feature, it has better performance in some scenarios.
>
> powersaving + ondemand userspace + fixed 1.2GHz performance+ondemand
> x = 8 231.318 /75 57 165.063 /166 36 253.552 /63 62
> x = 16 280.357 /49 72 174.408 /106 54 296.776 /41 82
> x = 32 325.206 /34 90 178.675 /90 62 314.153 /37 86
>
> x = 8 233.623 /74 57 164.507 /168 36 254.775 /65 60
> x = 16 272.54 /38 96 174.364 /106 54 297.731 /42 79
> x = 32 320.758 /34 91 177.917 /91 61 317.875 /35 89
> x = 64 326.837 /33 92 179.037 /90 62 320.615 /36 86
17348.850 27400.458 15973.776
13737.493 18487.248 12167.816
11057.004 16080.750 11623.661
17288.102 27637.176 16560.375
10356.52 18482.584 12504.702
10905.772 16190.447 11125.625
10785.621 16113.330 11542.140
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On 04/14/2013 11:59 PM, Borislav Petkov wrote:
> On Sun, Apr 14, 2013 at 09:28:50AM +0800, Alex Shi wrote:
>> Even some scenario the total energy cost more, at least the avg watts
>> dropped in that scenarios.
>
> Ok, what's wrong with x = 32 then? So basically if you're looking at
> avg watts, you don't want to have more than 16 threads, otherwise
> powersaving sucks on that particular uarch and platform. Can you say
> that for all platforms out there?
The cpu freq boost makes the avg watts higher with x = 32, and also gives
higher power efficiency. We can disable cpu freq boost if we want lower
power consumption all the time.
But to my understanding, better power efficiency is the better way to save power.
As for other platforms, I'm glad to see any testing; please give it a try
and send me the results...
>
> Also, I've added in the columns below the Energy = Power * Time thing.
Thanks. BTW, the third value in each column is 'performance/watt', which
shows a similar meaning from the other side. :)
>
> And the funny thing is, exactly there where avg watts is better in
> powersaving, energy for workload retire is worse. And the other way
> around. Basically, avg watts vs retire energy is reciprocal. Great :-\.
>
>> Len said he has low p-state which can work there. but that's is
>> different. I had sent some data in another email list to show the
>> difference:
>>
>> The following is 2 times kbuild testing result for 3 kinds condiation on
>> SNB EP box, the middle column is the lowest p-state testing result, we
>> can see, it has the lowest power consumption, also has the lowest
>> performance/watts value.
>> At least for kbuild benchmark, powersaving policy has the best
>> compromise on powersaving and power efficient. Further more, due to cpu
>> boost feature, it has better performance in some scenarios.
>>
>> powersaving + ondemand userspace + fixed 1.2GHz performance+ondemand
>> x = 8 231.318 /75 57 165.063 /166 36 253.552 /63 62
>> x = 16 280.357 /49 72 174.408 /106 54 296.776 /41 82
>> x = 32 325.206 /34 90 178.675 /90 62 314.153 /37 86
>>
>> x = 8 233.623 /74 57 164.507 /168 36 254.775 /65 60
>> x = 16 272.54 /38 96 174.364 /106 54 297.731 /42 79
>> x = 32 320.758 /34 91 177.917 /91 61 317.875 /35 89
>> x = 64 326.837 /33 92 179.037 /90 62 320.615 /36 86
>
> 17348.850 27400.458 15973.776
> 13737.493 18487.248 12167.816
> 11057.004 16080.750 11623.661
>
> 17288.102 27637.176 16560.375
> 10356.52 18482.584 12504.702
> 10905.772 16190.447 11125.625
> 10785.621 16113.330 11542.140
>
--
Thanks Alex
On 04/15/2013 02:04 PM, Alex Shi wrote:
> On 04/14/2013 11:59 PM, Borislav Petkov wrote:
>> > On Sun, Apr 14, 2013 at 09:28:50AM +0800, Alex Shi wrote:
>>> >> Even some scenario the total energy cost more, at least the avg watts
>>> >> dropped in that scenarios.
>> >
>> > Ok, what's wrong with x = 32 then? So basically if you're looking at
>> > avg watts, you don't want to have more than 16 threads, otherwise
>> > powersaving sucks on that particular uarch and platform. Can you say
>> > that for all platforms out there?
> The cpu freq boost make the avg watts higher with x = 32, and also make
> higher power efficiency. We can disable cpu freq boost for this if we
> want lower power consumption all time.
> But for my understanding, the power efficient is better way to save power.
BTW, the lowest p-state, no freq boost, plus this powersaving policy will
give the lowest power consumption.
And I need to say again: the powersaving policy only takes effect when the
system is under-utilised. When the system goes busy, it has no effect;
the performance oriented policy takes over the balance behaviour.
--
Thanks Alex
On Mon, Apr 15, 2013 at 02:16:55PM +0800, Alex Shi wrote:
> And I need to say again. the powersaving policy just effect on system
> under utilisation. when system goes busy, it won't has effect.
> performance oriented policy will take over balance behaviour.
And AFACU your patches, you do this automatically, right? In which case,
an underutilized system will have switched to powersaving balancing and
will use *more* energy to retire the workload. Correct?
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On 04/15/2013 05:52 PM, Borislav Petkov wrote:
> On Mon, Apr 15, 2013 at 02:16:55PM +0800, Alex Shi wrote:
>> And I need to say again. the powersaving policy just effect on system
>> under utilisation. when system goes busy, it won't has effect.
>> performance oriented policy will take over balance behaviour.
>
> And AFACU your patches, you do this automatically, right?
Yes
> In which case,
> an underutilized system will have switched to powersaving balancing and
> will use *more* energy to retire the workload. Correct?
>
Considering fairness and the total thread counts, powersaving costs quite
similar energy on the kbuild benchmark, and is sometimes even better.
17348.850 27400.458 15973.776
13737.493 18487.248 12167.816
11057.004 16080.750 11623.661
17288.102 27637.176 16560.375
10356.52 18482.584 12504.702
10905.772 16190.447 11125.625
10785.621 16113.330 11542.140
--
Thanks
Alex
On Mon, Apr 15, 2013 at 09:50:22PM +0800, Alex Shi wrote:
> For fairness and total threads consideration, powersaving cost quit
> similar energy on kbuild benchmark, and even better.
>
> 17348.850 27400.458 15973.776
> 13737.493 18487.248 12167.816
Yeah, but those lines don't look good - powersaving needs more energy
than performance.
And what is even crazier is that fixed 1.2 GHz case. I'd guess in
the normal case those cores are at triple the freq. - i.e. somewhere
around 3-4 GHz. And yet, 1.2 GHz eats almost *double* the power than
performance and powersaving.
So for the x=8 and maybe even the x=16 case we're basically better off
with performance.
Or could it be that the power measurements are not really that accurate
and those numbers above are not really correct?
Hmm.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On 04/16/2013 07:12 AM, Borislav Petkov wrote:
> On Mon, Apr 15, 2013 at 09:50:22PM +0800, Alex Shi wrote:
>> For fairness and total threads consideration, powersaving cost quit
>> similar energy on kbuild benchmark, and even better.
>>
>> 17348.850 27400.458 15973.776
>> 13737.493 18487.248 12167.816
>
> Yeah, but those lines don't look good - powersaving needs more energy
> than performance.
>
> And what is even crazier is that fixed 1.2 GHz case. I'd guess in
> the normal case those cores are at triple the freq. - i.e. somewhere
> around 3-4 GHz. And yet, 1.2 GHz eats almost *double* the power than
> performance and powersaving.
Yes, the max freq is 2.7 GHz, plus boost.
>
> So for the x=8 and maybe even the x=16 case we're basically better off
> with performance.
>
> Or could it be that the power measurements are not really that accurate
> and those numbers above are not really correct?
Testing has a little variation, but the power data is quite accurate. I
may change to packing tasks by cpu capacity rather than the current cpu
weight. That should give a better power efficiency value.
>
> Hmm.
>
--
Thanks Alex
On Tue, Apr 16, 2013 at 08:22:19AM +0800, Alex Shi wrote:
> testing has a little variation, but the power data is quite accurate.
> I may change to packing tasks per cpu capacity than current cpu
> weight. that should has better power efficient value.
Yeah, this probably needs careful measuring - and by "this" I mean how
to place N tasks where N is less than number of cores in the system.
I can imagine trying to migrate them all together on a single physical
socket (maybe even overcommitting it) so that you can flush the caches
of the cores on the other sockets and so that you can power down the
other sockets and avoid coherent traffic from waking them up, to be one
strategy. My supposition here is that maybe putting the whole unused
sockets in a deep sleep state could save a lot of power.
Or not, who knows. Only empirical measurements should show us what
actually happens.
Thanks.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On 04/16/2013 06:24 PM, Borislav Petkov wrote:
> On Tue, Apr 16, 2013 at 08:22:19AM +0800, Alex Shi wrote:
>> testing has a little variation, but the power data is quite accurate.
>> I may change to packing tasks per cpu capacity than current cpu
>> weight. that should has better power efficient value.
>
> Yeah, this probably needs careful measuring - and by "this" I mean how
> to place N tasks where N is less than number of cores in the system.
>
> I can imagine trying to migrate them all together on a single physical
> socket (maybe even overcommitting it) so that you can flush the caches
> of the cores on the other sockets and so that you can power down the
> other sockets and avoid coherent traffic from waking them up, to be one
> strategy. My supposition here is that maybe putting the whole unused
> sockets in a deep sleep state could save a lot of power.
Sure. Currently, even if the whole socket goes to sleep, the memory on
the node is still accessed, so the cpu socket still spends some power on
the 'uncore' part. So the further step is to reduce remote memory accesses
to save more power, and that is also what numa balancing wants to do.
And then the next step is to detect whether this socket is cache intensive,
i.e. whether there is much cache thrashing on the node.
In theory, there is still a lot of tuning space. :)
>
> Or not, who knows. Only empirical measurements should show us what
> actually happens.
Sure. :)
>
> Thanks.
>
--
Thanks Alex
On Wed, Apr 17, 2013 at 09:18:28AM +0800, Alex Shi wrote:
> Sure. Currently if the whole socket get into sleep, but the memory on
> the node is still accessed. the cpu socket still spend some power on
> 'uncore' part. So the further step is reduce the remote memory access
> to save more power, and that is also numa balance want to do.
Yeah, if you also mean, you need to further migrate the memory of the
threads away from the node so that it doesn't need to serve memory
accesses from other sockets, then that should probably help save even
more power. You probably would still need to serve probes from the L3
but your DRAM links will be powered down and such.
> And then the next step is to detect if this socket is cache intensive,
> if there is much cache thresh on the node.
Yeah, that would be probably harder to determine - is cache thrashing
(and I think you mean L3 here) worse than migrating tasks to other nodes
and having them powered on just because my current node is not supposed
to thrash L3. Hmm.
> In theory, there is still has lots of tuning space. :)
Yep. :)
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On 04/12/2013 12:48 PM, Mike Galbraith wrote:
> On Fri, 2013-04-12 at 18:23 +0200, Borislav Petkov wrote:
>> On Fri, Apr 12, 2013 at 04:46:50PM +0800, Alex Shi wrote:
>>> Thanks a lot for comments, Len!
>>
>> AFAICT, you kinda forgot to answer his most important question:
>>
>>> These numbers suggest that this patch series simultaneously
>>> has a negative impact on performance and energy required
>>> to retire the workload. Why do it?
>
> Hm. When I tested AIM7 compute on a NUMA box, there was a marked
> throughput increase at the low to moderate load end of the test spectrum
> IIRC. Fully repeatable. There were also other benefits unrelated to
> power, ie mitigation of the evil face of select_idle_sibling(). I
> rather liked what I saw during ~big box test-drive.
>
> (just saying there are other aspects besides joules in there)
Mike,
Can you re-run your AIM7 measurement with turbo-mode and HT-mode disabled,
and then independently re-enable them?
If you still see the performance benefit, then that proves
that the scheduler hacks are not about tricking into
turbo mode, but something else.
If the performance gains *are* about interactions with turbo-mode,
then perhaps what we should really be doing here is making
the scheduler explicitly turbo-aware? Of course, that begs the question
of how the scheduler should be aware of cpufreq in general...
thanks,
Len Brown, Intel Open Source Technology Center
On Wed, 2013-04-17 at 17:53 -0400, Len Brown wrote:
> On 04/12/2013 12:48 PM, Mike Galbraith wrote:
> > On Fri, 2013-04-12 at 18:23 +0200, Borislav Petkov wrote:
> >> On Fri, Apr 12, 2013 at 04:46:50PM +0800, Alex Shi wrote:
> >>> Thanks a lot for comments, Len!
> >>
> >> AFAICT, you kinda forgot to answer his most important question:
> >>
> >>> These numbers suggest that this patch series simultaneously
> >>> has a negative impact on performance and energy required
> >>> to retire the workload. Why do it?
> >
> > Hm. When I tested AIM7 compute on a NUMA box, there was a marked
> > throughput increase at the low to moderate load end of the test spectrum
> > IIRC. Fully repeatable. There were also other benefits unrelated to
> > power, ie mitigation of the evil face of select_idle_sibling(). I
> > rather liked what I saw during ~big box test-drive.
> >
> > (just saying there are other aspects besides joules in there)
>
> Mike,
>
> Can you re-run your AIM7 measurement with turbo-mode and HT-mode disabled,
> and then independently re-enable them?
Unfortunately no, because I don't have remote access to buttons.
> If you still see the performance benefit, then that proves
> that the scheduler hacks are not about tricking into
> turbo mode, but something else.
Yeah, turbo playing a role in that makes lots of sense. Someone else
will have to test that though. It was 100% repeatable, so should be
easy to verify.
-Mike
On Wed, 2013-04-17 at 17:53 -0400, Len Brown wrote:
> On 04/12/2013 12:48 PM, Mike Galbraith wrote:
> > On Fri, 2013-04-12 at 18:23 +0200, Borislav Petkov wrote:
> >> On Fri, Apr 12, 2013 at 04:46:50PM +0800, Alex Shi wrote:
> >>> Thanks a lot for comments, Len!
> >>
> >> AFAICT, you kinda forgot to answer his most important question:
> >>
> >>> These numbers suggest that this patch series simultaneously
> >>> has a negative impact on performance and energy required
> >>> to retire the workload. Why do it?
> >
> > Hm. When I tested AIM7 compute on a NUMA box, there was a marked
> > throughput increase at the low to moderate load end of the test spectrum
> > IIRC. Fully repeatable. There were also other benefits unrelated to
> > power, ie mitigation of the evil face of select_idle_sibling(). I
> > rather liked what I saw during ~big box test-drive.
> >
> > (just saying there are other aspects besides joules in there)
>
> Mike,
>
> Can you re-run your AIM7 measurement with turbo-mode and HT-mode disabled,
> and then independently re-enable them?
>
> If you still see the performance benefit, then that proves
> that the scheduler hacks are not about tricking into
> turbo mode, but something else.
I did that today, neither turbo nor HT affected the performance gain. I
used the same box and patch set as tested before (v4), but plugged into
linus HEAD. "powersaving" AIM7 numbers are ~identical to those I posted
before, "performance" is lower at the low end of AIM7 test spectrum, but
as before, delta goes away once the load becomes hefty.
-Mike
On Fri, 2013-04-26 at 17:11 +0200, Mike Galbraith wrote:
> On Wed, 2013-04-17 at 17:53 -0400, Len Brown wrote:
> > On 04/12/2013 12:48 PM, Mike Galbraith wrote:
> > > On Fri, 2013-04-12 at 18:23 +0200, Borislav Petkov wrote:
> > >> On Fri, Apr 12, 2013 at 04:46:50PM +0800, Alex Shi wrote:
> > >>> Thanks a lot for comments, Len!
> > >>
> > >> AFAICT, you kinda forgot to answer his most important question:
> > >>
> > >>> These numbers suggest that this patch series simultaneously
> > >>> has a negative impact on performance and energy required
> > >>> to retire the workload. Why do it?
> > >
> > > Hm. When I tested AIM7 compute on a NUMA box, there was a marked
> > > throughput increase at the low to moderate load end of the test spectrum
> > > IIRC. Fully repeatable. There were also other benefits unrelated to
> > > power, ie mitigation of the evil face of select_idle_sibling(). I
> > > rather liked what I saw during ~big box test-drive.
> > >
> > > (just saying there are other aspects besides joules in there)
> >
> > Mike,
> >
> > Can you re-run your AIM7 measurement with turbo-mode and HT-mode disabled,
> > and then independently re-enable them?
> >
> > If you still see the performance benefit, then that proves
> > that the scheduler hacks are not about tricking into
> > turbo mode, but something else.
>
> I did that today, neither turbo nor HT affected the performance gain. I
> used the same box and patch set as tested before (v4), but plugged into
> linus HEAD. "powersaving" AIM7 numbers are ~identical to those I posted
> before, "performance" is lower at the low end of AIM7 test spectrum, but
> as before, delta goes away once the load becomes hefty.
Well now, that's not exactly what I expected to see for AIM7 compute.
Filesystem is munching cycles otherwise used for compute when load is
spread across the whole box vs consolidated.
performance
PerfTop: 35 irqs/sec kernel:94.3% exact: 0.0% [1000Hz cycles], (all, 80 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ______________________________ ________________________________________
9367.00 15.5% jbd2_journal_put_journal_head /lib/modules/3.9.0-default/build/vmlinux
7658.00 12.7% jbd2_journal_add_journal_head /lib/modules/3.9.0-default/build/vmlinux
7042.00 11.7% jbd2_journal_grab_journal_head /lib/modules/3.9.0-default/build/vmlinux
4433.00 7.4% sieve /abuild/mike/aim7/multitask
3248.00 5.4% jbd_lock_bh_state /lib/modules/3.9.0-default/build/vmlinux
3034.00 5.0% do_get_write_access /lib/modules/3.9.0-default/build/vmlinux
2058.00 3.4% mul_double /abuild/mike/aim7/multitask
2038.00 3.4% add_double /abuild/mike/aim7/multitask
1365.00 2.3% native_write_msr_safe /lib/modules/3.9.0-default/build/vmlinux
1333.00 2.2% __find_get_block /lib/modules/3.9.0-default/build/vmlinux
1213.00 2.0% add_long /abuild/mike/aim7/multitask
1208.00 2.0% add_int /abuild/mike/aim7/multitask
1084.00 1.8% __wait_on_bit_lock /lib/modules/3.9.0-default/build/vmlinux
1065.00 1.8% div_double /abuild/mike/aim7/multitask
901.00 1.5% intel_idle /lib/modules/3.9.0-default/build/vmlinux
812.00 1.3% _raw_spin_lock_irqsave /lib/modules/3.9.0-default/build/vmlinux
559.00 0.9% jbd2_journal_dirty_metadata /lib/modules/3.9.0-default/build/vmlinux
464.00 0.8% copy_user_generic_string /lib/modules/3.9.0-default/build/vmlinux
455.00 0.8% div_int /abuild/mike/aim7/multitask
430.00 0.7% string_rtns_1 /abuild/mike/aim7/multitask
419.00 0.7% strncat /lib64/libc-2.11.3.so
412.00 0.7% wake_bit_function /lib/modules/3.9.0-default/build/vmlinux
347.00 0.6% jbd2_journal_cancel_revoke /lib/modules/3.9.0-default/build/vmlinux
346.00 0.6% ext4_mark_iloc_dirty /lib/modules/3.9.0-default/build/vmlinux
306.00 0.5% __brelse /lib/modules/3.9.0-default/build/vmlinux
powersaving
PerfTop: 59 irqs/sec kernel:78.0% exact: 0.0% [1000Hz cycles], (all, 80 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ______________________________ ________________________________________
6383.00 22.5% sieve /abuild/mike/aim7/multitask
2380.00 8.4% mul_double /abuild/mike/aim7/multitask
2375.00 8.4% add_double /abuild/mike/aim7/multitask
1678.00 5.9% add_long /abuild/mike/aim7/multitask
1633.00 5.8% add_int /abuild/mike/aim7/multitask
1338.00 4.7% div_double /abuild/mike/aim7/multitask
770.00 2.7% strncat /lib64/libc-2.11.3.so
698.00 2.5% string_rtns_1 /abuild/mike/aim7/multitask
678.00 2.4% copy_user_generic_string /lib/modules/3.9.0-default/build/vmlinux
569.00 2.0% div_int /abuild/mike/aim7/multitask
329.00 1.2% jbd2_journal_put_journal_head /lib/modules/3.9.0-default/build/vmlinux
306.00 1.1% array_rtns /abuild/mike/aim7/multitask
298.00 1.1% do_get_write_access /lib/modules/3.9.0-default/build/vmlinux
270.00 1.0% jbd2_journal_add_journal_head /lib/modules/3.9.0-default/build/vmlinux
258.00 0.9% _int_malloc /lib64/libc-2.11.3.so
251.00 0.9% __find_get_block /lib/modules/3.9.0-default/build/vmlinux
236.00 0.8% __memset /lib/modules/3.9.0-default/build/vmlinux
224.00 0.8% jbd2_journal_grab_journal_head /lib/modules/3.9.0-default/build/vmlinux
221.00 0.8% intel_idle /lib/modules/3.9.0-default/build/vmlinux
161.00 0.6% jbd_lock_bh_state /lib/modules/3.9.0-default/build/vmlinux
161.00 0.6% start_this_handle /lib/modules/3.9.0-default/build/vmlinux
153.00 0.5% __GI_memset /lib64/libc-2.11.3.so
147.00 0.5% ext4_do_update_inode /lib/modules/3.9.0-default/build/vmlinux
135.00 0.5% jbd2_journal_stop /lib/modules/3.9.0-default/build/vmlinux
123.00 0.4% jbd2_journal_dirty_metadata /lib/modules/3.9.0-default/build/vmlinux
performance
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
14 7 0 47716456 255124 674808 0 0 0 0 6183 93733 1 3 95 1 0
0 0 0 47791912 255152 602068 0 0 0 2671 14526 49606 2 2 94 1 0
1 0 0 47794384 255152 603796 0 0 0 0 68 111 0 0 100 0 0
8 6 0 47672340 255156 730040 0 0 0 0 36249 103961 2 8 86 4 0
0 0 0 47793976 255216 604616 0 0 0 2686 5322 6379 2 1 97 0 0
0 0 0 47799128 255216 603108 0 0 0 0 62 106 0 0 100 0 0
3 0 0 47795972 255300 603136 0 0 0 2626 39115 146228 3 5 88 3 0
0 0 0 47797176 255300 603284 0 0 0 43 128 216 0 0 100 0 0
0 0 0 47803244 255300 602580 0 0 0 0 78 124 0 0 100 0 0
0 0 0 47789120 255336 603940 0 0 0 2676 14085 85798 3 3 92 1 0
powersaving
0 0 0 47820780 255516 590292 0 0 0 31 81 126 0 0 100 0 0
0 0 0 47823712 255516 589376 0 0 0 0 107 190 0 0 100 0 0
0 0 0 47826608 255516 588060 0 0 0 0 76 130 0 0 100 0 0
0 0 0 47811260 255632 602080 0 0 0 2678 106 200 0 0 100 0 0
0 0 0 47812548 255632 601892 0 0 0 0 69 110 0 0 100 0 0
0 0 0 47808284 255680 604400 0 0 0 2668 1588 3451 4 2 94 0 0
0 0 0 47810300 255680 603624 0 0 0 0 77 124 0 0 100 0 0
20 3 0 47760764 255720 643744 0 0 1 0 948 2817 2 1 97 0 0
0 0 0 47817828 255756 602400 0 0 1 2703 984 797 2 0 98 0 0
0 0 0 47819548 255756 602532 0 0 0 0 93 158 0 0 100 0 0
1 0 0 47819312 255792 603080 0 0 0 2661 1774 3348 4 2 94 0 0
0 0 0 47821912 255800 602608 0 0 0 2 66 107 0 0 100 0 0
Invisible ink is pretty expensive stuff.
-Mike
On Tue, 2013-04-30 at 07:16 +0200, Mike Galbraith wrote:
> Well now, that's not exactly what I expected to see for AIM7 compute.
> Filesystem is munching cycles otherwise used for compute when load is
> spread across the whole box vs consolidated.
So AIM7 compute performance delta boils down to: powersaving stacks
tasks, so they pat single bit of spinning rust sequentially/gently.
-Mike
* Mike Galbraith <[email protected]> wrote:
> On Tue, 2013-04-30 at 07:16 +0200, Mike Galbraith wrote:
>
> > Well now, that's not exactly what I expected to see for AIM7 compute.
> > Filesystem is munching cycles otherwise used for compute when load is
> > spread across the whole box vs consolidated.
>
> So AIM7 compute performance delta boils down to: powersaving stacks
> tasks, so they pat single bit of spinning rust sequentially/gently.
So AIM7 with real block IO improved, due to sequentiality. Does it improve
if AIM7 works on an SSD, or into ramdisk?
Which are the workloads where 'powersaving' mode hurts workload
performance measurably?
Thanks,
Ingo
On Tue, 2013-04-30 at 10:41 +0200, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:
>
> > On Tue, 2013-04-30 at 07:16 +0200, Mike Galbraith wrote:
> >
> > > Well now, that's not exactly what I expected to see for AIM7 compute.
> > > Filesystem is munching cycles otherwise used for compute when load is
> > > spread across the whole box vs consolidated.
> >
> > So AIM7 compute performance delta boils down to: powersaving stacks
> > tasks, so they pat single bit of spinning rust sequentially/gently.
>
> So AIM7 with real block IO improved, due to sequentiality. Does it improve
> if AIM7 works on an SSD, or into ramdisk?
Seriously doubt it, but I suppose I can try tmpfs.
performance
Tasks jobs/min jti jobs/min/task real cpu
20 11170.51 99 558.5253 10.85 15.19 Tue Apr 30 11:21:46 2013
20 11078.61 99 553.9305 10.94 15.59 Tue Apr 30 11:21:57 2013
20 11191.14 99 559.5568 10.83 15.29 Tue Apr 30 11:22:08 2013
powersaving
Tasks jobs/min jti jobs/min/task real cpu
20 10978.26 99 548.9130 11.04 19.25 Tue Apr 30 11:22:38 2013
20 10988.21 99 549.4107 11.03 18.71 Tue Apr 30 11:22:49 2013
20 11008.17 99 550.4087 11.01 18.85 Tue Apr 30 11:23:00 2013
Nope.
> Which are the workloads where 'powersaving' mode hurts workload
> performance measurably?
Well, it'll lose throughput any time there's parallel execution
potential but it's serialized instead.. using average will inevitably
stack tasks sometimes, but that's its goal. Hackbench shows it.
performance
monteverdi:/abuild/mike/aim7/:[0]# hackbench -l 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.487
monteverdi:/abuild/mike/aim7/:[0]# hackbench -l 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.487
monteverdi:/abuild/mike/aim7/:[0]# hackbench -l 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.497
powersaving
monteverdi:/abuild/mike/aim7/:[0]# hackbench -l 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.702
monteverdi:/abuild/mike/aim7/:[0]# hackbench -l 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.679
monteverdi:/abuild/mike/aim7/:[0]# hackbench -l 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 1.137
-Mike
On Tue, 2013-04-30 at 11:35 +0200, Mike Galbraith wrote:
> On Tue, 2013-04-30 at 10:41 +0200, Ingo Molnar wrote:
> > Which are the workloads where 'powersaving' mode hurts workload
> > performance measurably?
>
> Well, it'll lose throughput any time there's parallel execution
> potential but it's serialized instead.. using average will inevitably
> stack tasks sometimes, but that's its goal. Hackbench shows it.
(but that consolidation can be a winner too, and I bet a nickel it would
be for a socket-sized pgbench run)
On Tue, 2013-04-30 at 11:49 +0200, Mike Galbraith wrote:
> On Tue, 2013-04-30 at 11:35 +0200, Mike Galbraith wrote:
> > On Tue, 2013-04-30 at 10:41 +0200, Ingo Molnar wrote:
>
> > > Which are the workloads where 'powersaving' mode hurts workload
> > > performance measurably?
> >
> > Well, it'll lose throughput any time there's parallel execution
> > potential but it's serialized instead.. using average will inevitably
> > stack tasks sometimes, but that's its goal. Hackbench shows it.
>
> (but that consolidation can be a winner too, and I bet a nickle it would
> be for a socket sized pgbench run)
(belay that, was thinking of keeping all tasks on a single node, but
it'll likely stack the whole thing on a CPU or two, if so, it'll hurt)
Hi Alex,
The below patch looks good to me.
On 04/04/2013 07:30 AM, Alex Shi wrote:
> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
> new forked task.
> Otherwise random values of above variables cause mess when do new task
> enqueue:
> enqueue_task_fair
> enqueue_entity
> enqueue_entity_load_avg
>
> and make forking balancing imbalance since incorrect load_avg_contrib.
>
> set avg.decay_count = 0, and avg.load_avg_contrib = se->load.weight to
> resolve such issues.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/core.c | 6 ++++++
> kernel/sched/fair.c | 4 ++++
> 2 files changed, 10 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 54eaaa2..8843cd3 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1564,6 +1564,7 @@ static void __sched_fork(struct task_struct *p)
> #ifdef CONFIG_SMP
> p->se.avg.runnable_avg_period = 0;
> p->se.avg.runnable_avg_sum = 0;
> + p->se.avg.decay_count = 0;
> #endif
> #ifdef CONFIG_SCHEDSTATS
> memset(&p->se.statistics, 0, sizeof(p->se.statistics));
> @@ -1651,6 +1652,11 @@ void sched_fork(struct task_struct *p)
> p->sched_reset_on_fork = 0;
> }
>
> + /* New forked task assumed with full utilization */
> +#if defined(CONFIG_SMP)
> + p->se.avg.load_avg_contrib = p->se.load.weight;
> +#endif
> +
> if (!rt_prio(p->prio))
> p->sched_class = &fair_sched_class;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9c2f726..2881d42 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1508,6 +1508,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
> * We track migrations using entity decay_count <= 0, on a wake-up
> * migration we use a negative decay count to track the remote decays
> * accumulated while sleeping.
> + *
> + * When enqueue a new forked task, the se->avg.decay_count == 0, so
> + * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
> + * value: se->load.weight.
> */
> if (unlikely(se->avg.decay_count <= 0)) {
> se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
>
Reviewed-by: Preeti U Murthy
Thanks
Regards
Preeti U Murthy
Hi Alex,
You can add my Reviewed-by for the below patch.
Thanks
Regards
Preeti U Murthy
On 04/04/2013 07:30 AM, Alex Shi wrote:
> The cpu's utilization is to measure how busy is the cpu.
> util = cpu_rq(cpu)->avg.runnable_avg_sum * SCHED_POEWR_SCALE
> / cpu_rq(cpu)->avg.runnable_avg_period;
>
> Since the util is no more than 1, we scale its value with 1024, same as
> SCHED_POWER_SCALE and set the FULL_UTIL as 1024.
>
> In later power aware scheduling, we are sensitive for how busy of the
> cpu. Since as to power consuming, it is tight related with cpu busy
> time.
>
> BTW, rq->util can be used for any purposes if needed, not only power
> scheduling.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> include/linux/sched.h | 2 +-
> kernel/sched/debug.c | 1 +
> kernel/sched/fair.c | 5 +++++
> kernel/sched/sched.h | 4 ++++
> 4 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5a4cf37..226a515 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -793,7 +793,7 @@ enum cpu_idle_type {
> #define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT)
>
> /*
> - * Increase resolution of cpu_power calculations
> + * Increase resolution of cpu_power and rq->util calculations
> */
> #define SCHED_POWER_SHIFT 10
> #define SCHED_POWER_SCALE (1L << SCHED_POWER_SHIFT)
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 75024a6..f5db759 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -311,6 +311,7 @@ do { \
>
> P(ttwu_count);
> P(ttwu_local);
> + P(util);
>
> #undef P
> #undef P64
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2e49c3f..7124244 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1495,8 +1495,13 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
>
> static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
> {
> + u32 period;
> __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
> __update_tg_runnable_avg(&rq->avg, &rq->cfs);
> +
> + period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
> + rq->util = (u64)(rq->avg.runnable_avg_sum << SCHED_POWER_SHIFT)
> + / period;
> }
>
> /* Add the load generated by se into cfs_rq's child load-average */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 804ee41..8682110 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -351,6 +351,9 @@ extern struct root_domain def_root_domain;
>
> #endif /* CONFIG_SMP */
>
> +/* full cpu utilization */
> +#define FULL_UTIL SCHED_POWER_SCALE
> +
> /*
> * This is the main, per-CPU runqueue data structure.
> *
> @@ -482,6 +485,7 @@ struct rq {
> #endif
>
> struct sched_avg avg;
> + unsigned int util;
> };
>
> static inline int cpu_of(struct rq *rq)
>
Hi Alex,
You can add my Reviewed-by for the below patch.
Thanks
Regards
Preeti U Murthy
On 04/04/2013 07:30 AM, Alex Shi wrote:
> In power aware scheduling, we don't want to balance 'prefer_sibling'
> groups just because local group has capacity.
> If the local group has no tasks at the time, that is the power
> balance hope so.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/fair.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0dd29f4..86221e7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4762,8 +4762,12 @@ static inline void update_sd_lb_stats(struct lb_env *env,
> * extra check prevents the case where you always pull from the
> * heaviest group when it is already under-utilized (possible
> * with a large weight task outweighs the tasks on the system).
> + *
> + * In power aware scheduling, we don't care load weight and
> + * want not to pull tasks just because local group has capacity.
> */
> - if (prefer_sibling && !local_group && sds->this_has_capacity)
> + if (prefer_sibling && !local_group && sds->this_has_capacity
> + && env->flags & LBF_PERF_BAL)
> sgs.group_capacity = min(sgs.group_capacity, 1UL);
>
> if (local_group) {
>
On 05/06/2013 11:26 AM, Preeti U Murthy wrote:
> Hi Alex,
>
> You can add my Reviewed-by for the below patch.
>
> Thanks
Thanks a lot for the review!
--
Thanks
Alex
[Apologies if threading mangled, all headers written by hand]
On 04/04/2013 07:30 AM, Alex Shi wrote:
> The cpu's utilization measures how busy the cpu is:
> util = cpu_rq(cpu)->avg.runnable_avg_sum * SCHED_POWER_SCALE
>        / cpu_rq(cpu)->avg.runnable_avg_period;
>
> Since the raw utilization is never more than 1, we scale its value by 1024,
> the same as SCHED_POWER_SCALE, and define FULL_UTIL as 1024.
>
> Later power aware scheduling is sensitive to how busy the cpu is, since
> power consumption is tightly related to cpu busy time.
>
> BTW, rq->util can be used for other purposes if needed, not only power
> scheduling.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> include/linux/sched.h | 2 +-
> kernel/sched/debug.c | 1 +
> kernel/sched/fair.c | 5 +++++
> kernel/sched/sched.h | 4 ++++
> 4 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5a4cf37..226a515 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -793,7 +793,7 @@ enum cpu_idle_type {
> #define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT)
>
> /*
> - * Increase resolution of cpu_power calculations
> + * Increase resolution of cpu_power and rq->util calculations
> */
> #define SCHED_POWER_SHIFT 10
> #define SCHED_POWER_SCALE (1L << SCHED_POWER_SHIFT)
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 75024a6..f5db759 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -311,6 +311,7 @@ do { \
>
> P(ttwu_count);
> P(ttwu_local);
> + P(util);
>
> #undef P
> #undef P64
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2e49c3f..7124244 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1495,8 +1495,13 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
>
> static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
> {
> + u32 period;
> __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
> __update_tg_runnable_avg(&rq->avg, &rq->cfs);
> +
> + period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
> + rq->util = (u64)(rq->avg.runnable_avg_sum << SCHED_POWER_SHIFT)
> + / period;
Greetings, Alex.
That cast achieves nothing where it is. If the shift has overflowed,
then you've already lost information; and if it can't overflow, then
it's not needed at all.
It's itsy-bitsy, but note that there exists a div_u64(u64 dividend,
u32 divisor) helper which may be implemented to be superior to just '/'.
(And also note that the assignment to ``period'' is a good candidate for
gcc's ``?:'' operator.)
If you pull the cast inside the brackets, then you may add my
Reviewed-by: Phil Carmody <[email protected]>
Cheers,
Phil
> }
>
> /* Add the load generated by se into cfs_rq's child load-average */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 804ee41..8682110 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -351,6 +351,9 @@ extern struct root_domain def_root_domain;
>
> #endif /* CONFIG_SMP */
>
> +/* full cpu utilization */
> +#define FULL_UTIL SCHED_POWER_SCALE
> +
> /*
> * This is the main, per-CPU runqueue data structure.
> *
> @@ -482,6 +485,7 @@ struct rq {
> #endif
>
> struct sched_avg avg;
> + unsigned int util;
> };
>
> static inline int cpu_of(struct rq *rq)
>
--
"In a world of magnets and miracles"
-- Insane Clown Posse, Miracles, 2009. Much derided.
"Magnets, how do they work"
-- Pink Floyd, High Hopes, 1994. Lauded as lyrical geniuses.
On 05/06/2013 08:03 PM, Phil Carmody wrote:
>> > + period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
>> > + rq->util = (u64)(rq->avg.runnable_avg_sum << SCHED_POWER_SHIFT)
>> > + / period;
> Greetings, Alex.
>
> That cast achieves nothing where it is. If the shift has overflowed,
> then you've already lost information; and if it can't overflow, then
> it's not needed at all.
>
> It's itsy-bitsy, but note that there exists a div_u64(u64 dividend,
> u32 divisor) helper which may be implemented to be superior to just '/'.
> (And also note that the assignment to ``period'' is a good candidate for
> gcc's ``?:'' operator.)
>
> If you pull the cast inside the brackets, then you may add my
> Reviewed-by: Phil Carmody <[email protected]>
You are right, Phil!
Thanks for the review! :)
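With your suggestions folded in, the hook would presumably end up looking
something like this (untested sketch, field names as in the patch above):

static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
{
	u32 period;

	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
	__update_tg_runnable_avg(&rq->avg, &rq->cfs);

	/* guard against dividing by zero before the first period elapses */
	period = rq->avg.runnable_avg_period ?: 1;

	/* widen the 32-bit sum *before* shifting so it cannot overflow */
	rq->util = div_u64((u64)rq->avg.runnable_avg_sum << SCHED_POWER_SHIFT,
			   period);
}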
--
Thanks
Alex
On Wed, Apr 3, 2013 at 7:00 PM, Alex Shi <[email protected]> wrote:
> The cpu's utilization measures how busy the cpu is:
> util = cpu_rq(cpu)->avg.runnable_avg_sum * SCHED_POWER_SCALE
>        / cpu_rq(cpu)->avg.runnable_avg_period;
>
> Since the raw utilization is never more than 1, we scale its value by 1024,
> the same as SCHED_POWER_SCALE, and define FULL_UTIL as 1024.
>
> Later power aware scheduling is sensitive to how busy the cpu is, since
> power consumption is tightly related to cpu busy time.
>
> BTW, rq->util can be used for other purposes if needed, not only power
> scheduling.
>
> Signed-off-by: Alex Shi <[email protected]>
Hmm, rather than adding another variable to struct rq and another
callsite where we open code runnable-scaling we should consider at
least adding a wrapper, e.g.
/* when to_scale is a load weight, callers must pass "scale_load(value)" */
static inline u32 scale_by_runnable_avg(struct sched_avg *avg, u32 to_scale)
{
	u32 result = avg->runnable_avg_sum * to_scale;

	result /= (avg->runnable_avg_period + 1);
	return result;
}
util can then just be scale_by_runnable_avg(&rq->avg, FULL_UTIL) and
if we don't need it frequently, it's now simple enough that we don't
need to cache it.
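Concretely, the fair.c hook could then shrink to something like the
following (sketch only, assuming the wrapper above):

static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
{
	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
	__update_tg_runnable_avg(&rq->avg, &rq->cfs);

	/* utilization scaled to FULL_UTIL (1024) via the shared wrapper */
	rq->util = scale_by_runnable_avg(&rq->avg, FULL_UTIL);
}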
> ---
> include/linux/sched.h | 2 +-
> kernel/sched/debug.c | 1 +
> kernel/sched/fair.c | 5 +++++
> kernel/sched/sched.h | 4 ++++
> 4 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5a4cf37..226a515 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -793,7 +793,7 @@ enum cpu_idle_type {
> #define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT)
>
> /*
> - * Increase resolution of cpu_power calculations
> + * Increase resolution of cpu_power and rq->util calculations
> */
> #define SCHED_POWER_SHIFT 10
> #define SCHED_POWER_SCALE (1L << SCHED_POWER_SHIFT)
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 75024a6..f5db759 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -311,6 +311,7 @@ do { \
>
> P(ttwu_count);
> P(ttwu_local);
> + P(util);
>
> #undef P
> #undef P64
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2e49c3f..7124244 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1495,8 +1495,13 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
>
> static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
> {
> + u32 period;
> __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
> __update_tg_runnable_avg(&rq->avg, &rq->cfs);
> +
> + period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
> + rq->util = (u64)(rq->avg.runnable_avg_sum << SCHED_POWER_SHIFT)
> + / period;
> }
>
> /* Add the load generated by se into cfs_rq's child load-average */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 804ee41..8682110 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -351,6 +351,9 @@ extern struct root_domain def_root_domain;
>
> #endif /* CONFIG_SMP */
>
> +/* full cpu utilization */
> +#define FULL_UTIL SCHED_POWER_SCALE
> +
> /*
> * This is the main, per-CPU runqueue data structure.
> *
> @@ -482,6 +485,7 @@ struct rq {
> #endif
>
> struct sched_avg avg;
> + unsigned int util;
> };
>
> static inline int cpu_of(struct rq *rq)
> --
> 1.7.12
>
On 04/30/2013 03:26 PM, Mike Galbraith wrote:
> On Tue, 2013-04-30 at 11:49 +0200, Mike Galbraith wrote:
>> On Tue, 2013-04-30 at 11:35 +0200, Mike Galbraith wrote:
>>> On Tue, 2013-04-30 at 10:41 +0200, Ingo Molnar wrote:
>>
>>>> Which are the workloads where 'powersaving' mode hurts workload
>>>> performance measurably?
I ran ebizzy on a 2 socket, 16 core, SMT 4 Power machine.
The power efficiency drops significantly with the powersaving policy of
this patch, compared to that of the scheduler without this patch.
The below parameters are measured relative to the default scheduler
behaviour.
A: Drop in power efficiency with the patch+powersaving policy
B: Drop in performance with the patch+powersaving policy
C: Decrease in power consumption with the patch+powersaving policy
NumThreads        A        B        C
-----------------------------------------
 2               33%      36%      4%
 4               31%      33%      3%
 8               28%      30%      3%
16               31%      33%      4%
Each of the above runs is for 30s.
On investigating socket utilization, I found that only 1 socket was being
used during all the above threaded runs. As can be guessed, this is due
to the group_weight being considered for the threshold metric.
This stacks up tasks on a core and further on a socket, thus throttling
them, as observed by Mike below.
I therefore think we must switch to group_capacity as the metric for the
threshold and use only (rq->util * nr_running) for the group_utils
calculation during non-bursty wakeup scenarios.
This way we are comparing right; the utilization of the runqueue by the
fair tasks and the cpu capacity available for them after being consumed
by the rt tasks.
After I made the above modification, all three of the above parameters came
out to be nearly zero. However, I am observing the load balancing of the
scheduler with the patch and powersavings policy enabled. It is behaving
very close to the default scheduler (spreading tasks across sockets).
That also explains why there is no performance drop or gain with the
patch+powersavings policy enabled. I will look into this observation and
report back.
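A rough sketch of the kind of check I mean (the helper and its name are only
illustrative, not the patch's actual code):

/*
 * Keep packing into a group only while the summed utilization of its
 * runqueues fits within the group's cpu_power based capacity, instead of
 * comparing against group_weight (the raw count of logical cpus).
 */
static inline bool group_under_threshold(struct sched_group *group,
					 unsigned int group_util)
{
	/* sgp->power and util share the SCHED_POWER_SCALE unit, so they
	 * can be compared directly */
	return group_util < group->sgp->power;
}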
>>>
>>> Well, it'll lose throughput any time there's parallel execution
>>> potential but it's serialized instead.. using average will inevitably
>>> stack tasks sometimes, but that's its goal. Hackbench shows it.
>>
>> (but that consolidation can be a winner too, and I bet a nickle it would
>> be for a socket sized pgbench run)
>
> (belay that, was thinking of keeping all tasks on a single node, but
> it'll likely stack the whole thing on a CPU or two, if so, it'll hurt)
At this point, I would like to raise one issue.
*Is the goal of the power aware scheduler to improve the power efficiency
of the scheduler, or to accept a compromise on power efficiency in return
for a definite decrease in power consumption, since it is the user who has
decided to prioritise lower power consumption over performance*?
>
Regards
Preeti U Murthy
>>>>> Which are the workloads where 'powersaving' mode hurts workload
>>>>> performance measurably?
>
> I ran ebizzy on a 2 socket, 16 core, SMT 4 Power machine.
Is this a 2 * 16 * 4 LCPUs PowerPC machine?
> The power efficiency drops significantly with the powersaving policy of
> this patch, compared to that of the scheduler without this patch.
>
> The below parameters are measured relative to the default scheduler
> behaviour.
>
> A: Drop in power efficiency with the patch+powersaving policy
> B: Drop in performance with the patch+powersaving policy
> C: Decrease in power consumption with the patch+powersaving policy
>
> NumThreads        A        B        C
> -----------------------------------------
>  2               33%      36%      4%
>  4               31%      33%      3%
>  8               28%      30%      3%
> 16               31%      33%      4%
>
> Each of the above runs is for 30s.
>
> On investigating socket utilization, I found that only 1 socket was being
> used during all the above threaded runs. As can be guessed, this is due
> to the group_weight being considered for the threshold metric.
> This stacks up tasks on a core and further on a socket, thus throttling
> them, as observed by Mike below.
>
> I therefore think we must switch to group_capacity as the metric for the
> threshold and use only (rq->util * nr_running) for the group_utils
> calculation during non-bursty wakeup scenarios.
> This way we are comparing right; the utilization of the runqueue by the
> fair tasks and the cpu capacity available for them after being consumed
> by the rt tasks.
>
> After I made the above modification, all three of the above parameters came
> out to be nearly zero. However, I am observing the load balancing of the
> scheduler with the patch and powersavings policy enabled. It is behaving
> very close to the default scheduler (spreading tasks across sockets).
> That also explains why there is no performance drop or gain with the
> patch+powersavings policy enabled. I will look into this observation and
> report back.
Thanks a lot for the great testing!
It seems packing one task per SMT cpu isn't power efficient.
I got a similar result last week. I tested fspin (an endless-calculation
benchmark in the linux-next tree): when I bound one task per SMT cpu, the
power efficiency really dropped for almost every thread count, but when I
bound one task per core, it had better power efficiency for all thread
counts.
Besides moving tasks depending on group_capacity, another choice is to
balance tasks according to cpu_power. I have already made that change in
code, but it needs to go through an internal open source process before I
can publish it.
>
>>>>
>>>> Well, it'll lose throughput any time there's parallel execution
>>>> potential but it's serialized instead.. using average will inevitably
>>>> stack tasks sometimes, but that's its goal. Hackbench shows it.
>>>
>>> (but that consolidation can be a winner too, and I bet a nickle it would
>>> be for a socket sized pgbench run)
>>
>> (belay that, was thinking of keeping all tasks on a single node, but
>> it'll likely stack the whole thing on a CPU or two, if so, it'll hurt)
>
> At this point, I would like to raise one issue.
> *Is the goal of the power aware scheduler to improve the power efficiency
> of the scheduler, or to accept a compromise on power efficiency in return
> for a definite decrease in power consumption, since it is the user who has
> decided to prioritise lower power consumption over performance*?
>
It could be one reason for this feature, but I would like to
make it more efficient, e.g. by packing tasks according to cpu_power
rather than the current group_weight.
>>
>
> Regards
> Preeti U Murthy
>
--
Thanks
Alex
Hi Alex,
On 05/20/2013 06:31 AM, Alex Shi wrote:
>
>>>>>> Which are the workloads where 'powersaving' mode hurts workload
>>>>>> performance measurably?
>>
>> I ran ebizzy on a 2 socket, 16 core, SMT 4 Power machine.
>
> Is this a 2 * 16 * 4 LCPUs PowerPC machine?
This is a 2 * 8 * 4 LCPUs PowerPC machine.
>> The power efficiency drops significantly with the powersaving policy of
>> this patch, compared to that of the scheduler without this patch.
>>
>> The below parameters are measured relative to the default scheduler
>> behaviour.
>>
>> A: Drop in power efficiency with the patch+powersaving policy
>> B: Drop in performance with the patch+powersaving policy
>> C: Decrease in power consumption with the patch+powersaving policy
>>
>> NumThreads        A        B        C
>> -----------------------------------------
>>  2               33%      36%      4%
>>  4               31%      33%      3%
>>  8               28%      30%      3%
>> 16               31%      33%      4%
>>
>> Each of the above runs is for 30s.
>>
>> On investigating socket utilization, I found that only 1 socket was being
>> used during all the above threaded runs. As can be guessed, this is due
>> to the group_weight being considered for the threshold metric.
>> This stacks up tasks on a core and further on a socket, thus throttling
>> them, as observed by Mike below.
>>
>> I therefore think we must switch to group_capacity as the metric for the
>> threshold and use only (rq->util * nr_running) for the group_utils
>> calculation during non-bursty wakeup scenarios.
>> This way we are comparing right; the utilization of the runqueue by the
>> fair tasks and the cpu capacity available for them after being consumed
>> by the rt tasks.
>>
>> After I made the above modification, all three of the above parameters came
>> out to be nearly zero. However, I am observing the load balancing of the
>> scheduler with the patch and powersavings policy enabled. It is behaving
>> very close to the default scheduler (spreading tasks across sockets).
>> That also explains why there is no performance drop or gain with the
>> patch+powersavings policy enabled. I will look into this observation and
>> report back.
>
> Thanks a lot for the great testing!
> It seems packing one task per SMT cpu isn't power efficient.
> I got a similar result last week. I tested fspin (an endless-calculation
> benchmark in the linux-next tree): when I bound one task per SMT cpu, the
> power efficiency really dropped for almost every thread count, but when I
> bound one task per core, it had better power efficiency for all thread
> counts.
> Besides moving tasks depending on group_capacity, another choice is to
> balance tasks according to cpu_power. I have already made that change in
> code, but it needs to go through an internal open source process before I
> can publish it.
What do you mean by *another* choice being to balance tasks according to
cpu_power? group_capacity is based on cpu_power.
Also, your balance policy in v6 was doing the same, right? It was rightly
comparing rq->util * nr_running against cpu_power. Why not simply
switch to that code for power policy load balancing?
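That is, something along these lines (purely illustrative sketch of the
comparison, not the v6 code itself):

/* v6-style idea: does this cpu still have headroom for its runnable tasks? */
static inline bool cpu_has_headroom(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	/* rq->cpu_power already excludes capacity consumed by rt/irq work */
	return rq->util * rq->nr_running < rq->cpu_power;
}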
>>>>> Well, it'll lose throughput any time there's parallel execution
>>>>> potential but it's serialized instead.. using average will inevitably
>>>>> stack tasks sometimes, but that's its goal. Hackbench shows it.
>>>>
>>>> (but that consolidation can be a winner too, and I bet a nickle it would
>>>> be for a socket sized pgbench run)
>>>
>>> (belay that, was thinking of keeping all tasks on a single node, but
>>> it'll likely stack the whole thing on a CPU or two, if so, it'll hurt)
>>
>> At this point, I would like to raise one issue.
>> *Is the goal of the power aware scheduler to improve the power efficiency
>> of the scheduler, or to accept a compromise on power efficiency in return
>> for a definite decrease in power consumption, since it is the user who has
>> decided to prioritise lower power consumption over performance*?
>>
>
> It could be one reason for this feature, but I would like to
> make it more efficient, e.g. by packing tasks according to cpu_power
> rather than the current group_weight.
Yes, we could try the patch using group_capacity and observe the results
for power efficiency, before we decide to compromise on power efficiency
for a decrease in power consumption.
Regards
Preeti U Murthy