Hi,
I have revised the previous power scheduler proposal[1] trying to address as
many of the comments as possible. The overall idea was discussed at LPC[2,3].
The revised design has removed the power scheduler and replaced it with a high
level power driver interface. An interface that allows the scheduler to query
the power driver for information and provide hints to guide power management
decisions in the power driver.
The power driver is going to be a unified platform power driver that can
replace cpufreq and cpuidle drivers. Generic power policies will be optional
helper functions called from the power driver. Platforms may choose to
implement their own policies as part of their power driver.
This RFC series prototypes a part of the power driver interface (cpu capacity
hints) and shows how they can be used from the scheduler. More extensive use of
the power driver hints and queries is left for later. The focus for now is the
power driver interface. The patch series includes a power driver/cpufreq
governor that can use existing cpufreq drivers as backend. It has been tested
(not thoroughly) on ARM TC2. The cpufreq governor power driver implementation
is rather horrible, but it illustrates how the power driver interface can be
used. Native power drivers is on the todo list.
The power driver interface is still missing quite a few calls to handle: Idle,
adding extra information to the sched_domain hierarchy to guide scheduling
decisions (packing), and possibly scaling of tracked load to compensate for
frequency changes and asymmetric systems (big.LITTLE).
This set is based on 3.11. I have done ARM TC2 testing based on linux-linaro
2013.08[4] to get cpufreq support for TC2.
Morten
[1] https://lkml.org/lkml/2013/7/9/314
[2] http://etherpad.osuosl.org/lpc2013-power-efficient-scheduling
[3] http://www.linuxplumbersconf.org/2013/ocw//system/presentations/1263/original/Unifying_Power_Policies_LPC.pdf
[4] http://git.linaro.org/gitweb?p=kernel/linux-linaro-tracking.git
Morten Rasmussen (7):
Initial power driver interface infrastructure
sched: power: Power driver late callback interface
sched: power: go_faster/slower power driver hints
sched: power: Remove power capacity hints for kworker threads
sched: power: Increase cpu capacity based on rq tracked load
sched: power: cpufreq: Initial schedpower cpufreq governor/power
driver
sched: power: Let the power driver choose the best wake-up cpu
arch/arm/Kconfig | 4 +
drivers/cpufreq/Kconfig | 11 ++
drivers/cpufreq/Makefile | 1 +
drivers/cpufreq/cpufreq_schedpower.c | 218 ++++++++++++++++++++++++++++++++++
include/linux/sched/power.h | 37 ++++++
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 53 ++++++++-
kernel/sched/power.c | 73 ++++++++++++
kernel/sched/sched.h | 32 +++++
10 files changed, 430 insertions(+), 1 deletion(-)
create mode 100644 drivers/cpufreq/cpufreq_schedpower.c
create mode 100644 include/linux/sched/power.h
create mode 100644 kernel/sched/power.c
--
1.7.9.5
Adds a late callback to the power driver interface which is called
with irq enabled and no locks held. The power driver may postpone
work that can't be done in schedule context (irq disabled and rq
locks held) and let the late callback handle it.
Signed-off-by: Morten Rasmussen <[email protected]>
---
include/linux/sched/power.h | 3 +++
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 2 ++
kernel/sched/power.c | 6 ++++++
kernel/sched/sched.h | 8 ++++++++
5 files changed, 20 insertions(+)
diff --git a/include/linux/sched/power.h b/include/linux/sched/power.h
index cb2cf37..aba5f47 100644
--- a/include/linux/sched/power.h
+++ b/include/linux/sched/power.h
@@ -25,6 +25,9 @@ struct power_driver {
int (*go_faster) (int cpu, int hint);
/* Decrease cpu capacity hint */
int (*go_slower) (int cpu, int hint);
+
+ /* Scheduler call-back without rq lock held and with irq enabled */
+ void (*late_callback) (int cpu);
};
int power_driver_register(struct power_driver *driver);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 05c39f0..12b4753 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1913,6 +1913,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
put_task_struct(prev);
}
+ power_late_callback(raw_smp_processor_id());
tick_nohz_task_switch(current);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 68f1609..edba1bb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5742,6 +5742,8 @@ static void run_rebalance_domains(struct softirq_action *h)
* stopped.
*/
nohz_idle_balance(this_cpu, idle);
+
+ power_late_callback(this_cpu);
}
static inline int on_null_domain(int cpu)
diff --git a/kernel/sched/power.c b/kernel/sched/power.c
index b82965a..f70b563 100644
--- a/kernel/sched/power.c
+++ b/kernel/sched/power.c
@@ -56,3 +56,9 @@ int go_slower(int cpu, int hint)
return power_driver->go_slower(cpu, hint);
}
+void power_late_callback(int cpu)
+{
+ if (!power_driver)
+ return;
+ power_driver->late_callback(cpu);
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ef0a7b2..907a967 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1364,3 +1364,11 @@ static inline u64 irq_time_read(int cpu)
}
#endif /* CONFIG_64BIT */
#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
+#ifdef CONFIG_SCHED_POWER
+extern void power_late_callback(int cpu);
+#else
+static inline void power_late_callback(int cpu)
+{
+}
+#endif /* CONFIG_SCHED_POWER */
--
1.7.9.5
weighted_cpuload() gives the sum of priority scaled task loads. One
NICE 0 task is not be enough load to trigger a cpu capacity increase
when cpu_power > 1024. The rq->avg.runnable_avg_sum tracks the actual
busy time of the cpu and is therefore a more accurate indication of
the cpu load. If the rq tracked load is high the cpu is busy even if
the weighted_cpuload() should indicate otherwise.
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea8d973..ee87e65 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2861,7 +2861,17 @@ static inline void inc_cpu_capacity(int cpu, struct task_struct *p)
if (p && p->flags & PF_WQ_WORKER)
return;
- if (weighted_cpuload(cpu) > power_of(cpu))
+ if (weighted_cpuload(cpu) > power_of(cpu)) {
+ go_faster(cpu, 0);
+ return;
+ }
+
+ /*
+ * Go faster if the cpu is 80% busy. This threshold should
+ * be set by the power driver later.
+ */
+ if (cpu_rq(cpu)->avg.runnable_avg_sum * 100 >
+ 80 * cpu_rq(cpu)->avg.runnable_avg_period)
go_faster(cpu, 0);
}
--
1.7.9.5
Adds best_wake_cpu() to the power driver interface allowing the power
driver to influence which cpu will be woken up when additional cpus
are needed. Currently only a place holder until idle interface is in
place.
Signed-off-by: Morten Rasmussen <[email protected]>
---
drivers/cpufreq/cpufreq_schedpower.c | 11 +++++++++++
include/linux/sched/power.h | 2 ++
kernel/sched/fair.c | 5 ++++-
kernel/sched/power.c | 9 +++++++++
kernel/sched/sched.h | 6 ++++++
5 files changed, 32 insertions(+), 1 deletion(-)
diff --git a/drivers/cpufreq/cpufreq_schedpower.c b/drivers/cpufreq/cpufreq_schedpower.c
index 5952c79..62847ca 100644
--- a/drivers/cpufreq/cpufreq_schedpower.c
+++ b/drivers/cpufreq/cpufreq_schedpower.c
@@ -187,6 +187,16 @@ int pdriver_go_slower(int cpu, int hint)
return 1;
}
+int pdriver_best_wake_cpu(void)
+{
+ /*
+ * Place holder. Should ideally determine the best cpu to wake up
+ * when more cpus are needed. This may be based on coupled C-states,
+ * power domains, and other platform specific factors.
+ */
+ return nr_cpu_ids;
+}
+
void pdriver_late_callback(int cpu)
{
struct cpufreq_schedpower_cpuinfo *pcpu_info;
@@ -202,6 +212,7 @@ static struct power_driver pdriver = {
.at_max_capacity = pdriver_at_max_capacity,
.go_faster = pdriver_go_faster,
.go_slower = pdriver_go_slower,
+ .best_wake_cpu = pdriver_best_wake_cpu,
.late_callback = pdriver_late_callback,
};
diff --git a/include/linux/sched/power.h b/include/linux/sched/power.h
index aba5f47..5d76f44 100644
--- a/include/linux/sched/power.h
+++ b/include/linux/sched/power.h
@@ -25,6 +25,8 @@ struct power_driver {
int (*go_faster) (int cpu, int hint);
/* Decrease cpu capacity hint */
int (*go_slower) (int cpu, int hint);
+ /* Best cpu to wake up */
+ int (*best_wake_cpu) (void);
/* Scheduler call-back without rq lock held and with irq enabled */
void (*late_callback) (int cpu);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee87e65..d32fb97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5458,7 +5458,10 @@ static struct {
static inline int find_new_ilb(int call_cpu)
{
- int ilb = cpumask_first(nohz.idle_cpus_mask);
+ int ilb = best_wake_cpu();
+
+ if (ilb >= nr_cpu_ids)
+ ilb = cpumask_first(nohz.idle_cpus_mask);
if (ilb < nr_cpu_ids && idle_cpu(ilb))
return ilb;
diff --git a/kernel/sched/power.c b/kernel/sched/power.c
index f70b563..e1357d2 100644
--- a/kernel/sched/power.c
+++ b/kernel/sched/power.c
@@ -11,6 +11,7 @@
*/
#include <linux/kernel.h>
+#include <linux/cpumask.h>
#include <linux/sched/power.h>
static struct power_driver *power_driver;
@@ -56,6 +57,14 @@ int go_slower(int cpu, int hint)
return power_driver->go_slower(cpu, hint);
}
+int best_wake_cpu(void)
+{
+ if (!power_driver)
+ return nr_cpu_ids;
+
+ return power_driver->best_wake_cpu();
+}
+
void power_late_callback(int cpu)
{
if (!power_driver)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 88e7968..a368734 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1370,6 +1370,7 @@ extern void power_late_callback(int cpu);
extern int at_max_capacity(int cpu);
extern int go_faster(int cpu, int hint);
extern int go_slower(int cpu, int hint);
+extern int best_wake_cpu(void);
#else
static inline void power_late_callback(int cpu)
{
@@ -1389,4 +1390,9 @@ static inline int go_slower(int cpu, int hint)
{
return 0;
}
+
+static inline int best_wake_cpu(void)
+{
+ return nr_cpu_ids;
+}
#endif /* CONFIG_SCHED_POWER */
--
1.7.9.5
Adds a 'schedpower' cpufreq governor that acts as a power driver to
cpufreq wrapper. This enables the existing cpufreq drivers to be used
as power driver backends initially until native power drivers have been
implemented.
schedpower currently uses workqueues as a horrible work-around for calling
cpufreq from the late_callback() path. Calling cpufreq from the
late_callback() in its current form is not possible and certainly not
possible from the scheduler context. Suggestions for better solutions
are very welcome.
Native power driver implemented with the locking and context limitations
in mind should be able to avoid such work-arounds.
schedpower has been tested (not thoroughly) on ARM TC2.
Signed-off-by: Morten Rasmussen <[email protected]>
---
drivers/cpufreq/Kconfig | 11 ++
drivers/cpufreq/Makefile | 1 +
drivers/cpufreq/cpufreq_schedpower.c | 207 ++++++++++++++++++++++++++++++++++
3 files changed, 219 insertions(+)
create mode 100644 drivers/cpufreq/cpufreq_schedpower.c
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 534fcb8..d832e34 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -184,6 +184,17 @@ config CPU_FREQ_GOV_CONSERVATIVE
If in doubt, say N.
+config CPU_FREQ_GOV_SCHEDPOWER
+ bool "'schedpower' governor/power driver"
+ depends on CPU_FREQ
+ depends on SCHED_POWER
+ help
+ 'schedpower' - this governor allows existing cpufreq drivers to be
+ used as power driver backend. The governor registers itself as a
+ power driver with the scheduler and uses the existing cpufreq framework
+ and drivers to do the actual frequency changes. Frequency selection is
+ based on scheduler hints provided by the power driver interface.
+
config GENERIC_CPUFREQ_CPU0
tristate "Generic CPU0 cpufreq driver"
depends on HAVE_CLK && REGULATOR && PM_OPP && OF
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index d345b5a..e00a17c 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_POWERSAVE) += cpufreq_powersave.o
obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE) += cpufreq_userspace.o
obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND) += cpufreq_ondemand.o
obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE) += cpufreq_conservative.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDPOWER) += cpufreq_schedpower.o
obj-$(CONFIG_CPU_FREQ_GOV_COMMON) += cpufreq_governor.o
# CPUfreq cross-arch helpers
diff --git a/drivers/cpufreq/cpufreq_schedpower.c b/drivers/cpufreq/cpufreq_schedpower.c
new file mode 100644
index 0000000..5952c79
--- /dev/null
+++ b/drivers/cpufreq/cpufreq_schedpower.c
@@ -0,0 +1,207 @@
+/*
+ * schedpower cpufreq governor/power driver
+ *
+ * drivers/cpufreq/cpufreq_schedpower.c
+ *
+ * Copyright (C) 2013 ARM Limited.
+ * Author: Morten Rasmussen <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/cpufreq.h>
+#include <linux/init.h>
+#include <linux/percpu.h>
+#include <linux/workqueue.h>
+#include <linux/slab.h>
+#include <linux/hrtimer.h>
+#include <linux/sched/power.h>
+
+struct cpufreq_schedpower_cpuinfo {
+ struct cpufreq_policy *policy;
+ struct work_struct work;
+ unsigned int target_freq;
+ u64 last_change;
+ int cpufreq_call_needed;
+ int governor_enabled;
+};
+
+DEFINE_PER_CPU(struct cpufreq_schedpower_cpuinfo, cpuinfo);
+
+struct cpufreq_driver_data {
+ struct work_struct work;
+ struct cpufreq_policy *policy;
+ unsigned int target_freq;
+};
+
+static struct power_driver pdriver;
+
+static int cpufreq_governor_schedpower(struct cpufreq_policy *policy,
+ unsigned int event)
+{
+ int i;
+ struct cpufreq_schedpower_cpuinfo *pcpu_info;
+
+ switch (event) {
+ case CPUFREQ_GOV_START:
+ case CPUFREQ_GOV_LIMITS:
+ __cpufreq_driver_target(policy, policy->max,
+ CPUFREQ_RELATION_H);
+
+ for_each_cpu(i, policy->cpus) {
+ pcpu_info = &per_cpu(cpuinfo, i);
+ pcpu_info->policy = policy;
+ pcpu_info->last_change = ktime_to_us(ktime_get());
+ pcpu_info->cpufreq_call_needed = 0;
+ pcpu_info->governor_enabled = 1;
+ }
+
+ power_driver_register(&pdriver);
+ break;
+ default:
+ break;
+ }
+
+ return 0;
+}
+
+static
+struct cpufreq_governor cpufreq_gov_schedpower = {
+ .name = "schedpower",
+ .governor = cpufreq_governor_schedpower,
+ .owner = THIS_MODULE,
+};
+
+static int __init cpufreq_gov_schedpower_init(void)
+{
+ return cpufreq_register_governor(&cpufreq_gov_schedpower);
+}
+late_initcall(cpufreq_gov_schedpower_init);
+
+#define FREQ_STEP 50 /* % */
+#define CALL_RATE 1000 /* us */
+
+/*
+ * cpufreq_driver_call: Workqueue worker function that calls to cpufreq.
+ * More details at queue_cpufreq_driver_call
+ */
+static void cpufreq_driver_call(struct work_struct *work)
+{
+ struct cpufreq_driver_data *call_data =
+ (struct cpufreq_driver_data *)work;
+
+ cpufreq_driver_target(call_data->policy, call_data->target_freq,
+ CPUFREQ_RELATION_H);
+ kfree((void *)call_data);
+}
+
+/*
+ * queue_cpufreq_driver_call: cpufreq can't be called from the schedule()
+ * context with rq locks held and irqs disabled. Using workqueues to do the
+ * actual call to cpufreq should solve that problem. But work cannot be queued
+ * with the irq disabled and rq locks held. So this must be postponed to the
+ * late callback.
+ *
+ * Using workqueues is not ideal as it will schedule the kworker task before
+ * the task we actually want to run. To avoid getting power hints for the
+ * kworker and overriding the power hints for the user task, kthreads are
+ * filtered out in fair.c.
+ */
+static void queue_cpufreq_driver_call(int cpu,
+ struct cpufreq_schedpower_cpuinfo *pcpu_info)
+{
+ struct cpufreq_driver_data *call_data;
+ u64 now = ktime_to_us(ktime_get());
+
+ if (now - pcpu_info->last_change < CALL_RATE)
+ return;
+
+ call_data = kmalloc(sizeof(struct cpufreq_driver_data), GFP_KERNEL);
+
+ if (call_data) {
+ INIT_WORK((struct work_struct *)call_data, cpufreq_driver_call);
+ call_data->policy = pcpu_info->policy;
+ call_data->target_freq = pcpu_info->target_freq;
+ schedule_work_on(cpu, (struct work_struct *)call_data);
+ pcpu_info->last_change = now;
+ }
+}
+
+int pdriver_at_max_capacity(int cpu)
+{
+ struct cpufreq_schedpower_cpuinfo *pcpu_info;
+ pcpu_info = &per_cpu(cpuinfo, cpu);
+
+ return (pcpu_info->policy->cur >= pcpu_info->policy->max);
+}
+
+int pdriver_go_faster(int cpu, int hint)
+{
+ struct cpufreq_schedpower_cpuinfo *pcpu_info;
+ pcpu_info = &per_cpu(cpuinfo, cpu);
+
+ if (!pcpu_info->governor_enabled)
+ return 0;
+
+ if (pcpu_info->policy->cur >= pcpu_info->policy->max)
+ return 0;
+
+ pcpu_info->target_freq = min(((100+FREQ_STEP)
+ *pcpu_info->policy->cur)/100, pcpu_info->policy->max);
+
+ pcpu_info->cpufreq_call_needed = 1;
+ return 1;
+}
+
+int pdriver_go_slower(int cpu, int hint)
+{
+ unsigned int other_freq, max_freq = 0;
+ struct cpufreq_schedpower_cpuinfo *pcpu_info, *other_cpu;
+ int i;
+
+ pcpu_info = &per_cpu(cpuinfo, cpu);
+
+ if (!pcpu_info->governor_enabled)
+ return 0;
+
+ if (pcpu_info->policy->cur <= pcpu_info->policy->min)
+ return 0;
+
+ pcpu_info->target_freq = max(((100-FREQ_STEP)
+ *pcpu_info->policy->cur)/100, pcpu_info->policy->min);
+
+ for_each_cpu(i, pcpu_info->policy->cpus) {
+ other_cpu = &per_cpu(cpuinfo, i);
+ other_freq = other_cpu->target_freq;
+ max_freq = max(other_freq, max_freq);
+ }
+
+ if (max_freq >= pcpu_info->policy->cur)
+ return 0;
+
+ pcpu_info->cpufreq_call_needed = 1;
+ return 1;
+}
+
+void pdriver_late_callback(int cpu)
+{
+ struct cpufreq_schedpower_cpuinfo *pcpu_info;
+ pcpu_info = &per_cpu(cpuinfo, cpu);
+
+ if (pcpu_info->cpufreq_call_needed) {
+ queue_cpufreq_driver_call(cpu, pcpu_info);
+ pcpu_info->cpufreq_call_needed = 0;
+ }
+}
+
+static struct power_driver pdriver = {
+ .at_max_capacity = pdriver_at_max_capacity,
+ .go_faster = pdriver_go_faster,
+ .go_slower = pdriver_go_slower,
+ .late_callback = pdriver_late_callback,
+};
+
--
1.7.9.5
Removing power hints for kworker threads enables easier use of
workqueues in the power driver late callback. That would otherwise
lead to an endless loop unless it is prevented in the power driver.
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 50ae0ce..ea8d973 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2856,23 +2856,29 @@ static inline void hrtick_update(struct rq *rq)
static unsigned long weighted_cpuload(const int cpu);
static unsigned long power_of(int cpu);
-static inline void inc_cpu_capacity(int cpu)
+static inline void inc_cpu_capacity(int cpu, struct task_struct *p)
{
+ if (p && p->flags & PF_WQ_WORKER)
+ return;
+
if (weighted_cpuload(cpu) > power_of(cpu))
go_faster(cpu, 0);
}
-static inline void dec_cpu_capacity(int cpu)
+static inline void dec_cpu_capacity(int cpu, struct task_struct *p)
{
+ if (p && p->flags & PF_WQ_WORKER)
+ return;
+
if (weighted_cpuload(cpu) < power_of(cpu))
go_slower(cpu, 0);
}
#else
-static inline void inc_cpu_capacity(int cpu)
+static inline void inc_cpu_capacity(int cpu, struct task_struct *p)
{
}
-static inline void dec_cpu_capacity(int cpu)
+static inline void dec_cpu_capacity(int cpu, struct task_struct *p)
{
}
#endif
@@ -2923,7 +2929,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
inc_nr_running(rq);
}
- inc_cpu_capacity(task_cpu(p));
+ inc_cpu_capacity(task_cpu(p), p);
hrtick_update(rq);
}
@@ -2986,7 +2992,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_rq_runnable_avg(rq, 1);
}
- dec_cpu_capacity(task_cpu(p));
+ dec_cpu_capacity(task_cpu(p), p);
hrtick_update(rq);
}
@@ -5772,7 +5778,7 @@ static void run_rebalance_domains(struct softirq_action *h)
*/
nohz_idle_balance(this_cpu, idle);
- inc_cpu_capacity(this_cpu);
+ inc_cpu_capacity(this_cpu, NULL);
power_late_callback(this_cpu);
}
--
1.7.9.5
Infrastructure for power driver registration and defines three initial
power driver interface calls: go_faster(), go_slower(), and
at_max_capacity(). The intention is to use these to guide scheduling
decisions and frequency state selection in the power driver.
Signed-off-by: Morten Rasmussen <[email protected]>
---
arch/arm/Kconfig | 4 +++
include/linux/sched/power.h | 32 ++++++++++++++++++++++++
kernel/sched/Makefile | 1 +
kernel/sched/power.c | 58 +++++++++++++++++++++++++++++++++++++++++++
4 files changed, 95 insertions(+)
create mode 100644 include/linux/sched/power.h
create mode 100644 kernel/sched/power.c
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 43594d5..763d147 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1844,6 +1844,10 @@ config XEN
help
Say Y if you want to run Linux in a Virtual Machine on Xen on ARM.
+config SCHED_POWER
+ bool "(EXPERIMENTAL) Power aware scheduling"
+ default n
+
endmenu
menu "Boot options"
diff --git a/include/linux/sched/power.h b/include/linux/sched/power.h
new file mode 100644
index 0000000..cb2cf37
--- /dev/null
+++ b/include/linux/sched/power.h
@@ -0,0 +1,32 @@
+/*
+ * include/linux/sched/power.h
+ *
+ * Copyright (C) 2013 ARM Limited.
+ * Author: Morten Rasmussen <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef _LINUX_SCHED_POWER_H
+#define _LINUX_SCHED_POWER_H
+
+struct power_driver {
+
+ /*
+ * Power driver calls may happen from scheduler context with irq
+ * disabled and rq locks held. This must be taken into account in the
+ * power driver.
+ */
+
+ /* cpu already at max capacity? */
+ int (*at_max_capacity) (int cpu);
+ /* Increase cpu capacity hint */
+ int (*go_faster) (int cpu, int hint);
+ /* Decrease cpu capacity hint */
+ int (*go_slower) (int cpu, int hint);
+};
+
+int power_driver_register(struct power_driver *driver);
+int power_driver_unregister(struct power_driver *driver);
+#endif
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 54adcf3..55727b4 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -17,3 +17,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_SCHED_POWER) += power.o
diff --git a/kernel/sched/power.c b/kernel/sched/power.c
new file mode 100644
index 0000000..b82965a
--- /dev/null
+++ b/kernel/sched/power.c
@@ -0,0 +1,58 @@
+/*
+ * kernel/sched/power.c
+ *
+ * Copyright (C) 2013 ARM Limited.
+ * Author: Morten Rasmussen <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched/power.h>
+
+static struct power_driver *power_driver;
+
+int power_driver_register(struct power_driver *driver)
+{
+ power_driver = driver;
+
+ return 1;
+}
+
+int power_driver_unregister(struct power_driver *driver)
+{
+ power_driver = NULL;
+
+ return 1;
+}
+
+int at_max_capacity(int cpu)
+{
+ /* Assume no performance scaling when no driver */
+ if (!power_driver)
+ return 1;
+
+ return power_driver->at_max_capacity(cpu);
+}
+
+int go_faster(int cpu, int hint)
+{
+ /* Assume no performance scaling when no driver */
+ if (!power_driver)
+ return 0;
+
+ return power_driver->go_faster(cpu, hint);
+}
+
+int go_slower(int cpu, int hint)
+{
+ /* Assume no performance scaling when no driver */
+ if (!power_driver)
+ return 0;
+
+ return power_driver->go_slower(cpu, hint);
+}
+
--
1.7.9.5
Adds power driver hints at task enqueue/dequeue and at the sched tick if
the new cpu load doesn't match the current cpu capacity. The optional hint
argument of go_faster/slower calls is currently unused. It is meant to
give the power driver a rough estimate of how much more/less cpu capacity
is needed. The power driver may ignore this and the go_faster/slower call
completely.
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++++
kernel/sched/sched.h | 18 ++++++++++++++++++
2 files changed, 48 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index edba1bb..50ae0ce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2852,6 +2852,31 @@ static inline void hrtick_update(struct rq *rq)
}
#endif
+#ifdef CONFIG_SCHED_POWER
+static unsigned long weighted_cpuload(const int cpu);
+static unsigned long power_of(int cpu);
+
+static inline void inc_cpu_capacity(int cpu)
+{
+ if (weighted_cpuload(cpu) > power_of(cpu))
+ go_faster(cpu, 0);
+}
+
+static inline void dec_cpu_capacity(int cpu)
+{
+ if (weighted_cpuload(cpu) < power_of(cpu))
+ go_slower(cpu, 0);
+}
+#else
+static inline void inc_cpu_capacity(int cpu)
+{
+}
+
+static inline void dec_cpu_capacity(int cpu)
+{
+}
+#endif
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -2897,6 +2922,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_rq_runnable_avg(rq, rq->nr_running);
inc_nr_running(rq);
}
+
+ inc_cpu_capacity(task_cpu(p));
hrtick_update(rq);
}
@@ -2958,6 +2985,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
dec_nr_running(rq);
update_rq_runnable_avg(rq, 1);
}
+
+ dec_cpu_capacity(task_cpu(p));
hrtick_update(rq);
}
@@ -5743,6 +5772,7 @@ static void run_rebalance_domains(struct softirq_action *h)
*/
nohz_idle_balance(this_cpu, idle);
+ inc_cpu_capacity(this_cpu);
power_late_callback(this_cpu);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 907a967..88e7968 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1367,8 +1367,26 @@ static inline u64 irq_time_read(int cpu)
#ifdef CONFIG_SCHED_POWER
extern void power_late_callback(int cpu);
+extern int at_max_capacity(int cpu);
+extern int go_faster(int cpu, int hint);
+extern int go_slower(int cpu, int hint);
#else
static inline void power_late_callback(int cpu)
{
}
+
+static inline int at_max_capacity(int cpu)
+{
+ return 1;
+}
+
+static inline int go_faster(int cpu, int hint)
+{
+ return 0;
+}
+
+static inline int go_slower(int cpu, int hint)
+{
+ return 0;
+}
#endif /* CONFIG_SCHED_POWER */
--
1.7.9.5
Hi, Morten
On 10/12/2013 01:19 AM, Morten Rasmussen wrote:
[snip]
>
> @@ -5743,6 +5772,7 @@ static void run_rebalance_domains(struct softirq_action *h)
> */
> nohz_idle_balance(this_cpu, idle);
>
> + inc_cpu_capacity(this_cpu);
Just wondering is this check necessary here? if rq get more tasks during
the balance, enqueue_task() should already do the check each time when
we move_task(), isn't it?
Regards,
Michael Wang
> power_late_callback(this_cpu);
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 907a967..88e7968 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1367,8 +1367,26 @@ static inline u64 irq_time_read(int cpu)
>
> #ifdef CONFIG_SCHED_POWER
> extern void power_late_callback(int cpu);
> +extern int at_max_capacity(int cpu);
> +extern int go_faster(int cpu, int hint);
> +extern int go_slower(int cpu, int hint);
> #else
> static inline void power_late_callback(int cpu)
> {
> }
> +
> +static inline int at_max_capacity(int cpu)
> +{
> + return 1;
> +}
> +
> +static inline int go_faster(int cpu, int hint)
> +{
> + return 0;
> +}
> +
> +static inline int go_slower(int cpu, int hint)
> +{
> + return 0;
> +}
> #endif /* CONFIG_SCHED_POWER */
>
Hi Michael,
On Sat, Oct 12, 2013 at 03:58:07AM +0100, Michael wang wrote:
> Hi, Morten
>
> On 10/12/2013 01:19 AM, Morten Rasmussen wrote:
> [snip]
> >
> > @@ -5743,6 +5772,7 @@ static void run_rebalance_domains(struct softirq_action *h)
> > */
> > nohz_idle_balance(this_cpu, idle);
> >
> > + inc_cpu_capacity(this_cpu);
>
> Just wondering is this check necessary here? if rq get more tasks during
> the balance, enqueue_task() should already do the check each time when
> we move_task(), isn't it?
True. enqueue_task() should do the necessary check when we move tasks.
Hovewer, the cpu load may change over time without moving tasks if a
task changes its behavior. A small task may enter a cpu intensive phase
and grow bigger over time while staying on the rq due to the load
tracking. That won't be noticed unless we check the cpuload
periodically. That is the reason behind this check.
I will look at remove the redundant call to inc_cpu_capacity() when
actually did move tasks. More work is needed on the policy for
inc/dec_cpu_capacity(). I just added a few easy examples of how they can
be used for this post.
Thanks,
Morten
>
> Regards,
> Michael Wang
>
> > power_late_callback(this_cpu);
> > }
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 907a967..88e7968 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1367,8 +1367,26 @@ static inline u64 irq_time_read(int cpu)
> >
> > #ifdef CONFIG_SCHED_POWER
> > extern void power_late_callback(int cpu);
> > +extern int at_max_capacity(int cpu);
> > +extern int go_faster(int cpu, int hint);
> > +extern int go_slower(int cpu, int hint);
> > #else
> > static inline void power_late_callback(int cpu)
> > {
> > }
> > +
> > +static inline int at_max_capacity(int cpu)
> > +{
> > + return 1;
> > +}
> > +
> > +static inline int go_faster(int cpu, int hint)
> > +{
> > + return 0;
> > +}
> > +
> > +static inline int go_slower(int cpu, int hint)
> > +{
> > + return 0;
> > +}
> > #endif /* CONFIG_SCHED_POWER */
> >
>
>
On Fri, Oct 11, 2013 at 06:19:10PM +0100, Morten Rasmussen wrote:
> Hi,
>
> I have revised the previous power scheduler proposal[1] trying to address as
> many of the comments as possible. The overall idea was discussed at LPC[2,3].
> The revised design has removed the power scheduler and replaced it with a high
> level power driver interface. An interface that allows the scheduler to query
> the power driver for information and provide hints to guide power management
> decisions in the power driver.
>
> The power driver is going to be a unified platform power driver that can
> replace cpufreq and cpuidle drivers. Generic power policies will be optional
> helper functions called from the power driver. Platforms may choose to
> implement their own policies as part of their power driver.
>
> This RFC series prototypes a part of the power driver interface (cpu capacity
> hints) and shows how they can be used from the scheduler. More extensive use of
> the power driver hints and queries is left for later. The focus for now is the
> power driver interface. The patch series includes a power driver/cpufreq
> governor that can use existing cpufreq drivers as backend. It has been tested
> (not thoroughly) on ARM TC2. The cpufreq governor power driver implementation
> is rather horrible, but it illustrates how the power driver interface can be
> used. Native power drivers is on the todo list.
>
> The power driver interface is still missing quite a few calls to handle: Idle,
> adding extra information to the sched_domain hierarchy to guide scheduling
> decisions (packing), and possibly scaling of tracked load to compensate for
> frequency changes and asymmetric systems (big.LITTLE).
>
> This set is based on 3.11. I have done ARM TC2 testing based on linux-linaro
> 2013.08[4] to get cpufreq support for TC2.
What I'm missing is a general overview of why what and how.
In particular; how does this proposal lead to power savings. Is there a
mathematical model that supports this framework? Something where if you
give it a task set with global utilisation < 1 (ie. there's idle time),
it results in less power used.
Also, how does this proposal deal with cpufreq's fundamental broken
approach to SMP? Afaict nothing considers the effect of one cpu upon
another -- something which isn't true at all.
In fact, I don't see anything except a random bunch of hooks without an
over-all picture of how to get less power used.
On Fri, Oct 11, 2013 at 06:19:14PM +0100, Morten Rasmussen wrote:
> Removing power hints for kworker threads enables easier use of
> workqueues in the power driver late callback. That would otherwise
> lead to an endless loop unless it is prevented in the power driver.
There's many kworker users; some of them actually consume lots of
cputime. Therefore how did you come to the conclusion that excepting all
users was the better choice of a little added complexity in the one
place where it actually matters?
On Fri, Oct 11, 2013 at 06:19:12PM +0100, Morten Rasmussen wrote:
> Adds a late callback to the power driver interface which is called
> with irq enabled and no locks held. The power driver may postpone
> work that can't be done in schedule context (irq disabled and rq
> locks held) and let the late callback handle it.
What kind of things are we talking about here?
On Fri, Oct 11, 2013 at 06:19:13PM +0100, Morten Rasmussen wrote:
> +static inline void inc_cpu_capacity(int cpu)
> +{
> + if (weighted_cpuload(cpu) > power_of(cpu))
> + go_faster(cpu, 0);
> +}
> +
> +static inline void dec_cpu_capacity(int cpu)
> +{
> + if (weighted_cpuload(cpu) < power_of(cpu))
> + go_slower(cpu, 0);
> +}
It seems wrong to me to use weighted_cpuload() here; that contains the
task weight, which is irrelevant to power usage. I would expect a pure
utilization term here.
Something like:
se->avg.runnable_avg_sum / se->avg.runnable_avg_period
On 10/14/2013 6:33 AM, Peter Zijlstra wrote:
> On Fri, Oct 11, 2013 at 06:19:14PM +0100, Morten Rasmussen wrote:
>> Removing power hints for kworker threads enables easier use of
>> workqueues in the power driver late callback. That would otherwise
>> lead to an endless loop unless it is prevented in the power driver.
>
> There's many kworker users; some of them actually consume lots of
> cputime. Therefore how did you come to the conclusion that excepting all
> users was the better choice of a little added complexity in the one
> place where it actually matters?
.. and likely only for a very few architectures
x86, and I suspect modern ARM, can change frequency synchronously.
(using an instruction or maybe two or three for ARM)
On Mon, Oct 14, 2013 at 02:48:09PM +0100, Peter Zijlstra wrote:
> On Fri, Oct 11, 2013 at 06:19:13PM +0100, Morten Rasmussen wrote:
> > +static inline void inc_cpu_capacity(int cpu)
> > +{
> > + if (weighted_cpuload(cpu) > power_of(cpu))
> > + go_faster(cpu, 0);
> > +}
> > +
> > +static inline void dec_cpu_capacity(int cpu)
> > +{
> > + if (weighted_cpuload(cpu) < power_of(cpu))
> > + go_slower(cpu, 0);
> > +}
>
> It seems wrong to me to use weighted_cpuload() here; that contains the
> task weight, which is irrelevant to power usage. I would expect a pure
> utilization term here.
>
> Something like:
>
> se->avg.runnable_avg_sum / se->avg.runnable_avg_period
>
Fully agree. There is no unweighted equivalent to cfs.runnable_load_avg
but we could add it. It would be very useful for the power-aware
scheduling. It will add some overhead though.
Morten
On Mon, Oct 14, 2013 at 02:33:56PM +0100, Peter Zijlstra wrote:
> On Fri, Oct 11, 2013 at 06:19:14PM +0100, Morten Rasmussen wrote:
> > Removing power hints for kworker threads enables easier use of
> > workqueues in the power driver late callback. That would otherwise
> > lead to an endless loop unless it is prevented in the power driver.
>
> There's many kworker users; some of them actually consume lots of
> cputime. Therefore how did you come to the conclusion that excepting all
> users was the better choice of a little added complexity in the one
> place where it actually matters?
Agreed that it is not ideal. I will find a better solution for this
problem.
Thanks,
Morten
On 10/14/2013 9:10 AM, Morten Rasmussen wrote:
> On Mon, Oct 14, 2013 at 02:33:56PM +0100, Peter Zijlstra wrote:
>> On Fri, Oct 11, 2013 at 06:19:14PM +0100, Morten Rasmussen wrote:
>>> Removing power hints for kworker threads enables easier use of
>>> workqueues in the power driver late callback. That would otherwise
>>> lead to an endless loop unless it is prevented in the power driver.
>>
>> There's many kworker users; some of them actually consume lots of
>> cputime. Therefore how did you come to the conclusion that excepting all
>> users was the better choice of a little added complexity in the one
>> place where it actually matters?
>
> Agreed that it is not ideal. I will find a better solution for this
> problem.
>
can you explain which architectures we're talking about that need this?
is it only old stuff, or is there anything current ?
On Mon, Oct 14, 2013 at 02:32:34PM +0100, Peter Zijlstra wrote:
> On Fri, Oct 11, 2013 at 06:19:10PM +0100, Morten Rasmussen wrote:
> > Hi,
> >
> > I have revised the previous power scheduler proposal[1] trying to address as
> > many of the comments as possible. The overall idea was discussed at LPC[2,3].
> > The revised design has removed the power scheduler and replaced it with a high
> > level power driver interface. An interface that allows the scheduler to query
> > the power driver for information and provide hints to guide power management
> > decisions in the power driver.
> >
> > The power driver is going to be a unified platform power driver that can
> > replace cpufreq and cpuidle drivers. Generic power policies will be optional
> > helper functions called from the power driver. Platforms may choose to
> > implement their own policies as part of their power driver.
> >
> > This RFC series prototypes a part of the power driver interface (cpu capacity
> > hints) and shows how they can be used from the scheduler. More extensive use of
> > the power driver hints and queries is left for later. The focus for now is the
> > power driver interface. The patch series includes a power driver/cpufreq
> > governor that can use existing cpufreq drivers as backend. It has been tested
> > (not thoroughly) on ARM TC2. The cpufreq governor power driver implementation
> > is rather horrible, but it illustrates how the power driver interface can be
> > used. Native power drivers is on the todo list.
> >
> > The power driver interface is still missing quite a few calls to handle: Idle,
> > adding extra information to the sched_domain hierarchy to guide scheduling
> > decisions (packing), and possibly scaling of tracked load to compensate for
> > frequency changes and asymmetric systems (big.LITTLE).
> >
> > This set is based on 3.11. I have done ARM TC2 testing based on linux-linaro
> > 2013.08[4] to get cpufreq support for TC2.
>
> What I'm missing is a general overview of why what and how.
>
> In particular; how does this proposal lead to power savings. Is there a
> mathematical model that supports this framework? Something where if you
> give it a task set with global utilisation < 1 (ie. there's idle time),
> it results in less power used.
It is not there yet. What I'm proposing here is just the interface with
some very simple examples of how they can be used. It is not the
complete set of hooks, but a starting point. In the first round of
discussions it was clear that it is quite important to find an interface
that can work for everyone. To find a good power optimization model we
first need to know what information the platform (represented by the
power driver) can provide.
With guidance from the power driver about what level performance we can
expect from a cpu, and possibly also the power cost, we can make better
load-balancing decisions. We can add task packing support and let the
platform decide the degree of packing that meets the power/performance
target.
>
> Also, how does this proposal deal with cpufreq's fundamental broken
> approach to SMP? Afaict nothing considers the effect of one cpu upon
> another -- something which isn't true at all.
If you are referring to not doing anything clever with the affected cpus
in the power driver then yes. I doesn't do anything clever at the
moment. However, the go_faster/slower() interface would allow the power
driver to refuse increasing the frequency if the power cost can't be
justified for some reason. Using the power driver interface the
scheduler will know this and be able to try a different balance. Spread
load to other cpus in the same frequency domain that are less loaded, if
possible.
>
> In fact, I don't see anything except a random bunch of hooks without an
> over-all picture of how to get less power used.
I will follow up with a better description of the overall picture. The
slides I linked to are not really self-explaining.
Thanks,
Morten
On Mon, Oct 14, 2013 at 05:13:11PM +0100, Arjan van de Ven wrote:
> On 10/14/2013 9:10 AM, Morten Rasmussen wrote:
> > On Mon, Oct 14, 2013 at 02:33:56PM +0100, Peter Zijlstra wrote:
> >> On Fri, Oct 11, 2013 at 06:19:14PM +0100, Morten Rasmussen wrote:
> >>> Removing power hints for kworker threads enables easier use of
> >>> workqueues in the power driver late callback. That would otherwise
> >>> lead to an endless loop unless it is prevented in the power driver.
> >>
> >> There's many kworker users; some of them actually consume lots of
> >> cputime. Therefore how did you come to the conclusion that excepting all
> >> users was the better choice of a little added complexity in the one
> >> place where it actually matters?
> >
> > Agreed that it is not ideal. I will find a better solution for this
> > problem.
> >
>
> can you explain which architectures we're talking about that need this?
> is it only old stuff, or is there anything current ?
>
For now I need it for the cpufreq wrapper governor. I hope it can be
avoided for native power drivers, but I don't have a complete overview
of all architectures yet.
On Mon, Oct 14, 2013 at 06:15:41PM +0100, Morten Rasmussen wrote:
> > In fact, I don't see anything except a random bunch of hooks without an
> > over-all picture of how to get less power used.
>
> I will follow up with a better description of the overall picture. The
> slides I linked to are not really self-explaining.
I hadn't even noticed there were slides linked. In general I tend to
ignore external links -- patches should be descriptive enough to stand
on their own.
Hi,
On 10/14/2013 07:02 PM, Peter Zijlstra wrote:
> On Fri, Oct 11, 2013 at 06:19:10PM +0100, Morten Rasmussen wrote:
>> Hi,
>>
>> I have revised the previous power scheduler proposal[1] trying to address as
>> many of the comments as possible. The overall idea was discussed at LPC[2,3].
>> The revised design has removed the power scheduler and replaced it with a high
>> level power driver interface. An interface that allows the scheduler to query
>> the power driver for information and provide hints to guide power management
>> decisions in the power driver.
>>
>> The power driver is going to be a unified platform power driver that can
>> replace cpufreq and cpuidle drivers. Generic power policies will be optional
>> helper functions called from the power driver. Platforms may choose to
>> implement their own policies as part of their power driver.
>>
>> This RFC series prototypes a part of the power driver interface (cpu capacity
>> hints) and shows how they can be used from the scheduler. More extensive use of
>> the power driver hints and queries is left for later. The focus for now is the
>> power driver interface. The patch series includes a power driver/cpufreq
>> governor that can use existing cpufreq drivers as backend. It has been tested
>> (not thoroughly) on ARM TC2. The cpufreq governor power driver implementation
>> is rather horrible, but it illustrates how the power driver interface can be
>> used. Native power drivers is on the todo list.
>>
>> The power driver interface is still missing quite a few calls to handle: Idle,
>> adding extra information to the sched_domain hierarchy to guide scheduling
>> decisions (packing), and possibly scaling of tracked load to compensate for
>> frequency changes and asymmetric systems (big.LITTLE).
>>
>> This set is based on 3.11. I have done ARM TC2 testing based on linux-linaro
>> 2013.08[4] to get cpufreq support for TC2.
>
> What I'm missing is a general overview of why what and how.
I agree that the "why" needs to be mentioned very clearly since the
patchset revolves around it. As far as I understand we need a single
controller for deciding the power efficiency of the kernel, who is
exposed to all the user policies and the frequency+idle states stats of
the CPU to begin with. These stats are being supplied by the power driver.
Having these details and decision making in multiple places like we do
today in cpuidle, cpu-frequency and scheduler will probably cause
problems. For example, when the power efficiency of the kernel goes
wrong we have trouble point out the reason behind it. Where did the
problem arise from among the above three power policy decision makers?
This is a maintainability concern.
Another reason is the power saving decisions made by say cpuidle may
not complement the power saving decisions made by cpufreq. This can lead
to inconsistent results across different workloads.
Thus having a single policy maker for power savings we are hoping to
solve the primary concerns of consistent behaviour from the kernel in
terms of power efficiency and improved maintainability.
>
> In particular; how does this proposal lead to power savings. Is there a
> mathematical model that supports this framework? Something where if you
> give it a task set with global utilisation < 1 (ie. there's idle time),
> it results in less power used.
AFAIK, this patchset is an attempt to achieve consistency in the power
efficiency of the kernel across workloads with the existing algorithms,
in addition to a cleanup involving integration of the power policy
making in one place as explained above. In an attempt to do so, *maybe*
better power numbers can be obtained or at-least the default power
efficiency of the kernel will show up.
However adding the new patchsets like packing small tasks, heterogeneous
scheduling, power aware scheduling etc.. *should* then yield good and
consistent power savings since they now stand on top of an integrated
stable power driver.
Regards
Preeti U Murthy
>
> Also, how does this proposal deal with cpufreq's fundamental broken
> approach to SMP? Afaict nothing considers the effect of one cpu upon
> another -- something which isn't true at all.
>
> In fact, I don't see anything except a random bunch of hooks without an
> over-all picture of how to get less power used.
>
On Mon, Oct 14, 2013 at 06:31:13PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 14, 2013 at 06:15:41PM +0100, Morten Rasmussen wrote:
> > > In fact, I don't see anything except a random bunch of hooks without an
> > > over-all picture of how to get less power used.
> >
> > I will follow up with a better description of the overall picture. The
> > slides I linked to are not really self-explaining.
>
> I hadn't even noticed there were slides linked. In general I tend to
> ignore external links -- patches should be descriptive enough to stand
> on their own.
Elaborating a bit more on the big picture and where we can go with this
proposal, here are the main requirements:
1. A unified scheduler driven power policy, i.e. the scheduler drives
DVFS/idle (as suggested by Ingo and hence this first set of patches).
2. Small task packing. Avoid spreading tasks under light workloads.
In addition for big.LITTLE we need:
3. Task placement based cpu suitability. Associate task load ranges with
each cpu to give task placement. Heavy tasks on big, small tasks on
little.
This patch set addresses part of 1, while 3 will follow soon. Point 2 is
worked on by Vincent in collaboration with us.
The power driver introduced in this set has a role in the solution to
all three points. It serves as a unified platform power driver and the
interface allows the scheduler to get highlevel feedback which cpufreq
and cpuidle do not currently provide.
Decisions about the power/performance trade-off will be made in the
power driver guided by the hints from the scheduler. That allow
platforms the freedom to do what they want with the hints including
ignoring them completely (taking Arjan's previous comments into
account). It will make the power driver much powerful than the current
cpufreq/cpuidle drivers.
Morten
On Mon, Oct 14, 2013 at 04:14:25PM +0100, Arjan van de Ven wrote:
> On 10/14/2013 6:33 AM, Peter Zijlstra wrote:
> > On Fri, Oct 11, 2013 at 06:19:14PM +0100, Morten Rasmussen wrote:
> >> Removing power hints for kworker threads enables easier use of
> >> workqueues in the power driver late callback. That would otherwise
> >> lead to an endless loop unless it is prevented in the power driver.
> >
> > There's many kworker users; some of them actually consume lots of
> > cputime. Therefore how did you come to the conclusion that excepting all
> > users was the better choice of a little added complexity in the one
> > place where it actually matters?
>
> .. and likely only for a very few architectures
>
> x86, and I suspect modern ARM, can change frequency synchronously.
> (using an instruction or maybe two or three for ARM)
It should be possible to implement synchronous frequency changes on most
modern ARM platforms. It is a bit more than a few instructions to change
frequency though particularly for the current cpufreq drivers.
cpufreq drivers, like the one for ARM TC2, uses the clock framework to
manage clocks. clk_set_rate() is allowed to sleep which won't work if we
call it from scheduler context. The clock framework will need a look if
it doesn't provide a very fast synchronous alternative to clk_set_rate()
to change frequency and we want to use it for scheduler driven frequency
scaling.
cpufreq has pre- and post-change notifiers so the current TC2 clock driver
waits (yields) in its clk_set_rate() implementation until the change has
happened to ensure that the post-change notifier happens at the right
time. Since clk_set_rate() is allowed to sleep other tasks may be
running while waiting for the change to complete. This may be true for
other clock drivers as well.
AFAICT, there is no way to reuse the existing cpufreq drivers in a
sensible way for scheduler driven frequency scaling. It should be
possible to have very fast frequency changes on ARM but it is not the
way it is currently done.
On Thu, Oct 17, 2013 at 05:40:38PM +0100, Morten Rasmussen wrote:
> On Mon, Oct 14, 2013 at 04:14:25PM +0100, Arjan van de Ven wrote:
> > On 10/14/2013 6:33 AM, Peter Zijlstra wrote:
> > > On Fri, Oct 11, 2013 at 06:19:14PM +0100, Morten Rasmussen wrote:
> > >> Removing power hints for kworker threads enables easier use of
> > >> workqueues in the power driver late callback. That would otherwise
> > >> lead to an endless loop unless it is prevented in the power driver.
> > >
> > > There's many kworker users; some of them actually consume lots of
> > > cputime. Therefore how did you come to the conclusion that excepting all
> > > users was the better choice of a little added complexity in the one
> > > place where it actually matters?
> >
> > .. and likely only for a very few architectures
> >
> > x86, and I suspect modern ARM, can change frequency synchronously.
> > (using an instruction or maybe two or three for ARM)
>
> It should be possible to implement synchronous frequency changes on most
> modern ARM platforms. It is a bit more than a few instructions to change
> frequency though particularly for the current cpufreq drivers.
>
> cpufreq drivers, like the one for ARM TC2, uses the clock framework to
> manage clocks. clk_set_rate() is allowed to sleep which won't work if we
> call it from scheduler context. The clock framework will need a look if
> it doesn't provide a very fast synchronous alternative to clk_set_rate()
> to change frequency and we want to use it for scheduler driven frequency
> scaling.
>
> cpufreq has pre- and post-change notifiers so the current TC2 clock driver
> waits (yields) in its clk_set_rate() implementation until the change has
> happened to ensure that the post-change notifier happens at the right
> time. Since clk_set_rate() is allowed to sleep other tasks may be
> running while waiting for the change to complete. This may be true for
> other clock drivers as well.
>
> AFAICT, there is no way to reuse the existing cpufreq drivers in a
> sensible way for scheduler driven frequency scaling. It should be
> possible to have very fast frequency changes on ARM but it is not the
> way it is currently done.
Note that you still have preemption disabled in your late callback from
finish_task_switch(). There's no way you can wait/yield/whatever from
there. Nor is that really sane.
Just say no to the existing cruft ?
>>
>> cpufreq has pre- and post-change notifiers so the current TC2 clock driver
yeah those are EVIL ;-)
>> waits (yields) in its clk_set_rate() implementation until the change has
>> happened to ensure that the post-change notifier happens at the right
>> time. Since clk_set_rate() is allowed to sleep other tasks may be
>> running while waiting for the change to complete. This may be true for
>> other clock drivers as well.
>>
>> AFAICT, there is no way to reuse the existing cpufreq drivers in a
>> sensible way for scheduler driven frequency scaling.
that's the conclusion we came to as well about a year ago (and is also
why we're no longer using cpufreq core for the Intel pstate driver.
the locking/sleeping/callback/cpuhotplug/sysfs/etc stuff is just a MESS
for something that ends up being extremely simple if you just code the
sequence... for us it's just one register write to change... which shows this as an
extreme obviously)
>
> Note that you still have preemption disabled in your late callback from
> finish_task_switch(). There's no way you can wait/yield/whatever from
> there. Nor is that really sane.
the other fun one with this could be that if you have a series of scheduleable tasks
for changing stuff.... somehow you want to keep ordering in the requests, and only do
the last one/etc.
Not Fun(tm)
On Thu, Oct 17, 2013 at 05:54:16PM +0100, Peter Zijlstra wrote:
> On Thu, Oct 17, 2013 at 05:40:38PM +0100, Morten Rasmussen wrote:
> > On Mon, Oct 14, 2013 at 04:14:25PM +0100, Arjan van de Ven wrote:
> > > On 10/14/2013 6:33 AM, Peter Zijlstra wrote:
> > > > On Fri, Oct 11, 2013 at 06:19:14PM +0100, Morten Rasmussen wrote:
> > > >> Removing power hints for kworker threads enables easier use of
> > > >> workqueues in the power driver late callback. That would otherwise
> > > >> lead to an endless loop unless it is prevented in the power driver.
> > > >
> > > > There's many kworker users; some of them actually consume lots of
> > > > cputime. Therefore how did you come to the conclusion that excepting all
> > > > users was the better choice of a little added complexity in the one
> > > > place where it actually matters?
> > >
> > > .. and likely only for a very few architectures
> > >
> > > x86, and I suspect modern ARM, can change frequency synchronously.
> > > (using an instruction or maybe two or three for ARM)
> >
> > It should be possible to implement synchronous frequency changes on most
> > modern ARM platforms. It is a bit more than a few instructions to change
> > frequency though particularly for the current cpufreq drivers.
> >
> > cpufreq drivers, like the one for ARM TC2, uses the clock framework to
> > manage clocks. clk_set_rate() is allowed to sleep which won't work if we
> > call it from scheduler context. The clock framework will need a look if
> > it doesn't provide a very fast synchronous alternative to clk_set_rate()
> > to change frequency and we want to use it for scheduler driven frequency
> > scaling.
> >
> > cpufreq has pre- and post-change notifiers so the current TC2 clock driver
> > waits (yields) in its clk_set_rate() implementation until the change has
> > happened to ensure that the post-change notifier happens at the right
> > time. Since clk_set_rate() is allowed to sleep other tasks may be
> > running while waiting for the change to complete. This may be true for
> > other clock drivers as well.
> >
> > AFAICT, there is no way to reuse the existing cpufreq drivers in a
> > sensible way for scheduler driven frequency scaling. It should be
> > possible to have very fast frequency changes on ARM but it is not the
> > way it is currently done.
>
>
> Note that you still have preemption disabled in your late callback from
> finish_task_switch(). There's no way you can wait/yield/whatever from
> there. Nor is that really sane.
No, that is what I have realized after messing around trying to call
into cpufreq. It just won't work. A non-waiting/yielding/whatever driver
is needed. There is no point in having the late callback it won't solve
anything.
>
> Just say no to the existing cruft ?
That is the only way ahead I think. intel_pstate.c does it. I will into
what it takes to do something similar on ARM TC2.
On Thu, Oct 17, 2013 at 06:18:38PM +0100, Arjan van de Ven wrote:
> >>
> >> cpufreq has pre- and post-change notifiers so the current TC2 clock driver
>
> yeah those are EVIL ;-)
>
> >> waits (yields) in its clk_set_rate() implementation until the change has
> >> happened to ensure that the post-change notifier happens at the right
> >> time. Since clk_set_rate() is allowed to sleep other tasks may be
> >> running while waiting for the change to complete. This may be true for
> >> other clock drivers as well.
> >>
> >> AFAICT, there is no way to reuse the existing cpufreq drivers in a
> >> sensible way for scheduler driven frequency scaling.
>
> that's the conclusion we came to as well about a year ago (and is also
> why we're no longer using cpufreq core for the Intel pstate driver.
> the locking/sleeping/callback/cpuhotplug/sysfs/etc stuff is just a MESS
> for something that ends up being extremely simple if you just code the
> sequence... for us it's just one register write to change... which shows this as an
> extreme obviously)
We should be able to boil it down to a sequence on ARM as well. But it
means dropping cpufreq and looking at the clock framework.
Are you still using the pre- and post-change notifiers on Intel, or can
they be ignored safely?
>
>
> >
> > Note that you still have preemption disabled in your late callback from
> > finish_task_switch(). There's no way you can wait/yield/whatever from
> > there. Nor is that really sane.
>
> the other fun one with this could be that if you have a series of scheduleable tasks
> for changing stuff.... somehow you want to keep ordering in the requests, and only do
> the last one/etc.
> Not Fun(tm)
Agreed, I don't want to go there. Also, the overhead will probably kill any
benefit that there might be.
>
> We should be able to boil it down to a sequence on ARM as well. But it
> means dropping cpufreq and looking at the clock framework.
>
> Are you still using the pre- and post-change notifiers on Intel, or can
> they be ignored safely?
we do not use change notifiers
(since the hardware changes frequency independently of what "p state" we asked for
frequently and in many very common conditions, even if anything depended on being notified on changes...
it couldn't possibly work anyway)