2013-07-09 15:55:38

by Morten Rasmussen

Subject: [RFC][PATCH 0/9] sched: Power scheduler design proposal

Hi,

This patch set is an initial prototype aiming at the overall power-aware
scheduler design proposal that I previously described
<http://permalink.gmane.org/gmane.linux.kernel/1508480>.

The patch set introduces a cpu capacity managing 'power scheduler' which lives
alongside the existing (process) scheduler. Its role is to monitor the
system load and decide which cpus should be available to the process
scheduler. Long term, the power scheduler is intended to replace the currently
distributed, uncoordinated power management policies and will interface with a
unified platform-specific power driver to obtain power topology information and
handle idle and P-states. The power driver interface should be made flexible
enough to support multiple platforms, including Intel and ARM.

This prototype supports very simple task packing and adds a cpufreq wrapper
governor that allows the power scheduler to drive P-state selection. The
prototype policy is completely untuned, but this will be addressed in the
future, as will scalability improvements such as avoiding iteration over all
cpus.

Thanks,
Morten

Morten Rasmussen (9):
sched: Introduce power scheduler
sched: Redirect update_cpu_power to sched/power.c
sched: Make select_idle_sibling() skip cpu with a cpu_power of 1
sched: Make periodic load-balance disregard cpus with a cpu_power of
1
sched: Make idle_balance() skip cpus with a cpu_power of 1
sched: power: add power_domain data structure
sched: power: Add power driver interface
sched: power: Add initial frequency scaling support to power
scheduler
sched: power: cpufreq: Initial schedpower cpufreq governor

arch/arm/Kconfig | 2 +
drivers/cpufreq/Kconfig | 8 +
drivers/cpufreq/Makefile | 1 +
drivers/cpufreq/cpufreq_schedpower.c | 119 +++++++++++++
include/linux/sched/power.h | 29 ++++
kernel/Kconfig.power | 3 +
kernel/sched/Makefile | 1 +
kernel/sched/fair.c | 43 +++--
kernel/sched/power.c | 307 ++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 24 +++
10 files changed, 525 insertions(+), 12 deletions(-)
create mode 100644 drivers/cpufreq/cpufreq_schedpower.c
create mode 100644 include/linux/sched/power.h
create mode 100644 kernel/Kconfig.power
create mode 100644 kernel/sched/power.c

--
1.7.9.5


2013-07-09 15:55:47

by Morten Rasmussen

Subject: [RFC][PATCH 8/9] sched: power: Add initial frequency scaling support to power scheduler

Extends the power scheduler capacity management algorithm to handle
frequency scaling and provide basic frequency/P-state selection hints
to the power driver.
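
As a worked example of the hint calculation below: with CPU_TARGET at 80, a
cpu tracking a load of 400 (on the 0..1024 cpu_power scale) gets a req_power()
hint of (400 * 100) / 80 = 500, i.e. a P-state at which it would run at
roughly 80% utilization, while a cpu that is nearly full with a single task is
hinted up to its maximum available P-state (arch_power).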

Signed-off-by: Morten Rasmussen <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Catalin Marinas <[email protected]>
---
kernel/sched/power.c | 33 ++++++++++++++++++++++++++++-----
1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/power.c b/kernel/sched/power.c
index 9e44c0e..5fc32b0 100644
--- a/kernel/sched/power.c
+++ b/kernel/sched/power.c
@@ -21,6 +21,8 @@

#define INTERVAL 5 /* ms */
#define CPU_FULL 90 /* Busy %-age - TODO: Make tunable */
+#define CPU_TARGET 80 /* Target busy %-age - TODO: Make tunable */
+#define CPU_EMPTY 5 /* Idle noise %-age - TODO: Make tunable */

struct power_domain {
/* Domain hierarchy pointers */
@@ -87,7 +89,7 @@ static void update_cpu_load(void)
u32 sum = rq->avg.runnable_avg_sum;
u32 period = rq->avg.runnable_avg_period;

- load = (sum * power_sched_cpu_power(i)) / (period+1);
+ load = (sum * cpu_pd(i)->curr_power) / (period+1);
cpu_pd(i)->load = load;
cpu_pd(i)->nr_tasks = rq->nr_running;

@@ -160,19 +162,40 @@ static void calculate_cpu_capacities(void)

for_each_online_cpu(i) {
int t_cap = 0;
- int sched_power = cpu_pd(i)->sched_power;
+ int curr_power = cpu_pd(i)->curr_power;

stats = cpu_pd(i);
- t_cap = sched_power - stats->load;
+ t_cap = curr_power - stats->load;

- if (t_cap < (sched_power * (100-CPU_FULL)) / 100) {
+ if (t_cap < (curr_power * (100-CPU_FULL)) / 100) {
/* Potential for spreading load */
if (stats->nr_tasks > 1)
t_cap = -(stats->load / stats->nr_tasks);
+ /*
+ * Single task and higher p-state available on
+ * current cpu
+ */
+ else if (power_driver &&
+ curr_power < cpu_pd(i)->arch_power)
+ power_driver->req_power(i,
+ cpu_pd(i)->arch_power);
+ } else {
+ /* cpu not full - request lower p-state */
+ /*
+ * TODO global view of spare capacity is needed to do
+ * proper p-state selection
+ */
+ if (power_driver)
+ power_driver->req_power(i,
+ (stats->load*100)/CPU_TARGET);
+
+ /* Don't let noise keep the cpu awake */
+ if (t_cap > (curr_power * CPU_EMPTY) / 100)
+ t_cap = curr_power;
}

/* Do we have enough capacity already? */
- if (spare_cap + t_cap > sched_power) {
+ if (spare_cap + t_cap > curr_power) {
cpu_pd(i)->sched_power = 1;
} else {
cpu_pd(i)->sched_power = cpu_pd(i)->arch_power;
--
1.7.9.5

2013-07-09 15:55:43

by Morten Rasmussen

Subject: [RFC][PATCH 1/9] sched: Introduce power scheduler

Proof of concept capacity managing power scheduler. Supports simple
packing without any consideration of power topology. The power scheduler
is meant to use a platform specific power driver to obtain information
about power topology and select idle states and frequency/P-states.

For now, the power scheduler is called periodically on cpu0. This will be
replaced by calls from the scheduler in the future. Thresholds and other
defined constants will be made configurable, possibly set by the power driver,
in the future. Iteration over all cpus will also be optimized to ensure
scalability.
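
As an illustration of the packing decision implemented by
calculate_cpu_capacities() below, here is a stand-alone userspace sketch that
runs the same arithmetic on made-up load figures (the constants match the
patch; the per-cpu numbers are purely hypothetical):

#include <stdio.h>

#define CPU_FULL 90
#define SCHED_POWER_SCALE 1024

int main(void)
{
	/* hypothetical per-cpu loads and task counts on the cpu_power scale */
	int load[4] = { 300, 200, 100, 50 };
	int nr_tasks[4] = { 2, 1, 1, 1 };
	int spare_cap = 0;
	int i;

	for (i = 0; i < 4; i++) {
		int arch_power = SCHED_POWER_SCALE;
		int t_cap = arch_power - load[i];

		/* nearly full cpu: only count capacity freed by moving tasks */
		if (t_cap < (arch_power * (100 - CPU_FULL)) / 100 &&
		    nr_tasks[i] > 1)
			t_cap = -(load[i] / nr_tasks[i]);

		if (spare_cap + t_cap > arch_power) {
			printf("cpu%d: cpu_power=1 (disabled)\n", i);
		} else {
			printf("cpu%d: cpu_power=%d (enabled)\n", i, arch_power);
			spare_cap += t_cap;
		}
	}
	return 0;
}

With these numbers the total load fits comfortably on one cpu, so cpu0 keeps
its full cpu_power and cpu1-3 are reported as 1 to the process scheduler.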

Signed-off-by: Morten Rasmussen <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Catalin Marinas <[email protected]>
---
arch/arm/Kconfig | 2 +
kernel/Kconfig.power | 3 +
kernel/sched/Makefile | 1 +
kernel/sched/power.c | 161 +++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 167 insertions(+)
create mode 100644 kernel/Kconfig.power
create mode 100644 kernel/sched/power.c

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 2651b1d..04076ab 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1805,6 +1805,8 @@ config XEN
help
Say Y if you want to run Linux in a Virtual Machine on Xen on ARM.

+source "kernel/Kconfig.power"
+
endmenu

menu "Boot options"
diff --git a/kernel/Kconfig.power b/kernel/Kconfig.power
new file mode 100644
index 0000000..4fdaa13
--- /dev/null
+++ b/kernel/Kconfig.power
@@ -0,0 +1,3 @@
+config SCHED_POWER
+ bool "(EXPERIMENTAL) Power scheduler"
+ default n
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index deaf90e..67b01b2 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -17,3 +17,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_SCHED_POWER) += power.o
diff --git a/kernel/sched/power.c b/kernel/sched/power.c
new file mode 100644
index 0000000..ddf249f
--- /dev/null
+++ b/kernel/sched/power.c
@@ -0,0 +1,161 @@
+/*
+ * kernel/sched/power.c
+ *
+ * Copyright (C) 2013 ARM Limited.
+ * Author: Morten Rasmussen <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/percpu.h>
+#include <linux/workqueue.h>
+#include <linux/sched.h>
+
+#include "sched.h"
+
+#define INTERVAL 5 /* ms */
+#define CPU_FULL 90 /* Busy %-age - TODO: Make tunable */
+
+struct cpu_stats_struct {
+ int load;
+ int nr_tasks;
+};
+
+static unsigned long power_of(int cpu)
+{
+ return cpu_rq(cpu)->cpu_power;
+}
+
+DEFINE_PER_CPU(struct cpu_stats_struct, cpu_stats);
+
+/*
+ * update_cpu_load fetches runqueue statistics from the scheduler and should
+ * only be called with appropriate locks held.
+ */
+static void update_cpu_load(void)
+{
+ int i;
+
+ for_each_online_cpu(i) {
+ struct rq *rq = cpu_rq(i);
+ int load = 0;
+ u32 sum = rq->avg.runnable_avg_sum;
+ u32 period = rq->avg.runnable_avg_period;
+
+ load = (sum * power_of(i)) / (period+1);
+ per_cpu(cpu_stats, i).load = load;
+ per_cpu(cpu_stats, i).nr_tasks = rq->nr_running;
+
+ /* Take power scheduler kthread into account */
+ if (smp_processor_id() == i)
+ per_cpu(cpu_stats, i).nr_tasks--;
+ }
+}
+
+extern unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu);
+DEFINE_PER_CPU(unsigned long, arch_cpu_power);
+
+static void get_arch_cpu_power(void)
+{
+ int i;
+
+ if (sched_feat(ARCH_POWER)) {
+ for_each_online_cpu(i)
+ per_cpu(arch_cpu_power, i) =
+ arch_scale_freq_power(cpu_rq(i)->sd, i);
+ } else {
+ for_each_online_cpu(i)
+ per_cpu(arch_cpu_power, i) = SCHED_POWER_SCALE;
+ }
+}
+
+DEFINE_PER_CPU(unsigned long, cpu_power);
+
+/*
+ * power_sched_cpu_power is called from fair.c to get the power scheduler
+ * cpu capacities. We can't use arch_scale_freq_power() as this may already
+ * be defined by the platform.
+ */
+unsigned long power_sched_cpu_power(struct sched_domain *sd, int cpu)
+{
+ return per_cpu(cpu_power, cpu);
+}
+
+/*
+ * calculate_cpu_capacities figures out how many cpus are necessary
+ * to handle the current load. The current algorithm is very simple and
+ * does not take power topology into account and it does not scale the cpu
+ * capacity. It is either on or off. Plenty of potential for improvements!
+ */
+static void calculate_cpu_capacities(void)
+{
+ int i, spare_cap = 0;
+ struct cpu_stats_struct *stats;
+
+ /*
+ * spare_cap keeps track of the total available capacity across
+ * all cpus
+ */
+
+ for_each_online_cpu(i) {
+ int t_cap = 0;
+ int arch_power = per_cpu(arch_cpu_power, i);
+
+ stats = &per_cpu(cpu_stats, i);
+ t_cap = arch_power - stats->load;
+
+ if (t_cap < (arch_power * (100-CPU_FULL)) / 100) {
+ /* Potential for spreading load */
+ if (stats->nr_tasks > 1)
+ t_cap = -(stats->load / stats->nr_tasks);
+ }
+
+ /* Do we have enough capacity already? */
+ if (spare_cap + t_cap > arch_power) {
+ per_cpu(cpu_power, i) = 1;
+ } else {
+ per_cpu(cpu_power, i) = arch_power;
+ spare_cap += t_cap;
+ }
+ }
+}
+
+static void __power_schedule(void)
+{
+ rcu_read_lock();
+
+ get_arch_cpu_power();
+ update_cpu_load();
+ calculate_cpu_capacities();
+
+ rcu_read_unlock();
+}
+
+struct delayed_work dwork;
+
+/* Periodic power schedule target cpu */
+static int schedule_cpu(void)
+{
+ return 0;
+}
+
+void power_schedule_wq(struct work_struct *work)
+{
+ __power_schedule();
+ mod_delayed_work_on(schedule_cpu(), system_wq, &dwork,
+ msecs_to_jiffies(INTERVAL));
+}
+
+static int __init sched_power_init(void)
+{
+ INIT_DELAYED_WORK(&dwork, power_schedule_wq);
+ mod_delayed_work_on(schedule_cpu(), system_wq, &dwork,
+ msecs_to_jiffies(INTERVAL));
+ return 0;
+}
+late_initcall(sched_power_init);
--
1.7.9.5

2013-07-09 15:56:10

by Morten Rasmussen

Subject: [RFC][PATCH 9/9] sched: power: cpufreq: Initial schedpower cpufreq governor

Adds a 'schedpower' cpufreq governor that acts as a cpufreq driver
wrapper for the power scheduler. This enables the power scheduler
to initially use existing cpufreq drivers during development.
The long-term plan is unified, platform-specific power drivers.
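
With CONFIG_CPU_FREQ_GOV_SCHEDPOWER and CONFIG_SCHED_POWER enabled, the
wrapper should be selectable like any other cpufreq governor, e.g. via the
standard scaling_governor sysfs attribute (assuming the usual cpufreq sysfs
interface; the patch itself only adds the governor).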

Signed-off-by: Morten Rasmussen <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Catalin Marinas <[email protected]>
---
drivers/cpufreq/Kconfig | 8 +++
drivers/cpufreq/Makefile | 1 +
drivers/cpufreq/cpufreq_schedpower.c | 119 ++++++++++++++++++++++++++++++++++
3 files changed, 128 insertions(+)
create mode 100644 drivers/cpufreq/cpufreq_schedpower.c

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 534fcb8..f0d168d 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -184,6 +184,14 @@ config CPU_FREQ_GOV_CONSERVATIVE

If in doubt, say N.

+config CPU_FREQ_GOV_SCHEDPOWER
+ bool "'schedpower' governor/power driver"
+ depends on CPU_FREQ
+ depends on SCHED_POWER
+ help
+ 'schedpower' - this governor acts as a wrapper power driver for the
+ power scheduler.
+
config GENERIC_CPUFREQ_CPU0
tristate "Generic CPU0 cpufreq driver"
depends on HAVE_CLK && REGULATOR && PM_OPP && OF
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 315b923..6ff50ad 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_POWERSAVE) += cpufreq_powersave.o
obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE) += cpufreq_userspace.o
obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND) += cpufreq_ondemand.o
obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE) += cpufreq_conservative.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDPOWER) += cpufreq_schedpower.o
obj-$(CONFIG_CPU_FREQ_GOV_COMMON) += cpufreq_governor.o

# CPUfreq cross-arch helpers
diff --git a/drivers/cpufreq/cpufreq_schedpower.c b/drivers/cpufreq/cpufreq_schedpower.c
new file mode 100644
index 0000000..1b20adb
--- /dev/null
+++ b/drivers/cpufreq/cpufreq_schedpower.c
@@ -0,0 +1,119 @@
+/*
+ * Power driver to cpufreq wrapper for power scheduler
+ *
+ * drivers/cpufreq/cpufreq_schedpower.c
+ *
+ * Copyright (C) 2013 ARM Limited.
+ * Author: Morten Rasmussen <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/cpufreq.h>
+#include <linux/init.h>
+#include <linux/percpu.h>
+#include <linux/sched/power.h>
+
+static struct cpufreq_policy *cur_policy;
+
+DEFINE_PER_CPU(unsigned int, freq_req);
+
+static int cpufreq_governor_schedpower(struct cpufreq_policy *policy,
+ unsigned int event)
+{
+ int i;
+
+ switch (event) {
+ case CPUFREQ_GOV_START:
+ case CPUFREQ_GOV_LIMITS:
+ pr_debug("setting to %u kHz because of event %u\n",
+ policy->max, event);
+ __cpufreq_driver_target(policy, policy->max,
+ CPUFREQ_RELATION_H);
+
+ for_each_cpu(i, policy->cpus)
+ per_cpu(freq_req, i) = policy->max;
+
+ break;
+ default:
+ break;
+ }
+
+ cur_policy = policy;
+
+ return 0;
+}
+
+static
+struct cpufreq_governor cpufreq_gov_schedpower = {
+ .name = "schedpower",
+ .governor = cpufreq_governor_schedpower,
+ .owner = THIS_MODULE,
+};
+
+static struct sched_power_driver pdriver;
+
+static int __init cpufreq_gov_schedpower_init(void)
+{
+ int freq_reg;
+
+ cur_policy = NULL;
+
+ freq_reg = cpufreq_register_governor(&cpufreq_gov_schedpower);
+ if (freq_reg)
+ return freq_reg;
+ return sched_power_register_driver(&pdriver);
+}
+late_initcall(cpufreq_gov_schedpower_init);
+
+unsigned long pdriver_get_power(int cpu)
+{
+ if (!cur_policy)
+ return 1024;
+ return (cur_policy->cur * 1024)/cur_policy->max;
+}
+
+unsigned long pdriver_get_power_cap(int cpu)
+{
+ return 1024;
+}
+
+static unsigned long max_freq_req(void)
+{
+ int i;
+ int max = 0;
+
+ for_each_cpu(i, cur_policy->cpus) {
+ if (per_cpu(freq_req, i) > max)
+ max = per_cpu(freq_req, i);
+ }
+
+ if (max < cur_policy->min)
+ return cur_policy->min;
+ return max;
+}
+
+unsigned long pdriver_req_power(int cpu, unsigned long cpu_power)
+{
+ unsigned int target;
+ if (!cur_policy)
+ return 1024;
+ target = (cur_policy->max * cpu_power)/1024;
+
+ per_cpu(freq_req, cpu) = target;
+
+ cpufreq_driver_target(cur_policy, max_freq_req(), CPUFREQ_RELATION_H);
+
+ return (cur_policy->cur * 1024)/cur_policy->max;
+}
+
+static struct sched_power_driver pdriver = {
+ .get_power = pdriver_get_power,
+ .get_power_cap = pdriver_get_power_cap,
+ .req_power = pdriver_req_power,
+};
+
--
1.7.9.5

2013-07-09 15:55:41

by Morten Rasmussen

Subject: [RFC][PATCH 3/9] sched: Make select_idle_sibling() skip cpu with a cpu_power of 1

select_idle_sibling() must disregard cpus with cpu_power=1 to avoid using
cpus disabled by the power scheduler.

This is a quick fix. The algorithm should be updated to handle cpu_power=1
properly.

Signed-off-by: Morten Rasmussen <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Catalin Marinas <[email protected]>
---
kernel/sched/fair.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01f1f26..f637ea5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3287,13 +3287,14 @@ static int select_idle_sibling(struct task_struct *p, int target)
struct sched_group *sg;
int i = task_cpu(p);

- if (idle_cpu(target))
+ if (idle_cpu(target) && power_cpu_balance(target))
return target;

/*
* If the prevous cpu is cache affine and idle, don't be stupid.
*/
- if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
+ if (i != target && cpus_share_cache(i, target) && idle_cpu(i) &&
+ power_cpu_balance(i))
return i;

/*
@@ -3308,7 +3309,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
goto next;

for_each_cpu(i, sched_group_cpus(sg)) {
- if (i == target || !idle_cpu(i))
+ if (i == target || !idle_cpu(i) ||
+ !power_cpu_balance(i))
goto next;
}

--
1.7.9.5

2013-07-09 15:56:36

by Morten Rasmussen

Subject: [RFC][PATCH 7/9] sched: power: Add power driver interface

Initial power driver interface. The unified power driver is intended to
replace frequency and idle drivers and more. Currently only the frequency
scaling interface is considered. The power scheduler may query the
power driver for information about the current state and provide
hints for state changes.
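
As a minimal sketch of how a platform power driver is expected to hook into
this interface (the driver below is hypothetical and simply reports a fixed
capacity; the cpufreq wrapper in the following patch is a working example):

#include <linux/init.h>
#include <linux/sched/power.h>

/* hypothetical platform with no scaling: always report full cpu_power */
static unsigned long dummy_get_power(int cpu)
{
	return 1024;	/* average performance since last call */
}

static unsigned long dummy_get_power_cap(int cpu)
{
	return 1024;	/* highest performance state currently available */
}

static unsigned long dummy_req_power(int cpu, unsigned long cpu_power)
{
	return 1024;	/* nothing to change; report what is actually delivered */
}

static struct sched_power_driver dummy_power_driver = {
	.get_power	= dummy_get_power,
	.get_power_cap	= dummy_get_power_cap,
	.req_power	= dummy_req_power,
};

static int __init dummy_power_driver_init(void)
{
	return sched_power_register_driver(&dummy_power_driver);
}
late_initcall(dummy_power_driver_init);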

Signed-off-by: Morten Rasmussen <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Catalin Marinas <[email protected]>
---
include/linux/sched/power.h | 29 +++++++++++++++++++++++++++++
kernel/sched/power.c | 42 +++++++++++++++++++++++++++++++++++++++---
2 files changed, 68 insertions(+), 3 deletions(-)
create mode 100644 include/linux/sched/power.h

diff --git a/include/linux/sched/power.h b/include/linux/sched/power.h
new file mode 100644
index 0000000..ae11fec
--- /dev/null
+++ b/include/linux/sched/power.h
@@ -0,0 +1,29 @@
+/*
+ * include/linux/sched/power.h
+ *
+ * Copyright (C) 2013 ARM Limited.
+ * Author: Morten Rasmussen <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef _LINUX_SCHED_POWER_H
+#define _LINUX_SCHED_POWER_H
+
+struct sched_power_driver {
+ /* performance scaling:
+ * get_power returns the average performance (cpu_power) since last call
+ * get_power_cap returns the max available performance state (cpu_power)
+ * such that get_power_cap - get_power is the currently available
+ * capacity.
+ * req_power (optional) requests the driver to change the performance state.
+ */
+ unsigned long (*get_power) (int cpu);
+ unsigned long (*get_power_cap) (int cpu);
+ unsigned long (*req_power) (int cpu, unsigned long cpu_power);
+};
+
+int sched_power_register_driver(struct sched_power_driver *driver);
+int sched_power_unregister_driver(struct sched_power_driver *driver);
+#endif
diff --git a/kernel/sched/power.c b/kernel/sched/power.c
index 1ff8e4a..9e44c0e 100644
--- a/kernel/sched/power.c
+++ b/kernel/sched/power.c
@@ -15,6 +15,7 @@
#include <linux/percpu.h>
#include <linux/workqueue.h>
#include <linux/sched.h>
+#include <linux/sched/power.h>

#include "sched.h"

@@ -30,6 +31,8 @@ struct power_domain {
struct cpumask span;
/* current max power supported by platform */
unsigned long arch_power;
+ /* current power reported by power driver */
+ unsigned long curr_power;
/* cpu power exposed to the scheduler (fair.c) */
unsigned long sched_power;
/* load ratio (load tracking) */
@@ -38,6 +41,7 @@ struct power_domain {
};

static struct power_domain power_hierarchy;
+static struct sched_power_driver *power_driver;

DEFINE_PER_CPU(struct power_domain, *cpu_pds);

@@ -102,12 +106,26 @@ static void get_arch_cpu_power(void)
int i;

if (sched_feat(ARCH_POWER)) {
- for_each_online_cpu(i)
+ for_each_online_cpu(i) {
cpu_pd(i)->arch_power =
arch_scale_freq_power(cpu_rq(i)->sd, i);
+ cpu_pd(i)->curr_power = cpu_pd(i)->arch_power;
+ }
} else {
- for_each_online_cpu(i)
+ for_each_online_cpu(i) {
cpu_pd(i)->arch_power = SCHED_POWER_SCALE;
+ cpu_pd(i)->curr_power = cpu_pd(i)->arch_power;
+ }
+ }
+}
+
+static void get_driver_cpu_power(void)
+{
+ int i;
+
+ for_each_possible_cpu(i) {
+ cpu_pd(i)->arch_power = power_driver->get_power_cap(i);
+ cpu_pd(i)->curr_power = power_driver->get_power(i);
}
}

@@ -167,7 +185,10 @@ static void __power_schedule(void)
{
rcu_read_lock();

- get_arch_cpu_power();
+ if (!power_driver)
+ get_arch_cpu_power();
+ else
+ get_driver_cpu_power();
update_cpu_load();
calculate_cpu_capacities();

@@ -236,6 +257,21 @@ void power_schedule_wq(struct work_struct *work)
msecs_to_jiffies(INTERVAL));
}

+int sched_power_register_driver(struct sched_power_driver *driver)
+{
+ if (!driver->get_power || !driver->get_power_cap)
+ return -1;
+
+ power_driver = driver;
+ return 0;
+}
+
+int sched_power_unregister_driver(struct sched_power_driver *driver)
+{
+ power_driver = NULL;
+ return 0;
+}
+
static int __init sched_power_init(void)
{
init_power_hierarchy();
--
1.7.9.5

2013-07-09 15:56:41

by Morten Rasmussen

Subject: [RFC][PATCH 6/9] sched: power: add power_domain data structure

Initial proposal for power topology representation in the power
scheduler. For now there is just one global hierarchy; it will need a more
scalable layout later. More topology information will be added as
the power scheduler design evolves and implements power-topology-aware
frequency/P-state and idle state selection.
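
As a sketch of how the default hierarchy introduced below is meant to be
traversed (a hypothetical debug helper using the for_each_pd() macro from
this patch):

/* walk a cpu's power domains from its leaf domain up to the root */
static void dump_power_hierarchy(int cpu)
{
	struct power_domain *pd;

	for_each_pd(cpu, pd)
		pr_info("cpu%d: span weight=%u load=%d arch_power=%lu\n",
			cpu, cpumask_weight(&pd->span), pd->load,
			pd->arch_power);
}

With the default setup this prints two levels per cpu: the per-cpu leaf
domain and the single top-level domain spanning all cpus.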

Signed-off-by: Morten Rasmussen <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Catalin Marinas <[email protected]>
---
kernel/sched/power.c | 133 +++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 110 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/power.c b/kernel/sched/power.c
index ddf249f..1ff8e4a 100644
--- a/kernel/sched/power.c
+++ b/kernel/sched/power.c
@@ -21,18 +21,54 @@
#define INTERVAL 5 /* ms */
#define CPU_FULL 90 /* Busy %-age - TODO: Make tunable */

-struct cpu_stats_struct {
+struct power_domain {
+ /* Domain hierarchy pointers */
+ struct power_domain *parent;
+ struct power_domain *next;
+ struct power_domain *child;
+ /* Domain info */
+ struct cpumask span;
+ /* current max power supported by platform */
+ unsigned long arch_power;
+ /* cpu power exposed to the scheduler (fair.c) */
+ unsigned long sched_power;
+ /* load ratio (load tracking) */
int load;
int nr_tasks;
};

-static unsigned long power_of(int cpu)
+static struct power_domain power_hierarchy;
+
+DEFINE_PER_CPU(struct power_domain, *cpu_pds);
+
+#define cpu_pd(cpu) (per_cpu(cpu_pds, (cpu)))
+
+#define for_each_pd(cpu, __pd) \
+ for (__pd = cpu_pd(cpu); __pd; __pd = __pd->parent)
+
+/*
+ * update_hierarchy updates the power domain hierarchy with new information
+ * for a specific cpu
+ */
+static void update_hierarchy(int cpu)
{
- return cpu_rq(cpu)->cpu_power;
+ int i;
+ int domain_load;
+ int domain_arch_power;
+ struct power_domain *pd;
+
+ for_each_pd(cpu, pd) {
+ domain_load = 0;
+ domain_arch_power = 0;
+ for_each_cpu_mask(i, pd->span) {
+ domain_load += cpu_pd(i)->load;
+ domain_arch_power += cpu_pd(i)->arch_power;
+ }
+ pd->load = domain_load;
+ pd->arch_power = domain_arch_power;
+ }
}

-DEFINE_PER_CPU(struct cpu_stats_struct, cpu_stats);
-
/*
* update_cpu_load fetches runqueue statistics from the scheduler and should
* only be called with appropriate locks held.
@@ -47,18 +83,19 @@ static void update_cpu_load(void)
u32 sum = rq->avg.runnable_avg_sum;
u32 period = rq->avg.runnable_avg_period;

- load = (sum * power_of(i)) / (period+1);
- per_cpu(cpu_stats, i).load = load;
- per_cpu(cpu_stats, i).nr_tasks = rq->nr_running;
+ load = (sum * power_sched_cpu_power(i)) / (period+1);
+ cpu_pd(i)->load = load;
+ cpu_pd(i)->nr_tasks = rq->nr_running;

/* Take power scheduler kthread into account */
if (smp_processor_id() == i)
- per_cpu(cpu_stats, i).nr_tasks--;
+ cpu_pd(i)->nr_tasks--;
+
+ update_hierarchy(i);
}
}

extern unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu);
-DEFINE_PER_CPU(unsigned long, arch_cpu_power);

static void get_arch_cpu_power(void)
{
@@ -66,16 +103,14 @@ static void get_arch_cpu_power(void)

if (sched_feat(ARCH_POWER)) {
for_each_online_cpu(i)
- per_cpu(arch_cpu_power, i) =
+ cpu_pd(i)->arch_power =
arch_scale_freq_power(cpu_rq(i)->sd, i);
} else {
for_each_online_cpu(i)
- per_cpu(arch_cpu_power, i) = SCHED_POWER_SCALE;
+ cpu_pd(i)->arch_power = SCHED_POWER_SCALE;
}
}

-DEFINE_PER_CPU(unsigned long, cpu_power);
-
/*
* power_sched_cpu_power is called from fair.c to get the power scheduler
* cpu capacities. We can't use arch_scale_freq_power() as this may already
@@ -83,7 +118,10 @@ DEFINE_PER_CPU(unsigned long, cpu_power);
*/
unsigned long power_sched_cpu_power(struct sched_domain *sd, int cpu)
{
- return per_cpu(cpu_power, cpu);
+ if (cpu_pd(cpu))
+ return cpu_pd(cpu)->sched_power;
+ else
+ return SCHED_POWER_SCALE;
}

/*
@@ -95,7 +133,7 @@ unsigned long power_sched_cpu_power(struct sched_domain *sd, int cpu)
static void calculate_cpu_capacities(void)
{
int i, spare_cap = 0;
- struct cpu_stats_struct *stats;
+ struct power_domain *stats;

/*
* spare_cap keeps track of the total available capacity across
@@ -104,22 +142,22 @@ static void calculate_cpu_capacities(void)

for_each_online_cpu(i) {
int t_cap = 0;
- int arch_power = per_cpu(arch_cpu_power, i);
+ int sched_power = cpu_pd(i)->sched_power;

- stats = &per_cpu(cpu_stats, i);
- t_cap = arch_power - stats->load;
+ stats = cpu_pd(i);
+ t_cap = sched_power - stats->load;

- if (t_cap < (arch_power * (100-CPU_FULL)) / 100) {
+ if (t_cap < (sched_power * (100-CPU_FULL)) / 100) {
/* Potential for spreading load */
if (stats->nr_tasks > 1)
t_cap = -(stats->load / stats->nr_tasks);
}

/* Do we have enough capacity already? */
- if (spare_cap + t_cap > arch_power) {
- per_cpu(cpu_power, i) = 1;
+ if (spare_cap + t_cap > sched_power) {
+ cpu_pd(i)->sched_power = 1;
} else {
- per_cpu(cpu_power, i) = arch_power;
+ cpu_pd(i)->sched_power = cpu_pd(i)->arch_power;
spare_cap += t_cap;
}
}
@@ -136,6 +174,53 @@ static void __power_schedule(void)
rcu_read_unlock();
}

+static void init_power_domain(struct power_domain *pd)
+{
+ pd->parent = NULL;
+ pd->next = pd;
+ pd->child = NULL;
+ pd->load = 0;
+ pd->arch_power = 0;
+ pd->sched_power = 0;
+ cpumask_copy(&pd->span, cpu_possible_mask);
+}
+
+/*
+ * init_power_hierarchy sets up the default power domain hierarchy with
+ * one top level domain spanning all cpus and child domains for each cpu.
+ * next points to the next power domain at the current level and forms a
+ * circular list.
+ */
+static void init_power_hierarchy(void)
+{
+ int cpu, next_cpu;
+ struct power_domain *pd;
+
+ init_power_domain(&power_hierarchy);
+ cpumask_copy(&power_hierarchy.span, cpu_possible_mask);
+
+ pd = kzalloc(sizeof(struct power_domain) * nr_cpu_ids, GFP_KERNEL);
+
+ cpu = cpumask_next(-1, &power_hierarchy.span);
+
+ while (cpu < nr_cpu_ids) {
+ cpu_pd(cpu) = &pd[cpu];
+ cpu_pd(cpu)->parent = &power_hierarchy;
+ cpu_pd(cpu)->child = NULL;
+ cpumask_copy(&(cpu_pd(cpu)->span), get_cpu_mask(cpu));
+ cpu_pd(cpu)->arch_power = 1;
+ cpu_pd(cpu)->sched_power = 1;
+
+ next_cpu = cpumask_next(cpu, &power_hierarchy.span);
+ if (next_cpu < nr_cpu_ids)
+ cpu_pd(cpu)->next = &pd[next_cpu];
+ else
+ cpu_pd(cpu)->next =
+ &pd[cpumask_first(&power_hierarchy.span)];
+ cpu = next_cpu;
+ }
+}
+
struct delayed_work dwork;

/* Periodic power schedule target cpu */
@@ -153,6 +238,8 @@ void power_schedule_wq(struct work_struct *work)

static int __init sched_power_init(void)
{
+ init_power_hierarchy();
+
INIT_DELAYED_WORK(&dwork, power_schedule_wq);
mod_delayed_work_on(schedule_cpu(), system_wq, &dwork,
msecs_to_jiffies(INTERVAL));
--
1.7.9.5

2013-07-09 15:57:10

by Morten Rasmussen

Subject: [RFC][PATCH 5/9] sched: Make idle_balance() skip cpus with a cpu_power of 1

idle_balance() should disregard cpus disabled by the power scheduler.

This is a quick fix. idle_balance() should be revisited to implement proper
handling of cpus with cpu_power=1.

Signed-off-by: Morten Rasmussen <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Catalin Marinas <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4610463..a59617b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5261,6 +5261,10 @@ void idle_balance(int this_cpu, struct rq *this_rq)
if (this_rq->avg_idle < sysctl_sched_migration_cost)
return;

+ /* Don't pull tasks if disabled by the power scheduler */
+ if (!power_cpu_balance(this_cpu))
+ return;
+
/*
* Drop the rq->lock, but keep IRQ/preempt disabled.
*/
--
1.7.9.5

2013-07-09 15:57:30

by Morten Rasmussen

Subject: [RFC][PATCH 4/9] sched: Make periodic load-balance disregard cpus with a cpu_power of 1

Some of the load_balance() helper functions will put tasks on cpus with
cpu_power=1 when they are completely idle. This patch changes this behaviour.

The patch is a quick fix. The load_balance() helper functions should be
revisited to implement proper handling of cpus with cpu_power=1.

Signed-off-by: Morten Rasmussen <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Catalin Marinas <[email protected]>
---
kernel/sched/fair.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f637ea5..4610463 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3952,6 +3952,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
}

/*
+ * Vacate cpus disabled by the power scheduler even if the cache is
+ * hot
+ */
+ if (!power_cpu_balance(env->src_cpu))
+ return 1;
+
+ /*
* Aggressive migration if:
* 1) task is cache cold, or
* 2) too many balance attempts have failed.
@@ -4500,6 +4507,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
update_group_power(env->sd, env->dst_cpu);
} else if (time_after_eq(jiffies, group->sgp->next_update))
update_group_power(env->sd, env->dst_cpu);
+
+ if (!power_cpu_balance(env->dst_cpu)) {
+ *balance = 0;
+ return;
+ }
}

/* Adjust by relative CPU power of the group */
--
1.7.9.5

2013-07-09 15:57:28

by Morten Rasmussen

Subject: [RFC][PATCH 2/9] sched: Redirect update_cpu_power to sched/power.c

With CONFIG_SCHED_POWER enabled, update_cpu_power() gets the capacity
managed cpu_power from the power scheduler instead of
arch_scale_freq_power().

Signed-off-by: Morten Rasmussen <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Catalin Marinas <[email protected]>
---
kernel/sched/fair.c | 19 ++++++++++---------
kernel/sched/sched.h | 24 ++++++++++++++++++++++++
2 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c61a614..01f1f26 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3214,6 +3214,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
tsk_cpus_allowed(p)))
continue;

+ /* Group restricted by power scheduler (cpu_power=1) */
+ if (!power_group_balance(group))
+ continue;
+
local_group = cpumask_test_cpu(this_cpu,
sched_group_cpus(group));

@@ -3258,6 +3262,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)

/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
+
+ /* Skip cpus disabled by power scheduler */
+ if (!power_cpu_balance(i))
+ continue;
+
load = weighted_cpuload(i);

if (load < min_load || (load == min_load && i == this_cpu)) {
@@ -4265,11 +4274,6 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
return load_idx;
}

-static unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
-{
- return SCHED_POWER_SCALE;
-}
-
unsigned long __weak arch_scale_freq_power(struct sched_domain *sd, int cpu)
{
return default_scale_freq_power(sd, cpu);
@@ -4336,10 +4340,7 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)

sdg->sgp->power_orig = power;

- if (sched_feat(ARCH_POWER))
- power *= arch_scale_freq_power(sd, cpu);
- else
- power *= default_scale_freq_power(sd, cpu);
+ power *= power_sched_cpu_power(sd, cpu);

power >>= SCHED_POWER_SHIFT;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ce39224..2e62faa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1377,3 +1377,27 @@ static inline u64 irq_time_read(int cpu)
}
#endif /* CONFIG_64BIT */
#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
+static inline unsigned long default_scale_freq_power(struct sched_domain *sd,
+ int cpu)
+{
+ return SCHED_POWER_SCALE;
+}
+
+extern unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu);
+
+#ifdef CONFIG_SCHED_POWER
+extern unsigned long power_sched_cpu_power(struct sched_domain *sd, int cpu);
+#define power_cpu_balance(cpu) (cpu_rq(cpu)->cpu_power > 1)
+#define power_group_balance(group) (group->sgp->power > group->group_weight)
+#else
+static inline unsigned long power_sched_cpu_power(struct sched_domain *sd,
+ int cpu)
+{
+ if (sched_feat(ARCH_POWER))
+ return arch_scale_freq_power(sd, cpu);
+ return default_scale_freq_power(sd, cpu);
+}
+#define power_cpu_balance(cpu) 1
+#define power_group_balance(group) 1
+#endif
--
1.7.9.5

2013-07-09 16:48:24

by Arjan van de Ven

Subject: Re: [RFC][PATCH 1/9] sched: Introduce power scheduler

On 7/9/2013 8:55 AM, Morten Rasmussen wrote:
> +void power_schedule_wq(struct work_struct *work)
> +{
> + __power_schedule();
> + mod_delayed_work_on(schedule_cpu(), system_wq, &dwork,
> + msecs_to_jiffies(INTERVAL));
> +}

please tell me this is at least a deferrable timer kind of work?

waking the cpu up all the time even when idle is a huge negative on power ;-(

2013-07-09 16:58:57

by Arjan van de Ven

Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On 7/9/2013 8:55 AM, Morten Rasmussen wrote:
> Hi,
>
> This patch set is an initial prototype aiming at the overall power-aware
> scheduler design proposal that I previously described
> <http://permalink.gmane.org/gmane.linux.kernel/1508480>.
>
> The patch set introduces a cpu capacity managing 'power scheduler' which lives
> by the side of the existing (process) scheduler. Its role is to monitor the
> system load and decide which cpus that should be available to the process
> scheduler. Long term the power scheduler is intended to replace the currently
> distributed uncoordinated power management policies and will interface a
> unified platform specific power driver obtain power topology information and
> handle idle and P-states. The power driver interface should be made flexible
> enough to support multiple platforms including Intel and ARM.
>
I quickly browsed through it but have a hard time seeing what the
real interface is between the scheduler and the hardware driver.
What information does the scheduler give the hardware driver exactly?
e.g. what does it mean?

If the interface is "go faster please" or "we need you to be at fastest now",
that doesn't sound too bad.
But if the interface is "you should be at THIS number" that is pretty bad and
not going to work for us.

also, it almost looks like there is a fundamental assumption in the code
that you can get the current effective P state to make scheduler decisions on;
on Intel at least that is basically impossible... and getting more so with every generation
(likewise for AMD afaics)

(you can get what you ran at on average over some time in the past, but not
what you're at now or going forward)

I'm rather nervous about calculating how many cores you want active as a core scheduler feature.
I understand that for your big.LITTLE architecture you need this due to the asymmetry,
but as a general rule for more symmetric systems it's known to be suboptimal by quite a
real percentage. For a normal Intel single CPU system it's sort of the worst case you can do
in that it leads to serializing tasks that could have run in parallel over multiple cores/threads.
So at minimum this kind of logic must be enabled/disabled based on architecture decisions.




2013-07-10 02:10:21

by Arjan van de Ven

Subject: Re: [RFC][PATCH 1/9] sched: Introduce power scheduler

On 7/9/2013 8:55 AM, Morten Rasmussen wrote:
> + mod_delayed_work_on(schedule_cpu(), system_wq, &dwork,
> + msecs_to_jiffies(INTERVAL));

so thinking about this more, this really really should not be a work queue.

a work queue will cause a large number of context switches for no reason
(on Intel and AMD you can switch P state from interrupt context, and I'm pretty sure
that holds for many ARM as well)

and in addition, it causes some really nasty cases, especially around real time tasks.
Your workqueue will schedule a kernel thread, which will run
BEHIND real time tasks, and such real time task will then never be able to start running at a higher performance.

(and with the delta between lowest and highest performance sometimes being 10x or more,
the real time task will be running SLOW... quite possible longer than several milliseconds)


and all for no good reason; a normal timer running in irq context would be much better for this kind of thing!

2013-07-10 11:11:45

by Morten Rasmussen

Subject: Re: [RFC][PATCH 1/9] sched: Introduce power scheduler

On Wed, Jul 10, 2013 at 03:10:15AM +0100, Arjan van de Ven wrote:
> On 7/9/2013 8:55 AM, Morten Rasmussen wrote:
> > + mod_delayed_work_on(schedule_cpu(), system_wq, &dwork,
> > + msecs_to_jiffies(INTERVAL));
>
> so thinking about this more, this really really should not be a work queue.
> a work queue will cause a large number of context switches for no reason
> (on Intel and AMD you can switch P state from interrupt context, and I'm pretty sure
> that holds for many ARM as well)

Agree. I should have made it clear this is only a temporary solution. I
would prefer to tie the power scheduler to the existing scheduler tick
instead so we don't wake up cpus unnecessarily. nohz may be able to handle
that for us. Also, currently the power scheduler updates all cpus.
Going forward this would change to per cpu updates and partial updates
of the global view to improve scalability.

>
> and in addition, it causes some really nasty cases, especially around real time tasks.
> Your workqueue will schedule a kernel thread, which will run
> BEHIND real time tasks, and such real time task will then never be able to start running at a higher performance.
>
> (and with the delta between lowest and highest performance sometimes being 10x or more,
> the real time task will be running SLOW... quite possible longer than several milliseconds)
>
> and all for no good reason; a normal timer running in irq context would be much better for this kind of thing!
>
>

2013-07-10 11:16:38

by Morten Rasmussen

Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Tue, Jul 09, 2013 at 05:58:55PM +0100, Arjan van de Ven wrote:
> On 7/9/2013 8:55 AM, Morten Rasmussen wrote:
> > Hi,
> >
> > This patch set is an initial prototype aiming at the overall power-aware
> > scheduler design proposal that I previously described
> > <http://permalink.gmane.org/gmane.linux.kernel/1508480>.
> >
> > The patch set introduces a cpu capacity managing 'power scheduler' which lives
> > by the side of the existing (process) scheduler. Its role is to monitor the
> > system load and decide which cpus that should be available to the process
> > scheduler. Long term the power scheduler is intended to replace the currently
> > distributed uncoordinated power management policies and will interface a
> > unified platform specific power driver obtain power topology information and
> > handle idle and P-states. The power driver interface should be made flexible
> > enough to support multiple platforms including Intel and ARM.
> >
> I quickly browsed through it but have a hard time seeing what the
> real interface is between the scheduler and the hardware driver.
> What information does the scheduler give the hardware driver exactly?
> e.g. what does it mean?
>
> If the interface is "go faster please" or "we need you to be at fastest now",
> that doesn't sound too bad.
> But if the interface is "you should be at THIS number" that is pretty bad and
> not going to work for us.

It is the former.

The current power driver interface (which is far from complete)
basically allows the power scheduler to get the current P-state, the
maximum available P-state, and provide P-state change hints. The current
P-state is not the instantaneous P-state, but an average over some
period of time. Since last query would work. (I should have called it
avg instead of curr.) Knowing that and also the maximum available
P-state at that point in time (may change over time due to thermal or
power budget constraints) allows the power scheduler to reason about the
spare capacity of the cpus and decide whether a P-state change is enough
or if the load must be spread across more cpus.

The P-state change request allows the power scheduler to ask the power
driver to go faster or slower. I was initially thinking about having a
simple up/down interface, but realized that it would not be sufficient
as the power driver wouldn't necessarily know how much it should go up or
down. When the cpu load is decreasing the power scheduler should be able
to determine fairly accurately how much compute capacity is needed.
So I think it makes sense to pass this information to the power driver.

For some platforms the power driver may use the P-state hint directly to
choose the next P-state. The schedpower cpufreq wrapper governor is an
example of this. Others may have much more sophisticated power drivers
that take platform specific constraints into account and select whatever
P-state they like. The intention is that the P-state request will return
the actual P-state selected by the power driver so the power scheduler
can act accordingly.

The power driver interface uses a cpu_power-like P-state abstraction to
avoid dealing with frequencies in the power scheduler.
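
As a worked example with hypothetical numbers on the 0..1024 cpu_power scale:
if get_power() reports 512, get_power_cap() reports 1024 and the tracked load
is around 600, the cpu is over-utilized at its current P-state but not at its
cap, so a req_power() hint of roughly 750 is enough; if get_power_cap() had
also returned 512, no P-state change could help and the load would have to be
spread across more cpus.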

>
> also, it almost looks like there is a fundamental assumption in the code
> that you can get the current effective P state to make scheduler decisions on;
> on Intel at least that is basically impossible... and getting more so with every generation
> (likewise for AMD afaics)
>
> (you can get what you ran at on average over some time in the past, but not
> what you're at now or going forward)
>

As described above, it is not a strict assumption. From a scheduler
point of view we somehow need to know whether the cpus are truly fully
utilized (at their highest P-state), in which case we need to throw more cpus
at the problem (assuming that we have more than one task per cpu), or whether
we can just go to a higher P-state. We don't need a strict guarantee that we
get exactly the P-state that we request for each cpu. The power
scheduler generates hints and the power driver gives us feedback on what
we can roughly expect to get.

> I'm rather nervous about calculating how many cores you want active as a core scheduler feature.
> I understand that for your big.LITTLE architecture you need this due to the asymmetry,
> but as a general rule for more symmetric systems it's known to be suboptimal by quite a
> real percentage. For a normal Intel single CPU system it's sort of the worst case you can do
> in that it leads to serializing tasks that could have run in parallel over multiple cores/threads.
> So at minimum this kind of logic must be enabled/disabled based on architecture decisions.

Packing clearly has to take power topology into account and do the right
thing for the particular platform. It is not in place yet, but will be
addressed. I believe it would make sense for dual cpu Intel systems to
pack at socket level? I fully understand that it won't make sense for
single cpu Intel systems or inside each cpu in a dual cpu Intel system.

For ARM it depends on the particular implementation. For big.LITTLE, where
you have two cpu clusters (big and little), the clusters may have different
C-states. It may make sense to pack between clusters and inside one
cluster, but not the other. The power scheduler must be able to handle
this. The power driver should provide the necessary platform information
as part of the power topology.

2013-07-10 11:19:22

by Vincent Guittot

Subject: Re: [RFC][PATCH 1/9] sched: Introduce power scheduler

On 10 July 2013 13:11, Morten Rasmussen <[email protected]> wrote:
> On Wed, Jul 10, 2013 at 03:10:15AM +0100, Arjan van de Ven wrote:
>> On 7/9/2013 8:55 AM, Morten Rasmussen wrote:
>> > + mod_delayed_work_on(schedule_cpu(), system_wq, &dwork,
>> > + msecs_to_jiffies(INTERVAL));
>>
>> so thinking about this more, this really really should not be a work queue.
>> a work queue will cause a large number of context switches for no reason
>> (on Intel and AMD you can switch P state from interrupt context, and I'm pretty sure
>> that holds for many ARM as well)
>
> Agree. I should have made it clear this is only a temporary solution. I
> would prefer to tie the power scheduler to the existing scheduler tick
> instead so we don't wake up cpus unnecessarily. nohz may be able handle
> that for us. Also, currently the power scheduler updates all cpus.
> Going forward this would change to per cpu updates and partial updates
> of the global view to improve scalability.

For the task packing patches, we are using the periodic load balance
sequence to update the activity, like it is done for cpu_power. I
plan to update the packing patches to see how they can cooperate
with Morten's patches, as they have similar needs.

>
>>
>> and in addition, it causes some really nasty cases, especially around real time tasks.
>> Your workqueue will schedule a kernel thread, which will run
>> BEHIND real time tasks, and such real time task will then never be able to start running at a higher performance.
>>
>> (and with the delta between lowest and highest performance sometimes being 10x or more,
>> the real time task will be running SLOW... quite possible longer than several milliseconds)
>>
>> and all for no good reason; a normal timer running in irq context would be much better for this kind of thing!
>>
>>
>

2013-07-10 13:05:03

by Arjan van de Ven

Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal


>
>>
>> also, it almost looks like there is a fundamental assumption in the code
>> that you can get the current effective P state to make scheduler decisions on;
>> on Intel at least that is basically impossible... and getting more so with every generation
>> (likewise for AMD afaics)
>>
>> (you can get what you ran at on average over some time in the past, but not
>> what you're at now or going forward)
>>
>
> As described above, it is not a strict assumption. From a scheduler
> point of view we somehow need to know if the cpus are truly fully
> utilized (at their highest P-state)

unfortunately we can't provide this on Intel ;-(
we can provide you what you ran at average, we cannot provide you if that is the max or not

(first of all, because we outright don't know what the max would have been, and second,
because we may be running slower than max because the workload was memory bound or
any of the other conditions that makes the HW P state "governor" decide to reduce
frequency for efficiency reasons)

> so we need to throw more cpus at the
> problem (assuming that we have more than one task per cpu) or if we can
> just go to a higher P-state. We don't need a strict guarantee that we
> get exactly the P-state that we request for each cpu. The power
> scheduler generates hints and the power driver gives us feedback on what
> we can roughly expect to get.


>
>> I'm rather nervous about calculating how many cores you want active as a core scheduler feature.
>> I understand that for your big.LITTLE architecture you need this due to the asymmetry,
>> but as a general rule for more symmetric systems it's known to be suboptimal by quite a
>> real percentage. For a normal Intel single CPU system it's sort of the worst case you can do
>> in that it leads to serializing tasks that could have run in parallel over multiple cores/threads.
>> So at minimum this kind of logic must be enabled/disabled based on architecture decisions.
>
> Packing clearly has to take power topology into account and do the right
> thing for the particular platform. It is not in place yet, but will be
> addressed. I believe it would make sense for dual cpu Intel systems to
> pack at socket level?

a little bit. if you have 2 quad core systems, it will make sense to pack 2 tasks
onto a single core, assuming they are not cache or memory bandwidth bound (remember this is numa!)
but if you have 4 tasks, it's not likely to be worth it to pack, unless you get an enormous
economy of scale due to cache sharing
(this is far more about getting numa balancing right than about power; you're not very likely
to win back the power you lose from inefficiency if you get the numa side wrong by being
too smart about power placement)

2013-07-10 13:11:02

by Arjan van de Ven

Subject: Re: [RFC][PATCH 8/9] sched: power: Add initial frequency scaling support to power scheduler

On 7/9/2013 8:55 AM, Morten Rasmussen wrote:
> Extends the power scheduler capacity management algorithm to handle
> frequency scaling and provide basic frequency/P-state selection hints
> to the power driver.
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> CC: Ingo Molnar <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Catalin Marinas <[email protected]>
> ---
> kernel/sched/power.c | 33 ++++++++++++++++++++++++++++-----
> 1 file changed, 28 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/power.c b/kernel/sched/power.c
> index 9e44c0e..5fc32b0 100644
> --- a/kernel/sched/power.c
> +++ b/kernel/sched/power.c
> @@ -21,6 +21,8 @@
>
> #define INTERVAL 5 /* ms */
> #define CPU_FULL 90 /* Busy %-age - TODO: Make tunable */
> +#define CPU_TARGET 80 /* Target busy %-age - TODO: Make tunable */
> +#define CPU_EMPTY 5 /* Idle noise %-age - TODO: Make tunable */
>

to be honest, this is the policy part that really should be in the hardware specific driver
and not in the scheduler.
(even if said driver is sort of a "generic library" kind of thing)


2013-07-11 11:37:16

by Preeti U Murthy

Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

Hi Morten,

I have a few quick comments.

On 07/09/2013 10:28 PM, Arjan van de Ven wrote:
> On 7/9/2013 8:55 AM, Morten Rasmussen wrote:
>> Hi,
>>
>> This patch set is an initial prototype aiming at the overall power-aware
>> scheduler design proposal that I previously described
>> <http://permalink.gmane.org/gmane.linux.kernel/1508480>.
>>
>> The patch set introduces a cpu capacity managing 'power scheduler'
>> which lives
>> by the side of the existing (process) scheduler. Its role is to
>> monitor the
>> system load and decide which cpus that should be available to the process
>> scheduler. Long term the power scheduler is intended to replace the
>> currently
>> distributed uncoordinated power management policies and will interface a
>> unified platform specific power driver obtain power topology
>> information and
>> handle idle and P-states. The power driver interface should be made
>> flexible
>> enough to support multiple platforms including Intel and ARM.
>>
> I quickly browsed through it but have a hard time seeing what the
> real interface is between the scheduler and the hardware driver.
> What information does the scheduler give the hardware driver exactly?
> e.g. what does it mean?
>
> If the interface is "go faster please" or "we need you to be at fastest
> now",
> that doesn't sound too bad.
> But if the interface is "you should be at THIS number" that is pretty
> bad and
> not going to work for us.
>
> also, it almost looks like there is a fundamental assumption in the code
> that you can get the current effective P state to make scheduler
> decisions on;
> on Intel at least that is basically impossible... and getting more so
> with every generation
> (likewise for AMD afaics)

I am also concerned about the scheduler making its load balancing decisions
based on the cpu frequency, because it could create an
imbalance in the load across cpus.

The scheduler could keep loading a cpu because its cpu frequency keeps
increasing, and keep un-loading a cpu because its cpu frequency keeps
decreasing, the increase and decrease being an effect of the load
itself. This is of course assuming that the driver would make its
decisions proportional to the cpu load. There could be many more
complications if the driver makes its decisions based on factors unknown to
the scheduler.

Therefore my suggestion is that we should simply have the scheduler
ask for an increase/decrease in the frequency and leave it at that.

Secondly, I think we should spend more time on when, in your patchset, the
call to the frequency driver should be made with the CPU frequency change
that the scheduler wishes to request. The reason being that the whole effort
of integrating the knowledge of cpu frequency statistics into the scheduler
is being done so that the scheduler can call the frequency driver at times
*complementing* load balancing, unlike now.

Also adding Rafael to the cc list.

Regards
Preeti U Murthy

2013-07-12 12:46:06

by Morten Rasmussen

Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Wed, Jul 10, 2013 at 02:05:00PM +0100, Arjan van de Ven wrote:
>
> >
> >>
> >> also, it almost looks like there is a fundamental assumption in the code
> >> that you can get the current effective P state to make scheduler decisions on;
> >> on Intel at least that is basically impossible... and getting more so with every generation
> >> (likewise for AMD afaics)
> >>
> >> (you can get what you ran at on average over some time in the past, but not
> >> what you're at now or going forward)
> >>
> >
> > As described above, it is not a strict assumption. From a scheduler
> > point of view we somehow need to know if the cpus are truly fully
> > utilized (at their highest P-state)
>
> unfortunately we can't provide this on Intel ;-(
> we can provide you what you ran at average, we cannot provide you if that is the max or not
>
> (first of all, because we outright don't know what the max would have been, and second,
> because we may be running slower than max because the workload was memory bound or
> any of the other conditions that makes the HW P state "governor" decide to reduce
> frequency for efficiency reasons)

I have had a quick look at intel_pstate.c and to me it seems that it can
be turned into a power driver that uses the proposed interface with a
few modifications. intel_pstate.c already has max and min P-state as
well as a current P-state calculated using the aperf/mperf ratio. I
think these are quite similar to what we need for the power
scheduler/driver. The aperf/mperf ratio can approximate the current
'power'. Max 'power' can be derived in two ways: either use the
highest non-turbo P-state or the highest available turbo P-state.

In the first case, the power scheduler would not know about turbo mode
and never request it. Turbo mode could still be used by the power driver
as a hidden bonus when power scheduler requests max power.

In the second approach, the power scheduler may request power (P-state)
that can only be provided by a turbo P-state. Since we cannot be
guaranteed to get that, the power driver would return the power
(P-state) that is guaranteed (or at least very likely). That is, the
highest non-turbo P-state. That approach seems better to me and also
somewhat similar to what is done in intel_pstate.c (if I understand it
correctly).
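
As a rough sketch of the mapping I have in mind (hypothetical helper; it
assumes the aperf/mperf deltas sampled since the last query plus the base and
chosen max frequencies in kHz, and the real intel_pstate code differs in
detail):

#include <linux/math64.h>
#include <linux/types.h>

/*
 * aperf/mperf gives the average delivered frequency relative to the base
 * (guaranteed) frequency; rescale it to the 0..1024 'power' range relative
 * to the chosen maximum P-state.
 */
static unsigned long avg_power_from_aperf_mperf(u64 aperf_delta,
						u64 mperf_delta,
						unsigned int base_khz,
						unsigned int max_khz)
{
	if (!mperf_delta || !max_khz)
		return 0;

	return div64_u64(aperf_delta * base_khz * 1024ULL,
			 mperf_delta * (u64)max_khz);
}

Whether max_khz is the highest guaranteed or the highest turbo frequency is
exactly the choice between the two approaches above.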

I'm not an expert on Intel power management, so I may be missing
something.

I understand that the difference between highest guaranteed P-state and
highest potential P-state is likely to increase in the future. Without
any feedback about what potential P-state we can approximately get, we
can only pack tasks until we hit the load that can be handled at the
highest guaranteed P-state. Are you (Intel) considering any new feedback
mechanisms for this?

I believe that there already is a power limit notification mechanism on
Intel that can notify the OS when the firmware chooses a lower P-state
than the one requested by the OS.

You (or Rafael) mentioned in our previous discussion that you are
working on an improved intel_pstate driver. Will that be fundamentally
different from the current one?

> > so we need to throw more cpus at the
> > problem (assuming that we have more than one task per cpu) or if we can
> > just go to a higher P-state. We don't need a strict guarantee that we
> > get exactly the P-state that we request for each cpu. The power
> > scheduler generates hints and the power driver gives us feedback on what
> > we can roughly expect to get.
>
>
> >
> >> I'm rather nervous about calculating how many cores you want active as a core scheduler feature.
> >> I understand that for your big.LITTLE architecture you need this due to the asymmetry,
> >> but as a general rule for more symmetric systems it's known to be suboptimal by quite a
> >> real percentage. For a normal Intel single CPU system it's sort of the worst case you can do
> >> in that it leads to serializing tasks that could have run in parallel over multiple cores/threads.
> >> So at minimum this kind of logic must be enabled/disabled based on architecture decisions.
> >
> > Packing clearly has to take power topology into account and do the right
> > thing for the particular platform. It is not in place yet, but will be
> > addressed. I believe it would make sense for dual cpu Intel systems to
> > pack at socket level?
>
> a little bit. if you have 2 quad core systems, it will make sense to pack 2 tasks
> onto a single core, assuming they are not cache or memory bandwidth bound (remember this is numa!)
> but if you have 4 tasks, it's not likely to be worth it to pack, unless you get an enormous
> economy of scale due to cache sharing
> (this is far more about getting numa balancing right than about power; you're not very likely
> to win back the power you loose from inefficiency if you get the numa side wrong by being
> too smart about power placement)

I agree that packing is not a good idea for cache or memory bound tasks.
It is not any different on dual cluster ARM setups like big.LITTLE. But,
we do see a lot of benefit in packing small tasks which are not cache or
memory bound, or performance critical. Keeping them on as few cpus as
possible means that the rest can enter deeper C-states for longer.

BTW. Packing one strictly memory bound task and one strictly cpu bound
task on one socket might work. The only problem is to determine the task
characteristics ;-)

2013-07-12 12:51:04

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/9] sched: power: Add initial frequency scaling support to power scheduler

On Wed, Jul 10, 2013 at 02:10:59PM +0100, Arjan van de Ven wrote:
> On 7/9/2013 8:55 AM, Morten Rasmussen wrote:
> > Extends the power scheduler capacity management algorithm to handle
> > frequency scaling and provide basic frequency/P-state selection hints
> > to the power driver.
> >
> > Signed-off-by: Morten Rasmussen <[email protected]>
> > CC: Ingo Molnar <[email protected]>
> > CC: Peter Zijlstra <[email protected]>
> > CC: Catalin Marinas <[email protected]>
> > ---
> > kernel/sched/power.c | 33 ++++++++++++++++++++++++++++-----
> > 1 file changed, 28 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/sched/power.c b/kernel/sched/power.c
> > index 9e44c0e..5fc32b0 100644
> > --- a/kernel/sched/power.c
> > +++ b/kernel/sched/power.c
> > @@ -21,6 +21,8 @@
> >
> > #define INTERVAL 5 /* ms */
> > #define CPU_FULL 90 /* Busy %-age - TODO: Make tunable */
> > +#define CPU_TARGET 80 /* Target busy %-age - TODO: Make tunable */
> > +#define CPU_EMPTY 5 /* Idle noise %-age - TODO: Make tunable */
> >
>
> to be honest, this is the policy part that really should be in the hardware specific driver
> and not in the scheduler.
> (even if said driver is sort of a "generic library" kind of thing)

I agree that the values should be set by a hardware specific power
driver. Or do you mean that algorithms using this sort of value should
be in the driver?

2013-07-12 13:02:00

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Wed, Jul 10, 2013 at 02:05:00PM +0100, Arjan van de Ven wrote:
> >> also, it almost looks like there is a fundamental assumption in the
> >> code that you can get the current effective P state to make
> >> scheduler decisions on; on Intel at least that is basically
> >> impossible... and getting more so with every generation (likewise
> >> for AMD afaics)
> >>
> >> (you can get what you ran at on average over some time in the past,
> >> but not what you're at now or going forward)
> >
> > As described above, it is not a strict assumption. From a scheduler
> > point of view we somehow need to know if the cpus are truly fully
> > utilized (at their highest P-state)
>
> unfortunately we can't provide this on Intel ;-(
> we can provide you what you ran at average, we cannot provide you if
> that is the max or not
>
> (first of all, because we outright don't know what the max would have
> been, and second, because we may be running slower than max because
> the workload was memory bound or any of the other conditions that
> makes the HW P state "governor" decide to reduce frequency for
> efficiency reasons)

I guess even if we have a constant CPU frequency (no turbo boost), we
still don't have a simple relation between the load as seen by the
scheduler and the CPU frequency (for reasons that you mentioned above
like memory-bound tasks).

But on x86 you still have a P-state hint for the CPU and the scheduler
could at least hope for more CPU performance. We can make the power
scheduler ask the power driver for an increase or decrease of
performance (as Preeti suggested) and give it the current load as
argument rather than a precise performance/frequency level. The power
driver would change the P-state accordingly and take the load into
account (or ignore it, something like intel_pstate.c can do its own
aperf/mperf tracking). But the power driver will inform the scheduler
that it can't change the P-state further and the power scheduler can
decide to spread the load out to other CPUs.
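
As a very rough sketch of that request/grant flow (the function name,
limits and return convention are all invented here, not taken from any
driver):

#include <linux/errno.h>

static int cur_pstate[NR_CPUS];
static const int min_pstate = 8, max_pstate = 24;

/*
 * delta > 0 means "go faster", delta < 0 means "go slower"; 'load' is a
 * hint the driver may use or simply ignore. Returning -ERANGE tells the
 * power scheduler that the P-state cannot be pushed further, so it can
 * decide to spread the load to other CPUs instead.
 */
static int power_driver_perf_change(int cpu, int delta, int load)
{
	int new = cur_pstate[cpu] + delta;

	if (new > max_pstate || new < min_pstate)
		return -ERANGE;

	cur_pstate[cpu] = new;
	/* a real driver would program the hardware here */
	return 0;
}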

--
Catalin

2013-07-12 13:07:01

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/9] sched: power: Add initial frequency scaling support to power scheduler

On Fri, Jul 12, 2013 at 01:51:13PM +0100, Morten Rasmussen wrote:
> On Wed, Jul 10, 2013 at 02:10:59PM +0100, Arjan van de Ven wrote:
> > On 7/9/2013 8:55 AM, Morten Rasmussen wrote:
> > > Extends the power scheduler capacity management algorithm to handle
> > > frequency scaling and provide basic frequency/P-state selection hints
> > > to the power driver.
> > >
> > > Signed-off-by: Morten Rasmussen <[email protected]>
> > > CC: Ingo Molnar <[email protected]>
> > > CC: Peter Zijlstra <[email protected]>
> > > CC: Catalin Marinas <[email protected]>
> > > ---
> > > kernel/sched/power.c | 33 ++++++++++++++++++++++++++++-----
> > > 1 file changed, 28 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/kernel/sched/power.c b/kernel/sched/power.c
> > > index 9e44c0e..5fc32b0 100644
> > > --- a/kernel/sched/power.c
> > > +++ b/kernel/sched/power.c
> > > @@ -21,6 +21,8 @@
> > >
> > > #define INTERVAL 5 /* ms */
> > > #define CPU_FULL 90 /* Busy %-age - TODO: Make tunable */
> > > +#define CPU_TARGET 80 /* Target busy %-age - TODO: Make tunable */
> > > +#define CPU_EMPTY 5 /* Idle noise %-age - TODO: Make tunable */
> > >
> >
> > to be honest, this is the policy part that really should be in the hardware specific driver
> > and not in the scheduler.
> > (even if said driver is sort of a "generic library" kind of thing)
>
> I agree that the values should be set by a hardware specific power
> driver. Or do you mean that algorithms using this sort of values should
> be in the driver?

I think for flexibility we could place the default algorithm in a
library and it would be used by the cpufreq power driver wrapper or
directly by a new power driver. The intel_pstate.c driver could be
allowed to do smarter things.

--
Catalin

2013-07-12 13:32:17

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Tue, Jul 09, 2013 at 05:58:55PM +0100, Arjan van de Ven wrote:
> On 7/9/2013 8:55 AM, Morten Rasmussen wrote:
> > This patch set is an initial prototype aiming at the overall power-aware
> > scheduler design proposal that I previously described
> > <http://permalink.gmane.org/gmane.linux.kernel/1508480>.
> >
> > The patch set introduces a cpu capacity managing 'power scheduler' which lives
> > by the side of the existing (process) scheduler. Its role is to monitor the
> > system load and decide which cpus that should be available to the process
> > scheduler. Long term the power scheduler is intended to replace the currently
> > distributed uncoordinated power management policies and will interface a
> > unified platform specific power driver obtain power topology information and
> > handle idle and P-states. The power driver interface should be made flexible
> > enough to support multiple platforms including Intel and ARM.
...
> I'm rather nervous about calculating how many cores you want active as
> a core scheduler feature. I understand that for your big.LITTLE
> architecture you need this due to the asymmetry, but as a general rule
> for more symmetric systems it's known to be suboptimal by quite a real
> percentage. For a normal Intel single CPU system it's sort of the
> worst case you can do in that it leads to serializing tasks that could
> have run in parallel over multiple cores/threads. So at minimum this
> kind of logic must be enabled/disabled based on architecture
> decisions.

As Morten already stated, we *think* this is beneficial for symmetric
multi-socket (multi-cluster, multi-core or whatever other name) systems
as well. The only thing that big.LITTLE requires is that we want to
favour little CPUs when the load is not too high. But even if they were
symmetric (big.big is not unlikely, though for different markets), we
still want to pack tasks on a single cluster if it has enough compute
capacity so that the other cluster can go into deeper sleep state.
Basically we don't want 5 tasks to use 5 CPUs when 4 (or fewer) would
suffice.

So apart from intel_pstate.c improvements (which look really nice), my
guess is that Intel also has an interest in scheduler changes for power
reasons (my guess is based on the work done by Alex Shi).

If not (IOW all you need is the intel_pstate.c driver), the proposed
power scheduler will have two policies anyway: power and performance.
The latter would only improve on the current (performance) behaviour and
will allow the load balancing to equally use all the CPUs. A modified
intel_pstate.c driver could benefit from extra hints from the power
scheduler (like CPU load) or can simply ignore them. The scheduler will
also benefit by not migrating a task unnecessarily if the pstate driver
can switch to a higher P-state (I'm not convinced the 10ms load tracking in
the intel_pstate.c driver is fast enough, especially since it integrates
the load over multiple such periods).

--
Catalin

2013-07-12 13:48:06

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Thu, Jul 11, 2013 at 12:34:49PM +0100, Preeti U Murthy wrote:
> Hi Morten,
>
> I have a few quick comments.
>
> On 07/09/2013 10:28 PM, Arjan van de Ven wrote:
> > On 7/9/2013 8:55 AM, Morten Rasmussen wrote:
> >> Hi,
> >>
> >> This patch set is an initial prototype aiming at the overall power-aware
> >> scheduler design proposal that I previously described
> >> <http://permalink.gmane.org/gmane.linux.kernel/1508480>.
> >>
> >> The patch set introduces a cpu capacity managing 'power scheduler'
> >> which lives
> >> by the side of the existing (process) scheduler. Its role is to
> >> monitor the
> >> system load and decide which cpus that should be available to the process
> >> scheduler. Long term the power scheduler is intended to replace the
> >> currently
> >> distributed uncoordinated power management policies and will interface a
> >> unified platform specific power driver obtain power topology
> >> information and
> >> handle idle and P-states. The power driver interface should be made
> >> flexible
> >> enough to support multiple platforms including Intel and ARM.
> >>
> > I quickly browsed through it but have a hard time seeing what the
> > real interface is between the scheduler and the hardware driver.
> > What information does the scheduler give the hardware driver exactly?
> > e.g. what does it mean?
> >
> > If the interface is "go faster please" or "we need you to be at fastest
> > now",
> > that doesn't sound too bad.
> > But if the interface is "you should be at THIS number" that is pretty
> > bad and
> > not going to work for us.
> >
> > also, it almost looks like there is a fundamental assumption in the code
> > that you can get the current effective P state to make scheduler
> > decisions on;
> > on Intel at least that is basically impossible... and getting more so
> > with every generation
> > (likewise for AMD afaics)
>
> I am concerned too about scheduler making its load balancing decisions
> based on the cpu frequency for the reason that it could create an
> imbalance in the load across cpus.
>
> Scheduler could keep loading a cpu, because its cpu frequency goes on
> increasing, and it could keep un-loading a cpu because its cpu frequency
> goes on decreasing. This increase and decrease as an effect of the load
> itself. This is of course assuming that the driver would make its
> decisions proportional to the cpu load. There could be many more
> complications, if the driver makes its decisions on factors unknown to
> the scheduler.
>
> Therefore my suggestion is that we should simply have the scheduler
> asking for increase/decrease in the frequency and letting it at that.

If I understand correctly your concern is about the effect of frequency
scaling on load-balancing when using tracked load (PJT's) for the task
loads as it is done in Alex Shi's patches.

That problem is present even with the existing cpufreq governors and has
not been addressed yet. Tasks on cpus at low frequencies appear bigger
since they run longer, which will cause the load-balancer to think the
cpu is loaded and move tasks to other cpus. That will cause cpufreq to
lower the frequency of that cpu and make any remaining tasks look even
bigger. The story repeats itself.

One might be tempted to suggest to use arch_scale_freq_power to tell the
load-balancer about frequency scaling. But in its current form it will
actually make it worse, as cpu_power is currently used to indicate max
compute capacity and not the current one.

I don't understand how a simple up/down request from the scheduler would
solve that problem. It would just make frequency scaling slower if you
only go up or down one step at a time. Much like the existing
conservative cpufreq governor that nobody uses. Maybe I am missing
something?

I think we should look into scaling the tracked load by some metric that
represents the current performance of the cpu whenever the tracked load
is updated as it was suggested by Arjan in our previous discussion. I
included it in my power scheduler design proposal, but I haven't done
anything about it yet.
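
As a placeholder for that idea, the scaling could look roughly like the
helper below. arch_scale_curr_capacity() is a made-up hook returning the
cpu's current performance in the usual 0..1024 fixed-point range; nothing
here is existing code:

/* Default: assume full capacity; a power/frequency driver would override. */
static inline unsigned long arch_scale_curr_capacity(int cpu)
{
	return 1024;
}

/*
 * Scale a runnable time delta by the cpu's current performance before it
 * is accumulated, so a task running on a slowed-down cpu does not look
 * artificially big to the load tracking code.
 */
static inline u64 scale_runnable_delta(u64 delta, int cpu)
{
	return (delta * arch_scale_curr_capacity(cpu)) >> 10;
}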

In short, I agree that there is a problem around load-balancing and
frequency scaling that needs to be fixed. Without Alex's patches the
problem is not present as task load doesn't depend on the cpu load of the
task.

> Secondly, I think we should spend more time on when to make a call to
> the frequency driver in your patchset regarding the change in the
> frequency of the CPU, the scheduler wishes to request. The reason being,
> the whole effort of integrating the knowledge of cpu frequency
> statistics into the scheduler is being done because the scheduler can
> call the frequency driver at times *complimenting* load balancing,
> unlike now.

I don't think I get your point here. The current policy in this patch
set is just a prototype that should be improved. The power scheduler
does complement the load-balancer already by asking for frequency
changes as the cpu load changes.

>
> Also adding Rafael to the cc list.
>

Thanks.

Morten

2013-07-12 15:36:04

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On 7/12/2013 5:46 AM, Morten Rasmussen wrote:


> I have had a quick look at intel_pstate.c and to me it seems that it can
> be turned into a power driver that uses the proposed interface with a
> few modifications. intel_pstate.c already has max and min P-state as
> well as a current P-state calculated using the aperf/mperf ratio. I

it calculates average frequency... not current p state.
first of all, it's completely and strictly backwards looking
(and in the light of this being used in a load balancing decision,
the past is NOT a predictor for the future since you're about to change the maximum)
and second, in the light of having idle time... you do not get what you
think you get ;-)


>
> In the first case, the power scheduler would not know about turbo mode
> and never request it. Turbo mode could still be used by the power driver
> as a hidden bonus when power scheduler requests max power.

but what do you do when you ask for low power? On Intel.. for various cases,
you also pick a high P state!

(the assumption "low P state == low power" and "high P state == high power"
is just not valid)


>
> In the second approach, the power scheduler may request power (P-state)
> that can only be provided by a turbo P-state. Since we cannot be
> guaranteed to get that, the power driver would return the power
> (P-state) that is guaranteed (or at least very likely)

even non-turbo is very likely to not be achievable in various very
common situations. Two years ago I would have said, sure, but today,
it's just not the case anymore.

> I understand that the difference between highest guaranteed P-state and
> highest potential P-state is likely to increase in the future. Without
> any feedback about what potential P-state we can approximately get, we
> can only pack tasks until we hit the load that can be handled at the
> highest guaranteed P-state.

the only highest guaranteed P state is... the lowest P state. Sorry.
Everything else is subject to thermal management and hardware policies.


> I believe that there already is a power limit notification mechanism on
> Intel that can notify the OS when the firmware chooses a lower P-state
> than the one requested by the OS.

and we turn that off to avoid interrupt floods.....


> You (or Rafael) mentioned in our previous discussion that you are
> working on an improved intel_pstate driver. Will that be fundamentally
> different from the current one?

yes.
the hardware has been changing, and will be changing more (at a faster rate),
and we'll have very different algorithms for the different generations.

For example, for the recently launched client Haswell (think Ultrabook) the
system idle power is going down about 20 times compared to the previous generation (e.g.
what you'd buy a month ago).
With that change, the rules about when to go fast and not are changing dramatically....
since going faster means you'll go to the low power faster (even on previous generations that
effect is there, but with lower power in idle, this just gets stronger).

> I agree that packing is not a good idea for cache or memory bound tasks.
> It is not any different on dual cluster ARM setups like big.LITTLE. But,
> we do see a lot of benefit in packing small tasks which are not cache or
> memory bound, or performance critical. Keeping them on as few cpus as
> possible means that the rest can enter deeper C-states for longer.

I totally agree with the idea of *statistically* grouping short running tasks.
But... this can be done VERY simply without such explicit "how many do we need".
All you need to do is a statistical "sort left", e.g. if a short running task
wants to run (that by definition has not run for a while, so is cache cold anyway),
make it prefer the lowest numbered idle cpu to wake up on.
Heck, even making it just prefer only cpu 0 when it's idle will by and large already achieve
this.
Remember that you don't have to be perfect; no point trying to move tasks that never run in your
management time window; only the ones that actually want to run need management.
And at the "I want to run" time, you can just sort it left.
(and this is fine for tasks that run short; all the numa/etc logic value kicks in for tasks that do
some serious amounts of work and thus by definition run for longer stretches)

What you don't want to do, is run tasks sequentially that could have run in parallel. That's the best
way to destroy power efficiency in multicore systems ;-(

And to be honest, the effect of per logical CPU C states is much smaller on Intel than the effect
of global idle (in Intel terms, "package C states"). The break even points of CPU core states are
extremely short for us, even for the deepest states. The bigger bang for the buck is with system wide
idle, so that memory can go to self refresh (and the memory controllers/etc can be turned off).
The break even point for those kind of things is longer, and that's where wakeups/etc make a much bigger dent.



> BTW. Packing one strictly memory bound task and one strictly cpu bound
> task on one socket might work. The only problem is to determine the task
> charateristics ;-)

yeah "NUMA is hard, lets go shopping" for sure.

2013-07-12 15:37:51

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/9] sched: power: Add initial frequency scaling support to power scheduler

On 7/12/2013 5:51 AM, Morten Rasmussen wrote:
> On Wed, Jul 10, 2013 at 02:10:59PM +0100, Arjan van de Ven wrote:
>> On 7/9/2013 8:55 AM, Morten Rasmussen wrote:
>>> Extends the power scheduler capacity management algorithm to handle
>>> frequency scaling and provide basic frequency/P-state selection hints
>>> to the power driver.
>>>
>>> Signed-off-by: Morten Rasmussen <[email protected]>
>>> CC: Ingo Molnar <[email protected]>
>>> CC: Peter Zijlstra <[email protected]>
>>> CC: Catalin Marinas <[email protected]>
>>> ---
>>> kernel/sched/power.c | 33 ++++++++++++++++++++++++++++-----
>>> 1 file changed, 28 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/kernel/sched/power.c b/kernel/sched/power.c
>>> index 9e44c0e..5fc32b0 100644
>>> --- a/kernel/sched/power.c
>>> +++ b/kernel/sched/power.c
>>> @@ -21,6 +21,8 @@
>>>
>>> #define INTERVAL 5 /* ms */
>>> #define CPU_FULL 90 /* Busy %-age - TODO: Make tunable */
>>> +#define CPU_TARGET 80 /* Target busy %-age - TODO: Make tunable */
>>> +#define CPU_EMPTY 5 /* Idle noise %-age - TODO: Make tunable */
>>>
>>
>> to be honest, this is the policy part that really should be in the hardware specific driver
>> and not in the scheduler.
>> (even if said driver is sort of a "generic library" kind of thing)
>
> I agree that the values should be set by a hardware specific power
> driver. Or do you mean that algorithms using this sort of values should
> be in the driver?

the latter.
the algorithm you have makes assumptions about how the hardware behaves to a degree that
is really problematic. It sort of seems to resemble the ondemand kind of thing....
... and there are some very good reasons we ran away screaming from ondemand for Intel.

>

2013-07-12 15:44:17

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal


> But on x86 you still have a P-state hint for the CPU and the scheduler
> could at least hope for more CPU performance. We can make the power
> scheduler ask the power driver for an increase or decrease of
> performance (as Preeti suggested) and give it the current load as
> argument rather than a precise performance/frequency level. The power
> driver would change the P-state accordingly and take the load into
> account (or ignore it, something like intel_pstate.c can do its own
> aperf/mperf tracking). But the power driver will inform the scheduler
> that it can't change the P-state further and the power scheduler can
> decide to spread the load out to other CPUs.


I am completely fine with an interface that is something like

void arch_please_go_faster(int cpunr);
void arch_please_go_fastest(int cpunr);
int arch_can_you_go_faster_than_now(int cpunr);

(maybe without the arguments and only for the local cpu, which would surely
make the implementation simpler)

with the understanding that these are instant requests (e.g. longer term policy will
clobber requests eventually).

it makes total sense to me for the scheduler to indicate "I need performance NOW".
Either when it sees it's on the verge of needing to load balance, or when it is about to schedule
a high priority (think realtime) task.

Part of the reason I like such an interface is that it is a higher level one; it's a clear and high level enough
policy request that the hardware driver can translate into a hardware specific thing.

An interface that would be "put it at THIS much" is not. It's too low level and makes assumptions about
hardware things that change between generations/vendors that the scheduler really shouldn't know about.
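
For reference, a minimal sketch of how hooks with those names could be
stubbed out as overridable no-ops (the signatures follow the names above;
none of this is existing kernel API):

/* Default no-op hooks; a platform power driver can override them. */
void __weak arch_please_go_faster(int cpunr)
{
}

void __weak arch_please_go_fastest(int cpunr)
{
}

int __weak arch_can_you_go_faster_than_now(int cpunr)
{
	return 0;	/* pessimistic default: assume no headroom */
}

The scheduler could then, for example, call arch_please_go_fastest() on
the local cpu just before it would otherwise start load balancing, or when
a realtime task is about to run.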

2013-07-13 06:50:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Tue, Jul 09, 2013 at 04:55:29PM +0100, Morten Rasmussen wrote:
> Hi,
>
> This patch set is an initial prototype aiming at the overall power-aware
> scheduler design proposal that I previously described
> <http://permalink.gmane.org/gmane.linux.kernel/1508480>.
>
> The patch set introduces a cpu capacity managing 'power scheduler' which lives
> by the side of the existing (process) scheduler. Its role is to monitor the
> system load and decide which cpus that should be available to the process
> scheduler.

Hmm...

This looks like a userspace hotplug daemon approach lifted to kernel space :/

How about instead of layering over the load-balancer to constrain its behaviour
you change the behaviour to not need constraint? Fix it so it does the right
thing, instead of limiting it.

I don't think it's _that_ hard to make the balancer do packing over spreading.
The power balance code removed in 8e7fbcbc had things like that (although it
was broken). And I'm sure I've seen patches over the years that did similar
things. Didn't Vincent and Alex also do things like that?

We should take the good bits from all that and make something of it. And I
think it's easier now that we have the per task and per rq utilization numbers
[1].

Just start by changing the balancer to pack instead of spread. Once that works,
see where the two modes diverge and put a knob in.

Then worry about power thingies.


[1] -- I realize that the utilization numbers are actually influenced by
cpufreq state. Fixing this is another possible first step. I think it could be
done independently of the larger picture of a power aware balancer.


You also introduce a power topology separate from the topology information we
already have. Please integrate with the existing topology information so that
it's a single entity.


The integration of cpuidle and cpufreq should start by unifying all the
statistics stuff. For cpuidle we need to pull in the per-cpu idle time
guestimator. For cpufreq the per-cpu usage stuff -- which we already have in
the scheduler these days!

Once we have all the statistics in place, it's also easier to see what we can do
with them and what might be missing.

At this point mandate that policy drivers may not do any statistics gathering
of their own. If they feel the need to do so, we're missing something and
that's not right.

For the actual policies we should build a small library of concepts that can be
quickly composed to form an actual policy. Such that when two chips need
similar things they do indeed use the same code and not a copy with different
bugs. If there's only a single arch user of a concept that's fine, but at least
it's out in the open and ready for re-use. Not hidden away in arch code.


Then we can start doing fancy stuff like fairness when constrained by power or
thermal envelopes. We'll need input from the GPU etc. for that. And the wildly
asymmetric thing you're interested in :-)


I'm not entirely sold on differentiating between short running and other tasks
either. Although I suppose I see where that comes from. A task that would run
50% on a big core would be unlikely to qualify as small; however, if it would
require 85% of a small core and there's room on the small cores it's a good move
to run it there.

So where's the limit for being small? It seems like an artificial limit and
such should be avoided where possible.


Arjan; from reading your emails you're mostly busy explaining what cannot be
done. Please explain what _can_ be done and what Intel wants. From what I can
see you basically promote a max P state max concurrency race to idle FTW.

Since you can't say what the max P state is; and I think I understand the
reasons for that, and the hardware might not even respect the P state you tell
it to run at, does it even make sense to talk about Intel P states? When would
you not program the max P state?

In such a case the aperf/mperf ratio [2] gives both the current freq and the max
freq, since you're effectively always going at max speed.

[2] aperf/mperf ratio with an idle filter, we should exclude idle time.

IIRC you at one point said there was a time limit below which concurrency
spread wasn't useful anymore?

Also, most of what you say is about single socket systems; what does Intel want for
multi-socket systems?

2013-07-13 10:23:57

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

Hi Peter,

(Morten's away for a week, I'll try cover some bits in the meantime)

On Sat, Jul 13, 2013 at 07:49:09AM +0100, Peter Zijlstra wrote:
> On Tue, Jul 09, 2013 at 04:55:29PM +0100, Morten Rasmussen wrote:
> > This patch set is an initial prototype aiming at the overall power-aware
> > scheduler design proposal that I previously described
> > <http://permalink.gmane.org/gmane.linux.kernel/1508480>.
> >
> > The patch set introduces a cpu capacity managing 'power scheduler' which lives
> > by the side of the existing (process) scheduler. Its role is to monitor the
> > system load and decide which cpus that should be available to the process
> > scheduler.
>
> Hmm...
>
> This looks like a userspace hotplug deamon approach lifted to kernel space :/

The difference is that this is faster. We even had hotplug in mind some
years ago for big.LITTLE but it wouldn't give the performance we need
(hotplug is incredibly slow even if driven from the kernel).

> How about instead of layering over the load-balancer to constrain its behaviour
> you change the behaviour to not need constraint? Fix it so it does the right
> thing, instead of limiting it.
>
> I don't think its _that_ hard to make the balancer do packing over spreading.
> The power balance code removed in 8e7fbcbc had things like that (although it
> was broken). And I'm sure I've seen patches over the years that did similar
> things. Didn't Vincent and Alex also do things like that?
>
> We should take the good bits from all that and make something of it. And I
> think its easier now that we have the per task and per rq utilization numbers
> [1].

That's what we've been pushing for. From a big.LITTLE perspective, I
would probably vote for Vincent's patches but I guess we could probably
adapt any of the other options.

But then we got Ingo NAK'ing all these approaches. Taking the best bits
from the current load balancing patches would create yet another set of
patches which don't fall under Ingo's requirements (at least as I
understand them).

> Just start by changing the balancer to pack instead of spread. Once that works,
> see where the two modes diverge and put a knob in.

That's the approach we've had so far (not sure about the knob). But it
doesn't solve Ingo's complaint about fragmentation between scheduler,
cpufreq and cpuidle policies.

> Then worry about power thingies.

To quote Ingo: "To create a new low level idle driver mechanism the
scheduler could use and integrate proper power saving / idle policy into
the scheduler."

That's unless we all agree (including Ingo) that the above requirement
is orthogonal to task packing and, as a *separate* project, we look at
better integrating the cpufreq/cpuidle with the scheduler, possibly with
a new driver model and governors as libraries used by such drivers. In
which case the current packing patches shouldn't be NAK'ed but reviewed
so that they can be improved further or rewritten.

> The integration of cpuidle and cpufreq should start by unifying all the
> statistics stuff. For cpuidle we need to pull in the per-cpu idle time
> guestimator. For cpufreq the per-cpu usage stuff -- which we already have in
> the scheduler these days!
>
> Once we have all the statistics in place, its also easier to see what we can do
> with them and what might be missing.
>
> At this point mandate that policy drivers may not do any statistics gathering
> of their own. If they feel the need to do so, we're missing something and
> that's not right.

I agree in general but there is the intel_pstate.c driver which has its
own separate statistics that the scheduler does not track. We could move
to invariant task load tracking which uses aperf/mperf (and could do
similar things with perf counters on ARM). As I understand from Arjan,
the new pstate driver will be different, so we don't know exactly what
it requires.

> I'm not entirely sold on differentiating between short running and other tasks
> either. Although I suppose I see where that comes from. A task that would run
> 50% on a big core would unlikely be qualified as small, however if it would
> require 85% of a small core and there's room on the small cores its a good move
> to run it there.
>
> So where's the limit for being small? It seems like an artificial limit and
> such should be avoided where possible.

I agree. With Morten's approach, it doesn't matter how small a task
is; rather, when a CPU (or cluster) is loaded beyond a certain threshold,
tasks just spread to the next one. I think a small-task threshold on its own
doesn't make much sense if you have lots of such 'small' tasks, so you
need a view of the overall load or a more dynamic threshold.

--
Catalin

2013-07-13 14:40:12

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On 7/12/2013 11:49 PM, Peter Zijlstra wrote:
>
> Arjan; from reading your emails you're mostly busy explaining what cannot be
> done. Please explain what _can_ be done and what Intel wants. From what I can
> see you basically promote a max P state max concurrency race to idle FTW.

>
> Since you can't say what the max P state is; and I think I understand the
> reasons for that, and the hardware might not even respect the P state you tell
> it to run at, does it even make sense to talk about Intel P states? When would
> you not program the max P state?

this is where it gets complicated ;-(
the race-to-idle depends on the type of code that is running: if things are memory bound it's outright
not true, but for compute bound code it often is.

What I would like to see is

1) Move the idle predictor logic into the scheduler, or at least a library
(I'm not sure the scheduler can do better than the current code, but it might,
and what menu does today is at least worth putting in some generic library)

2) An interface between scheduler and P state code in the form of (and don't take the names as actual function names ;-)
void arch_please_go_fastest(void); /* or maybe int cpunr as argument, but that's harder to implement */
int arch_can_you_go_faster(void); /* if the scheduler would like to know this instead of load balancing .. unsure */
unsigned long arch_instructions_executed(void); /* like tsc, but on instructions, so the scheduler can account actual work done */

the first one is for the scheduler to call when it sees a situation of "we care deeply about performance now" coming,
for example near overload, or when a realtime (or otherwise high priority) task gets scheduled.
the second one I am dubious about, but maybe you have a use for it; some folks think that there is value in
deciding to ramp up the performance rather than load balancing. For load balancing to an idle cpu, I don't see that
value (in terms of power efficiency) but I do see a case where your 2 cores happen to be busy (some sort of thundering
herd effect) but imbalanced; in that case going faster rather than rebalance... I can certainly see the point.

3) an interface from the C state hardware driver to the scheduler to say "oh btw, the LLC got flushed, forget about past
cache affinity". The C state driver can sometimes know this.. and linux today tries to keep affinity anyway
while we could get more optimal by being allowed to balance more freely

4) this is the most important one, but like the hardest one:
An interface from the scheduler that says "we are performance sensitive now":
void arch_sched_performance_sensitive(int duration_ms);

I've put a duration as argument, rather than an "arch_no_longer_sensitive", to avoid having the scheduler run some
periodic timer/whatever to keep this; rather it is sort of a "lease", that the scheduler can renew as often as it
wants; but it auto-expires eventually.

with this the hardware and/or hardware drivers can make a performance bias in their decisions based on what
is actually the driving force behind both P and C state decisions: performance sensitivity.
(all this utilization stuff that menu but also the P state drivers try to do is estimating how sensitive we are to
performance, and if we're not sensitive, considering sacrificing some performance for power. Even with race-to-halt,
sometimes sacrificing a little performance gives a power benefit at the top of the range)
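
One very small way to implement such an auto-expiring lease could be the
sketch below (the per-cpu variable and helper names are invented; only
jiffies, msecs_to_jiffies() and time_before() are existing kernel
primitives):

#include <linux/jiffies.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, perf_lease_expiry);

void arch_sched_performance_sensitive(int duration_ms)
{
	this_cpu_write(perf_lease_expiry,
		       jiffies + msecs_to_jiffies(duration_ms));
}

/* P/C-state policy code would check this before trading performance for
 * power; the lease simply times out if the scheduler stops renewing it. */
static bool perf_sensitive_now(int cpu)
{
	return time_before(jiffies, per_cpu(perf_lease_expiry, cpu));
}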

>
> IIRC you at one point said there was a time limit below which concurrency
> spread wasn't useful anymore?

there is a time below which waking up a core (not hyperthread pair, that is ALWAYS worth it since it's insanely cheap)
is not worth it.
Think in the order of "+/- 50 microseconds".


> Also, most what you say for single socket systems; what does Intel want for
> multi-socket systems?

for multisocket, rule number one is "don't screw up numa".
for tasks where numa matters, that's the top priority.
beyond that, experiments seem to show that grouping "a little" helps.
Say on a 2x 4 core system, it's worth running the first 2 tasks on the same package
but after that we need to start considering the 2nd package.
I have to say that we don't have quite enough data yet to figure out where this cutoff is;
most of the microbenchmarks in this have been done with fspin, which by design has zero cache
footprint or memory use... and the whole damage side of grouping (and thus the reason for spreading)
is in sharing of the caches and memory bandwidth.
(if you end up thrashing the cache, the power you burn by losing the efficiency there is not easy to win back
by placement)


2013-07-13 16:14:24

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On 7/12/2013 11:49 PM, Peter Zijlstra wrote:
> On Tue, Jul 09, 2013 at 04:55:29PM +0100, Morten Rasmussen wrote:
>> Hi,
>>
>> This patch set is an initial prototype aiming at the overall power-aware
>> scheduler design proposal that I previously described
>> <http://permalink.gmane.org/gmane.linux.kernel/1508480>.
>>
>> The patch set introduces a cpu capacity managing 'power scheduler' which lives
>> by the side of the existing (process) scheduler. Its role is to monitor the
>> system load and decide which cpus that should be available to the process
>> scheduler.
>
> Hmm...
>
> This looks like a userspace hotplug deamon approach lifted to kernel space :/
>
> How about instead of layering over the load-balancer to constrain its behaviour
> you change the behaviour to not need constraint? Fix it so it does the right
> thing, instead of limiting it.
>
> I don't think its _that_ hard to make the balancer do packing over spreading.
> The power balance code removed in 8e7fbcbc had things like that (although it
> was broken). And I'm sure I've seen patches over the years that did similar
> things. Didn't Vincent and Alex also do things like that?

a basic "sort left" (e.g. when needing to pick a cpu for a task that is short running,
pick the lowest numbered idle one) will already have the effect of packing in practice.
it's not perfect packing, but on a statistical level it'll be quite good.

(this all assumes relatively idle systems with spare capacity to play with of course..
... but that's the domain where packing plays a role)
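
A minimal sketch of such a "sort left" wake-up preference (the helper name
is made up; idle_cpu(), cpu_online_mask and tsk_cpus_allowed() are the
existing primitives of this era):

/*
 * Prefer the lowest-numbered idle cpu the task is allowed to run on, so
 * short-running tasks statistically pile up on the low-numbered cpus and
 * the higher-numbered ones stay idle for longer.
 */
static int select_lowest_idle_cpu(struct task_struct *p, int prev_cpu)
{
	int cpu;

	for_each_cpu_and(cpu, tsk_cpus_allowed(p), cpu_online_mask) {
		if (idle_cpu(cpu))
			return cpu;
	}
	return prev_cpu;
}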



> Arjan; from reading your emails you're mostly busy explaining what cannot be
> done. Please explain what _can_ be done and what Intel wants. From what I can
> see you basically promote a max P state max concurrency race to idle FTW.
>

btw one more thing I'd like to get is a communication between the scheduler
and the policy/hardware drivers about task migration.
When a task migrates to another CPU, the statistics that the hardware/driver/policy
were keeping on that target CPU are really not valid anymore in terms of forward
looking predictive power. A communication (API or notification or whatever form it takes)
around this would be quite helpful.
This could be as simple as just setting a flag on the target cpu (in its rq), so that
at the next power event (exiting idle, P state evaluation, whatever) the policy code
can flush-and-start-over.
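
Something as small as the sketch below would do as a starting point (all
names are invented; a real version would probably hang the flag off
struct rq as suggested above):

static DEFINE_PER_CPU(int, power_stats_stale);

/* Called by the scheduler when a task is migrated to dst_cpu. */
static inline void power_note_migration(int dst_cpu)
{
	per_cpu(power_stats_stale, dst_cpu) = 1;
}

/*
 * Called by the idle/P-state policy code at its next decision point;
 * returns true exactly once after a migration so the policy can discard
 * its per-cpu history and start over.
 */
static inline bool power_stats_need_reset(int cpu)
{
	return xchg(&per_cpu(power_stats_stale, cpu), 0) != 0;
}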


on thinking more about the short running task thing; there is an optimization we currently don't do,
mostly for hyperthreading. (and HT is just one out of a set of cases with similar power behavior)
If we know a task runs briefly AND is not performance critical, it's much much better to place it on
a hyperthreading buddy of an already busy core than it is to place it on an empty core (or to delay it).
Yes a HT pair isn't the same performance as a full core, but in terms of power the 2nd half of a HT pair
is nearly free... so if there's a task that's not performance sensitive (and won't disturb the other task too much,
e.g. runs briefly enough)... it's better to pack onto a core than to spread.
you can generalize this to a class of systems where adding work to a core (read: group of cpus that share resources)
is significantly cheaper than running on a full empty core.

(there is clearly a tradeoff, by sharing resources you also end up reducing performance/efficiency, and that has its
own effect on power, so there is some kind of balance needed and a big enough gain to be worth the loss)
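
As an illustration of the HT-buddy preference, something like the sketch
below could be tried for tasks classified as short and non-critical
(assumes CONFIG_SCHED_SMT; cpu_smt_mask() and idle_cpu() exist, the helper
itself is invented):

/*
 * Look for an idle SMT sibling of an already-busy core before waking up
 * an empty core. Returns -1 if no such sibling exists, in which case the
 * normal selection logic applies.
 */
static int find_idle_smt_buddy(int busy_cpu)
{
	int sibling;

	for_each_cpu(sibling, cpu_smt_mask(busy_cpu)) {
		if (sibling != busy_cpu && idle_cpu(sibling))
			return sibling;
	}
	return -1;
}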

2013-07-15 02:06:40

by Alex Shi

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On 07/14/2013 12:14 AM, Arjan van de Ven wrote:
>
>
> on thinking more about the short running task thing; there is an
> optimization we currently don't do,
> mostly for hyperthreading. (and HT is just one out of a set of cases
> with similar power behavior)
> If we know a task runs briefly AND is not performance critical, it's
> much much better to place it on
> a hyperthreading buddy of an already busy core than it is to place it on
> an empty core (or to delay it).
> Yes a HT pair isn't the same performance as a full core, but in terms of
> power the 2nd half of a HT pair
> is nearly free... so if there's a task that's not performance sensitive
> (and won't disturb the other task too much,
> e.g. runs briefly enough)... it's better to pack onto a core than to
> spread.
> you can generalize this to a class of systems where adding work to a
> core (read: group of cpus that share resources)
> is significantly cheaper than running on a full empty core.

Right!
That is one of the purposes my old power scheduling patchset wanted to achieve:
https://lkml.org/lkml/2013/4/3/747
Vincent's patchset also targets this.

--
Thanks
Alex

2013-07-15 03:46:20

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

Hi Morten,

On 07/12/2013 07:18 PM, Morten Rasmussen wrote:
> On Thu, Jul 11, 2013 at 12:34:49PM +0100, Preeti U Murthy wrote:
>> Hi Morten,
>>
>> I have a few quick comments.
>>
>> I am concerned too about scheduler making its load balancing decisions
>> based on the cpu frequency for the reason that it could create an
>> imbalance in the load across cpus.
>>
>> Scheduler could keep loading a cpu, because its cpu frequency goes on
>> increasing, and it could keep un-loading a cpu because its cpu frequency
>> goes on decreasing. This increase and decrease as an effect of the load
>> itself. This is of course assuming that the driver would make its
>> decisions proportional to the cpu load. There could be many more
>> complications, if the driver makes its decisions on factors unknown to
>> the scheduler.
>>
>> Therefore my suggestion is that we should simply have the scheduler
>> asking for increase/decrease in the frequency and letting it at that.
>
> If I understand correctly your concern is about the effect of frequency
> scaling on load-balancing when using tracked load (PJT's) for the task
> loads as it is done in Alex Shi's patches.
>
> That problem is present even with the existing cpufreq governors and has
> not been addressed yet. Tasks on cpus at low frequencies appear bigger
> since they run longer, which will cause the load-balancer to think the
> cpu loaded and move tasks to other cpus. That will cause cpufreq to
> lower the frequency of that cpu and make any remaining tasks look even
> bigger. The story repeats itself.
>
> One might be tempted to suggest to use arch_scale_freq_power to tell the
> load-balancer about frequency scaling. But in its current form it will
> actually make it worse, as cpu_power is currently used to indicate max
> compute capacity and not the current one.
>
> I don't understand how a simple up/down request from the scheduler would
> solve that problem. It would just make frequency scaling slower if you
> only go up or down one step at the time. Much like the existing
> conservative cpufreq governor that nobody uses. Maybe I am missing
> something?
>
> I think we should look into scaling the tracked load by some metric that
> represents the current performance of the cpu whenever the tracked load
> is updated as it was suggested by Arjan in our previous discussion. I
> included it in my power scheduler design proposal, but I haven't done
> anything about it yet.
>
> In short, I agree that there is a problem around load-balancing and
> frequency scaling that needs to be fixed. Without Alex's patches the
> problem is not present as task load doesn't depend on the cpu load of the
> task.

My concern is something like this:

Scheduler sees a cpu loaded, asks the driver for an increase in its
frequency. Let us assume now that the driver agrees to increase the
frequency. Next time the scheduler checks this cpu, it has higher
capacity due to the increase in the frequency. It loads it more. Now the
load is high again, and an increase in cpu frequency is asked for. If this cycle
repeats, we will see a few cpus heavily loaded at the maximum frequency
they could possibly run at, while the rest are not loaded at all. Will this
patch result in such a see-saw situation? This is something I am unable
to make out.

Currently the scheduler sees all cpus alike at a core level. So the bias
towards some cpu is based only on the load. But in this patch, the bias
in scheduling can be based on cpu frequency as well. What kind of an
impact can this have on load balancing? This is my primary concern.
Probably you will be able to see this in your testing. But just bringing
out this point.

>
>> Secondly, I think we should spend more time on when to make a call to
>> the frequency driver in your patchset regarding the change in the
>> frequency of the CPU, the scheduler wishes to request. The reason being,
>> the whole effort of integrating the knowledge of cpu frequency
>> statistics into the scheduler is being done because the scheduler can
>> call the frequency driver at times *complimenting* load balancing,
>> unlike now.
>
> I don't think I get your point here. The current policy in this patch
> set is just a prototype that should be improved. The power scheduler
> does complement the load-balancer already by asking for frequency
> changes as the cpu load changes.

Scenario: Let's say the scheduler finds at some point in time that load
balancing cannot be done for performance. At this
time, it would be good to have the frequencies of the cpus boosted.

In the existing implementation, the cpu frequency governor gets called
after certain intervals of time, asynchronous with the load balancing.
In the above scenario the frequency governor would probably not come to
the rescue in time to ask for a boost in the frequency of the cpus. Your
patch has the potential to solve this. We are now considering calling
calculate_cpu_capacities() in the scheduler tick. Will this solve the
above mentioned scenario? Or is the above scenario hypothetical?
I am just thinking out loud.

Regards
Preeti U Murthy

2013-07-15 07:53:42

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On 13 July 2013 12:23, Catalin Marinas <[email protected]> wrote:
> Hi Peter,
>
> (Morten's away for a week, I'll try cover some bits in the meantime)
>
> On Sat, Jul 13, 2013 at 07:49:09AM +0100, Peter Zijlstra wrote:
>> On Tue, Jul 09, 2013 at 04:55:29PM +0100, Morten Rasmussen wrote:
>> > This patch set is an initial prototype aiming at the overall power-aware
>> > scheduler design proposal that I previously described
>> > <http://permalink.gmane.org/gmane.linux.kernel/1508480>.
>> >
>> > The patch set introduces a cpu capacity managing 'power scheduler' which lives
>> > by the side of the existing (process) scheduler. Its role is to monitor the
>> > system load and decide which cpus that should be available to the process
>> > scheduler.
>>
>> Hmm...
>>
>> This looks like a userspace hotplug deamon approach lifted to kernel space :/
>
> The difference is that this is faster. We even had hotplug in mind some
> years ago for big.LITTLE but it wouldn't give the performance we need
> (hotplug is incredibly slow even if driven from the kernel).
>
>> How about instead of layering over the load-balancer to constrain its behaviour
>> you change the behaviour to not need constraint? Fix it so it does the right
>> thing, instead of limiting it.
>>
>> I don't think its _that_ hard to make the balancer do packing over spreading.
>> The power balance code removed in 8e7fbcbc had things like that (although it
>> was broken). And I'm sure I've seen patches over the years that did similar
>> things. Didn't Vincent and Alex also do things like that?
>>
>> We should take the good bits from all that and make something of it. And I
>> think its easier now that we have the per task and per rq utilization numbers
>> [1].
>
> That's what we've been pushing for. From a big.LITTLE perspective, I
> would probably vote for Vincent's patches but I guess we could probably
> adapt any of the other options.
>
> But then we got Ingo NAK'ing all these approaches. Taking the best bits
> from the current load balancing patches would create yet another set of
> patches which don't fall under Ingo's requirements (at least as I
> understand them).

In fact we are currently updating our patchset based on Ingo's
feedback. The move of the cpuidle and cpufreq statistics was planned to
appear later in our development but we are now integrating it based on Ingo's
request. We start with the cpuidle statistics and are moving them into the
scheduler. In addition, we want to integrate the current C-state of a
core into the wake-up decision.

>
>> Just start by changing the balancer to pack instead of spread. Once that works,
>> see where the two modes diverge and put a knob in.
>
> That's the approach we've had so far (not sure about the knob). But it
> doesn't solve Ingo's complain about fragmentation between scheduler,
> cpufreq and cpuidle policies.
>
>> Then worry about power thingies.
>
> To quote Ingo: "To create a new low level idle driver mechanism the
> scheduler could use and integrate proper power saving / idle policy into
> the scheduler."
>
> That's unless we all agree (including Ingo) that the above requirement
> is orthogonal to task packing and, as a *separate* project, we look at
> better integrating the cpufreq/cpuidle with the scheduler, possibly with
> a new driver model and governors as libraries used by such drivers. In
> which case the current packing patches shouldn't be NAK'ed but reviewed
> so that they can be improved further or rewritten.
>
>> The integration of cpuidle and cpufreq should start by unifying all the
>> statistics stuff. For cpuidle we need to pull in the per-cpu idle time
>> guestimator. For cpufreq the per-cpu usage stuff -- which we already have in
>> the scheduler these days!
>>
>> Once we have all the statistics in place, its also easier to see what we can do
>> with them and what might be missing.
>>
>> At this point mandate that policy drivers may not do any statistics gathering
>> of their own. If they feel the need to do so, we're missing something and
>> that's not right.
>
> I agree in general but there is the intel_pstate.c driver which has it's
> own separate statistics that the scheduler does not track. We could move
> to invariant task load tracking which uses aperf/mperf (and could do
> similar things with perf counters on ARM). As I understand from Arjan,
> the new pstate driver will be different, so we don't know exactly what
> it requires.
>
>> I'm not entirely sold on differentiating between short running and other tasks
>> either. Although I suppose I see where that comes from. A task that would run
>> 50% on a big core would unlikely be qualified as small, however if it would
>> require 85% of a small core and there's room on the small cores its a good move
>> to run it there.
>>
>> So where's the limit for being small? It seems like an artificial limit and
>> such should be avoided where possible.
>
> I agree. With Morten's approach, it doesn't care about how small a task
> is but rather when a CPU (or cluster) is loaded to a certain threshold,
> just spread tasks to the next. I think small task threshold on its own
> doesn't make much sense if you have lots of such 'small' tasks, so you
> need a view of the overall load or a more dynamic threshold.
>
> --
> Catalin

2013-07-15 09:55:13

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

Hi Preeti,

On Mon, Jul 15, 2013 at 04:43:47AM +0100, Preeti U Murthy wrote:
> On 07/12/2013 07:18 PM, Morten Rasmussen wrote:
> > On Thu, Jul 11, 2013 at 12:34:49PM +0100, Preeti U Murthy wrote:
> >> I have a few quick comments.
> >>
> >> I am concerned too about scheduler making its load balancing decisions
> >> based on the cpu frequency for the reason that it could create an
> >> imbalance in the load across cpus.
> >>
> >> Scheduler could keep loading a cpu, because its cpu frequency goes on
> >> increasing, and it could keep un-loading a cpu because its cpu frequency
> >> goes on decreasing. This increase and decrease as an effect of the load
> >> itself. This is of course assuming that the driver would make its
> >> decisions proportional to the cpu load. There could be many more
> >> complications, if the driver makes its decisions on factors unknown to
> >> the scheduler.
> >>
> >> Therefore my suggestion is that we should simply have the scheduler
> >> asking for increase/decrease in the frequency and letting it at that.
> >
> > If I understand correctly your concern is about the effect of frequency
> > scaling on load-balancing when using tracked load (PJT's) for the task
> > loads as it is done in Alex Shi's patches.
> >
> > That problem is present even with the existing cpufreq governors and has
> > not been addressed yet. Tasks on cpus at low frequencies appear bigger
> > since they run longer, which will cause the load-balancer to think the
> > cpu loaded and move tasks to other cpus. That will cause cpufreq to
> > lower the frequency of that cpu and make any remaining tasks look even
> > bigger. The story repeats itself.
> >
> > One might be tempted to suggest to use arch_scale_freq_power to tell the
> > load-balancer about frequency scaling. But in its current form it will
> > actually make it worse, as cpu_power is currently used to indicate max
> > compute capacity and not the current one.
> >
> > I don't understand how a simple up/down request from the scheduler would
> > solve that problem. It would just make frequency scaling slower if you
> > only go up or down one step at the time. Much like the existing
> > conservative cpufreq governor that nobody uses. Maybe I am missing
> > something?
> >
> > I think we should look into scaling the tracked load by some metric that
> > represents the current performance of the cpu whenever the tracked load
> > is updated as it was suggested by Arjan in our previous discussion. I
> > included it in my power scheduler design proposal, but I haven't done
> > anything about it yet.
> >
> > In short, I agree that there is a problem around load-balancing and
> > frequency scaling that needs to be fixed. Without Alex's patches the
> > problem is not present as task load doesn't depend on the cpu load of the
> > task.
>
> My concern is something like this:
>
> Scheduler sees a cpu loaded, asks the driver for an increase in its
> frequency. Let us assume now that the driver agrees to increase the
> frequency. Next time the scheduler checks this cpu, it has higher
> capacity due to the increase in the frequency. It loads it more. Now the
> load is high again, and an increase in cpu frequency is asked for. If this
> cycle repeats, a few cpus will end up heavily loaded at the maximum frequency
> they could possibly run at, while the rest are not loaded at all. Will this
> patch result in such a see-saw situation? This is something I am unable
> to make out.

I don't think Morten's patches change the current behaviour when
cpu_power is set to maximum for all the CPUs. In this first prototype it
actually makes this behaviour explicit by setting cpu_power to max for
the first core and 1 for the rest and gradually allowing next cores to
be used if the previous are loaded. But that's because it doesn't yet
consider the topology. With this in place and feedback from the low
level driver, it could simply tell the load balancer to use the entire
socket, as all the cores have the same frequency, or that it doesn't make
sense from a power perspective to use only a single core within a socket.

> Currently the scheduler sees all cpus alike at a core level. So the bias
> towards some cpu is based only on the load. But in this patch, the bias
> in scheduling can be based on cpu frequency as well. What kind of an
> impact can this have on load balancing? This is my primary concern.
> Probably you will be able to see this in your testing. But just bringing
> out this point.

I don't think we could overload cpu_power any further. It's used mainly
for CPU capacity (Morten's patch sets it to either 1 or 1024). I think
the way around is to make the load tracking frequency-invariant,
possibly using things like aperf/mperf or other counters. It's not
perfect either but probably better than time-based load-tracking for
this scenario.
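
As a rough sketch of what that could look like (the arch_curr_freq()/arch_max_freq()
helpers are made up here, they are not an existing API; on x86 the ratio could come
from APERF/MPERF):

static inline u64 scale_delta_by_freq(u64 delta, int cpu)
{
	/* scale a load-tracking delta by the current/max frequency ratio,
	 * so a task on a slowed-down cpu does not look artificially big */
	unsigned long curr = arch_curr_freq(cpu);	/* assumed helper */
	unsigned long max = arch_max_freq(cpu);		/* assumed helper */

	return div_u64(delta * curr, max);
}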

> >> Secondly, I think we should spend more time on when to make a call to
> >> the frequency driver in your patchset regarding the change in the
> >> frequency of the CPU, the scheduler wishes to request. The reason being,
> >> the whole effort of integrating the knowledge of cpu frequency
> >> statistics into the scheduler is being done because the scheduler can
>> call the frequency driver at times *complementing* load balancing,
> >> unlike now.
> >
> > I don't think I get your point here. The current policy in this patch
> > set is just a prototype that should be improved. The power scheduler
> > does complement the load-balancer already by asking for frequency
> > changes as the cpu load changes.
>
> Scenario: Let's say the scheduler at some point finds that load
> balancing cannot be done for performance at some point in time. At this
> time, it would be good to have the frequencies of the cpus boosted.
>
> In the existing implementation, the cpu frequency governor gets called
> after certain intervals of time, asynchronous with the load balancing.
> In the above scenario the frequency governor would probably not come to
> the rescue in time to ask for a boost in the frequency of the cpus. Your
> patch has the potential to solve this. We are now considering calling
> calculate_cpu_capacities() in the scheduler tick. Will this solve the
> above mentioned scenario? Or is the above scenario hypothetical?

Morten's patches calculate the CPU capacities periodically but this will
be tied to the scheduler tick in a new version. The power scheduler has
tighter integration with the task scheduler, so it gets statistics like
load, number of tasks. It can easily detect whether the load can be
spread to other CPUs or it just needs to boost the current frequency.

In terms of how it boosts the performance, a suggestion was to keep the
power scheduler relatively simple with an API to a new model of power
driver and have the actual scaling algorithm (governor) as a library used
by the low-level driver. We can keep the API simple like
get_max_performance() etc. but the driver has the potential to choose
what is best suited for the hardware.
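
Purely as an illustration of that split (the names are invented, not a proposed
API), the interface the power scheduler sees could be as small as:

struct power_driver_ops {
	/* highest capacity the hardware can currently deliver on this cpu */
	int (*get_max_performance)(int cpu);
	/* ask for at least this much capacity; the driver and its governor
	 * library decide which P-state that maps to */
	void (*request_performance)(int cpu, int capacity);
};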

--
Catalin

2013-07-15 15:25:34

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On 7/15/2013 2:55 AM, Catalin Marinas wrote:
> In terms of how it boosts the performance, a suggestion was to keep the
> power scheduler relatively simple with an API to a new model of power
> driver and have the actual scaling algorithm (governor) as a library used
> by the low-level driver. We can keep the API simple like
> get_max_performance() etc. but the driver has the potential to choose
> what is best suited for the hardware.

I like simple ;-)
I like descriptive and intent-driven as well (rather than prescriptive) for high level concepts.
and I like libraries of functionality you can pull from.

one thing we're skirting around in this whole discussion is the concept of performance sensitivity.
or to phrase it in the form of a question: "Is more performance desired right now?"
Some of these answers certainly can come from the scheduler; in certain specific cases
it will know that the answer is "yes" to that question. An oversubscribed runqueue
is certainly such a case. Scheduling a realtime/highpriority/whatever task.. the scheduler
knows more than anyone else about that.
There are other cases elsewhere in the kernel (the graphics driver may have ideas if it just missed a frame
for example).
Very high interrupt rates are another clear case of such sensitivity.

(and I'm quite fine presuming a "no unless" policy for the question)

what is hard for the scheduler is that by the time the scheduler realizes it's in a hole,
it may already be too late. Yes P states change relatively quickly... and it is certainly
worth saying "I'm in the hole, go faster!".
But seeing the impact of the "go faster" on the RQ will take time, e.g. only some time later
(say 10 to 100 msec) is the scheduler able to evaluate if the change helped enough.
It's tempting to just wait.. but maybe the right answer is to do two things: Load balance right now,
AND boost the P state of the cpus that run the load after the balance. And then 10 to 100 msec later,
evaluate if they can be balanced/consolidated back.
E.g. jump out of the hole instantly, and then look later if the hole is filled enough to jump back into ;-)

2013-07-15 20:00:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Sat, Jul 13, 2013 at 07:40:08AM -0700, Arjan van de Ven wrote:
> On 7/12/2013 11:49 PM, Peter Zijlstra wrote:
> >
> >Arjan; from reading your emails you're mostly busy explaining what cannot be
> >done. Please explain what _can_ be done and what Intel wants. From what I can
> >see you basically promote a max P state max concurrency race to idle FTW.
>
> >
> >Since you can't say what the max P state is; and I think I understand the
> >reasons for that, and the hardware might not even respect the P state you tell
> >it to run at, does it even make sense to talk about Intel P states? When would
> >you not program the max P state?
>
> this is where it gets complicated ;-( the race-to-idle depends on the type of
> code that is running, if things are memory bound it's outright not true, but
> for compute bound it often is.

So you didn't actually answer the question about when you'd program a less than
max P state. Your recommended interface also glaringly lacks the
arch_please_go_slower_noaw() function.

What's the point of having a 'go faster' button if you can't also go slower?

So you can program any P state; but the hardware is free to do as it pleases but
not slower than the lowest P state. So clearly the hardware is 'smart'.

Going by your interface there's also not much influence as to where the 'power'
goes; can we for example force the GPU to clock lower in order to 'free' up
power for cores?

If we can, we should very much include that in the entire discussion.


> What I would like to see is
>
> 1) Move the idle predictor logic into the scheduler, or at least a library
> (I'm not sure the scheduler can do better than the current code, but it might,
> and what menu does today is at least worth putting in some generic library)

Right, so the idea is that these days we have much better task runtime
behaviour tracking than we used to have and this might help. I also realize the
idle guestimator uses more than just task activity, interrupt activity is also
very important.

This also makes it not a pure scheduling thing so I wouldn't be too bothered if
it lived in kernel/cpu/idle.c instead of in the scheduler proper.

Not sure calling it a generic library would be wise; that has such an optional
sound to it. The thing we want to avoid is people brewing their own etc..

Also, my interest in it is that the scheduler wants to use it; and when we go
do power aware scheduling I feel it should live very near the scheduler if not
in the scheduler for the simple reason that part of being power aware is trying
to stay idle as long as possible; the idle guestimator is the measure of that.

So in that sense they are closely related.

> 2) An interface between scheduler and P state code in the form of (and don't take the names as actual function names ;-)
> void arch_please_go_fastest(void); /* or maybe int cpunr as argument, but that's harder to implement */

Here again, the only thing this allows is max P state race for idle. Why would
Intel still pretend to have P states if they're so useless and mean so little?

> int arch_can_you_go_faster(void); /* if the scheduler would like to know this instead of load balancing .. unsure */

You said Intel could not say if it were at the max P state; so how could it
possibly answer this one?

> unsigned long arch_instructions_executed(void); /* like tsc, but on instructions, so the scheduler can account actual work done */

To what purpose? People mostly still care about wall-time for things like
response and such. Also, it's not something most arch will be able to provide
without sacrificing a PMU counter if they even have such a thing. Also not
everybody is as 'fast' in reading PMU state as one would like.

>
> the first one is for the scheduler to call when it sees a situation of "we
> care deeply about performance now" coming, for example near overload, or
> when a realtime (or otherwise high priority) task gets scheduled. the
> second one I am dubious about, but maybe you have a use for it; some folks
> think that there is value in deciding to ramp up the performance rather
> than load balancing. For load balancing to an idle cpu, I don't see that
> value (in terms of power efficiency) but I do see a case where your 2 cores
> happen to be busy (some sort of thundering herd effect) but imbalanced; in
> that case going faster rather than rebalance... I can certainly see the
> point.

(reformatted to 80 col text)

The entire scheme seems to disregard everybody who doesn't have a 'smart'
micro controller doing the P state management. Some people will have to
actually control the cpufreq.


> 3) an interface from the C state hardware driver to the scheduler to say "oh
> btw, the LLC got flushed, forget about past cache affinity". The C state
> driver can sometimes know this.. and linux today tries to keep affinity
> anyway while we could get more optimal by being allowed to balance more
> freely

This shouldn't be hard to implement at all.

> 4) this is the most important one, but like the hardest one: An interface
> from the scheduler that says "we are performance sensitive now": void
> arch_sched_performance_sensitive(int duration_ms);
>
> I've put a duration as argument, rather than a "arch_no_longer_sensitive",
> to avoid for the scheduler to run some periodic timer/whatever to keep
> this; rather it is sort of a "lease", that the scheduler can renew as
> often as it wants; but it auto-expires eventually.
>
> with this the hardware and/or hardware drivers can make a performance bias
> in their decisions based on what is actually the driving force behind both
> P and C state decisions: performance sensitivity. (all this utilization
> stuff that menu but also the P state drivers try to do is estimating how
> sensitive we are to performance, and if we're not sensitive, consider
> sacrificing some performance for power. Even with race-to-halt, sometimes
> sacrificing a little performance gives a power benefit at the top of the
> range)

Right, trouble is of course we have nothing to base this on. Our task model
completely lacks any clue for this. And the problem with introducing something
like that would also be that I suspect that within a few years every single
task on the system would find itself 'important'.

> >IIRC you at one point said there was a time limit below which concurrency
> >spread wasn't useful anymore?
>
> there is a time below which waking up a core (not hyperthread pair, that is
> ALWAYS worth it since it's insanely cheap) is not worth it. Think in the
> order of "+/- 50 microseconds".

OK.

> >Also, most what you say for single socket systems; what does Intel want for
> >multi-socket systems?
>
> for multisocket, rule number one is "don't screw up numa".
> for tasks where numa matters, that's the top priority.

OK, so again, make sure to get the work done as quickly as possible and go idle
again.

2013-07-15 20:37:47

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On 7/15/2013 12:59 PM, Peter Zijlstra wrote:

>> this is where it gets complicated ;-( the race-to-idle depends on the type of
>> code that is running, if things are memory bound it's outright not true, but
>> for compute bound it often is.
>
> So you didn't actually answer the question about when you'd program a less than
> max P state. Your recommended interface also glaringly lacks the
> arch_please_go_slower_noaw() function.

an arch_you_may_go_slower_now() might make sense, sure.
(I am not aware of anything DEMANDING to go slower, unlike the go faster side of things)
I can see that being useful when you stop running that realtime task
or similar conditions.

Alternative would be to make the "go faster" side be a lease kind of thing
that you can later cancel.

> So you can program any P state; but the hardware is free to do as it pleases but
> not slower than the lowest P state. So clearly the hardware is 'smart'.

any device on the market has some level of smarts there, just by virtue of dual core
and on board graphics. Even the ARM world has various smarts there (and will get more
no doubt over time)

> Going by your interface there's also not much influence as to where the 'power'
> goes; can we for example force the GPU to clock lower in order to 'free' up
> power for cores?

I would love that to be the case. And the GPU driver certainly has some knobs/influence
there. That being separate from CPU PM is one of the huge holes we have today
(much more so than the whole scheduler-vs-power thing)


> If we can, we should very much include that in the entire discussion.

absolutely. Note that it's not an easy topic, as in... very much unsolved
anywhere and everywhere, and not for lack of trying.

>> What I would like to see is
>>
>> 1) Move the idle predictor logic into the scheduler, or at least a library
>> (I'm not sure the scheduler can do better than the current code, but it might,
>> and what menu does today is at least worth putting in some generic library)
>
> Right, so the idea is that these days we have much better task runtime
> behaviour tracking than we used to have and this might help. I also realize the
> idle guestimator uses more than just task activity, interrupt activity is also
> very important.

when I wrote that part of the menu governor, it was ALL about interrupts.
the task side is well known, at least in the short term, since we know
that that will come via a timer.
(I'm counting IPI's as interrupts here)

Now, the other half of this is the "how performance sensitive are we", and I sure
hope the scheduler has a better idea than the menu governor....


> Not sure calling it a generic library would be wise; that has such an optional
> sound to it. The thing we want to avoid is people brewing their own etc..

well, if it works well, people will use it.
if it sucks horribly, people won't and make something else...
... after which we turn that into the library function.
If the concepts and interfaces are at the right level, that can be done.

Especially for things like "when do we expect the next event to pull us out of idle",
that's a very generic concept that is not hardware dependent....


> Also, my interest in it is that the scheduler wants to use it; and when we go
> do power aware scheduling I feel it should live very near the scheduler if not
> in the scheduler for the simple reason that part of being power aware is trying
> to stay idle as long as possible; the idle guestimator is the measure of that.
>
> So in that sense they are closely related.

yeah as I said, I can see the point of turning this more generic.
I can even see the block layer or the GPU layer give input as well.

>
>> 2) An interface between scheduler and P state code in the form of (and don't take the names as actual function names ;-)
>> void arch_please_go_fastest(void); /* or maybe int cpunr as argument, but that's harder to implement */
>
> Here again, the only thing this allows is max P state race for idle. Why would
> Intel still pretend to have P states if they're so useless and mean so little?

race-to-idle is not universal, it depends on what type of instructions are being executed
(memory versus compute) and slightly on the physics.
>
>> int arch_can_you_go_faster(void); /* if the scheduler would like to know this instead of load balancing .. unsure */
>
> You said Intel could not say if it were at the max P state; so how could it
> possibly answer this one?

we do know if we asked for max... since it was us asking.


>
>> unsigned long arch_instructions_executed(void); /* like tsc, but on instructions, so the scheduler can account actual work done */
>
> To what purpose? People mostly still care about wall-time for things like
> >response and such. Also, it's not something most arch will be able to provide
> without sacrificing a PMU counter if they even have such a thing. Also not
> everybody is as 'fast' in reading PMU state as one would like.

well, right now for various scheduler priorities we use "time" as a metric for
timeslicing/etc without regard for the cpu performance at the time.
There likely is room for a different measure for "system capacity used"
that is a bit more finegrained than just time. Time is not bad,
and if there's no cheap special HW, it'll do... but I can see value for
doing something more advanced. Surely the big.little guys want this
(more than I'd want it)



>>
>> the first one is for the scheduler to call when it sees a situation of "we
>> care deeply about performance now" coming, for example near overload, or
>> when a realtime (or otherwise high priority) task gets scheduled. the
>> second one I am dubious about, but maybe you have a use for it; some folks
>> think that there is value in deciding to ramp up the performance rather
>> than load balancing. For load balancing to an idle cpu, I don't see that
>> value (in terms of power efficiency) but I do see a case where your 2 cores
>> happen to be busy (some sort of thundering herd effect) but imbalanced; in
>> that case going faster rather than rebalance... I can certainly see the
>> point.
>
> (reformatted to 80 col text)
>
> The entire scheme seems to disregard everybody who doesn't have a 'smart'
> micro controller doing the P state management. Some people will have to
> actually control the cpufreq.

that is ok, but the whole point is to make that control part of the hardware
specific driver side. The interface from the scheduler should be generic
enough that you can plug in various hardware specific parts on the other side.
Most certainly different CPU chips will use different algorithms over time.
(and of course there will be a library of such algorithms so that not every
cpu vendor/implementation has to reinvent the wheel from scratch).

heck, Linus waaay back insisted on this for cpufreq, since the Transmeta
cpus at the time did most of this purely in "hardware".


>> 3) an interface from the C state hardware driver to the scheduler to say "oh
>> btw, the LLC got flushed, forget about past cache affinity". The C state
>> driver can sometimes know this.. and linux today tries to keep affinity
>> anyway while we could get more optimal by being allowed to balance more
>> freely
>
> This shouldn't be hard to implement at all.

great!
Do you think it's worth having on the scheduler side? E.g. does it give you
more freedom in placement?
It's not completely free to get (think "an MSR read") and
there's the interesting question if this would be a per cpu
or a global statement... but we can get this

And at least for client systems (read: relatively low core counts) the cache
will get flushed quite a lot on Intel.
(and then refilled quickly of course)

>> 4) this is the most important one, but like the hardest one: An interface
>> from the scheduler that says "we are performance sensitive now": void
>> arch_sched_performance_sensitive(int duration_ms);
>>
>> I've put a duration as argument, rather than a "arch_no_longer_sensitive",
>> to avoid for the scheduler to run some periodic timer/whatever to keep
>> this; rather it is sort of a "lease", that the scheduler can renew as
>> often as it wants; but it auto-expires eventually.
>>
>> with this the hardware and/or hardware drivers can make a performance bias
>> in their decisions based on what is actually the driving force behind both
>> P and C state decisions: performance sensitivity. (all this utilization
> >stuff that menu but also the P state drivers try to do is estimating how
>> sensitive we are to performance, and if we're not sensitive, consider
>> sacrificing some performance for power. Even with race-to-halt, sometimes
>> sacrificing a little performance gives a power benefit at the top of the
>> range)
>
> Right, trouble is of course we have nothing to base this on. Our task model
> completely lacks any clue for this. And the problem with introducing something
> like that would also be that I suspect that within a few years every single
> task on the system would find itself 'important'.

there are some clear cases we can do.
but yes it's hard.
BUT we try to do the same thing today implicitly. Basically "using cpu time" is
used as a proxy for performance sensitivity in the "ondemand" governor.


>> there is a time below which waking up a core (not hyperthread pair, that is
>> ALWAYS worth it since it's insanely cheap) is not worth it. Think in the
>> order of "+/- 50 microseconds".
>
> OK.
>
>>> Also, most what you say for single socket systems; what does Intel want for
>>> multi-socket systems?
>>
>> for multisocket, rule number one is "don't screw up numa".
>> for tasks where numa matters, that's the top priority.
>
> OK, so again, make sure to get the work done as quickly as possible and go idle
> again.

it's more about "don't run inefficiently".
If you run, say, 10% less efficiently than you could, any power saving feature will
first need to make up those 10% before it starts winning.

A simple example would be bubble sort versus quicksort for a sizable data set.
If some theoretical CPU could run bubble sort instructions faster than quicksort instructions,
it's still a bad idea due to the general inefficiency of bubble sort.

Doing NUMA badly is not quite THAT bad, but still, it causes quite big inefficiencies
for tasks where NUMA matters... and winning that back in power tricks is going to be hard.


2013-07-15 20:40:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Sat, Jul 13, 2013 at 11:23:51AM +0100, Catalin Marinas wrote:
> > This looks like a userspace hotplug daemon approach lifted to kernel space :/
>
> The difference is that this is faster. We even had hotplug in mind some
> years ago for big.LITTLE but it wouldn't give the performance we need
> (hotplug is incredibly slow even if driven from the kernel).

faster, slower, still horrid :-)

> That's what we've been pushing for. From a big.LITTLE perspective, I
> would probably vote for Vincent's patches but I guess we could probably
> adapt any of the other options.
>
> But then we got Ingo NAK'ing all these approaches. Taking the best bits
> from the current load balancing patches would create yet another set of
> patches which don't fall under Ingo's requirements (at least as I
> understand them).

Right, so Ingo is currently away as well -- should be back 'today' or tomorrow.
But I suspect he mostly fell over the presentation.

I've never known Ingo to object to doing incremental development; in fact he
often suggests doing so.

So don't present the packing thing as a power aware scheduler; that
presentation suggests it's the complete deal. Give instead a complete
description of the problem; and tell how the current patch set fits into that
and which aspect it solves; and that further patches will follow to sort the
other issues.

That keeps the entire thing much clearer.

> > Then worry about power thingies.
>
> To quote Ingo: "To create a new low level idle driver mechanism the
> scheduler could use and integrate proper power saving / idle policy into
> the scheduler."
>
> That's unless we all agree (including Ingo) that the above requirement
> is orthogonal to task packing and, as a *separate* project, we look at
> better integrating the cpufreq/cpuidle with the scheduler, possibly with
> a new driver model and governors as libraries used by such drivers. In
> which case the current packing patches shouldn't be NAK'ed but reviewed
> so that they can be improved further or rewritten.

Right, so first thing would be to list all the thing that need doing:

- integrate idle guestimator
- integrate cpufreq stats
- fix per entity runtime vs cpufreq
- integrate/redo cpufreq
- add packing features
- {all the stuff I forgot}

Then see what is orthogonal and what is most important and get people to agree
to an order. Then go..

> I agree in general but there is the intel_pstate.c driver which has its
> own separate statistics that the scheduler does not track.

Right, question is how much of that will survive Arjan's next-gen effort.

> We could move
> to invariant task load tracking which uses aperf/mperf (and could do
> similar things with perf counters on ARM). As I understand from Arjan,
> the new pstate driver will be different, so we don't know exactly what
> it requires.

Right, so part of the effort should be understanding what the various parties
want/need. As far as I understand the Intel stuff, P states are basically
useless and the only useful state to ever program is the max one -- although
I'm sure Arjan will eventually explain how that is wrong :-)

We could do optional things; I'm not much for 'requiring' stuff that other
arch simply cannot support, or only support at great effort/cost.

Stealing PMU counters for sched work would be crossing the line for me, that
must be optional.

2013-07-15 20:41:58

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On 7/15/2013 12:59 PM, Peter Zijlstra wrote:
> On Sat, Jul 13, 2013 at 07:40:08AM -0700, Arjan van de Ven wrote:
>> On 7/12/2013 11:49 PM, Peter Zijlstra wrote:
>>>
>>> Arjan; from reading your emails you're mostly busy explaining what cannot be
>>> done. Please explain what _can_ be done and what Intel wants. From what I can
>>> see you basically promote a max P state max concurrency race to idle FTW.
>>
>>>
>>> Since you can't say what the max P state is; and I think I understand the
>>> reasons for that, and the hardware might not even respect the P state you tell
>>> it to run at, does it even make sense to talk about Intel P states? When would
>>> you not program the max P state?
>>
>> this is where it gets complicated ;-( the race-to-idle depends on the type of
>> code that is running, if things are memory bound it's outright not true, but
>> for compute bound it often is.
>
> So you didn't actually answer the question about when you'd program a less than
> max P state.
(oops missed this part in my previous reply)

so race to halt is all great, but it has a core limitation, it is fundamentally
assuming that if you go at a higher clock frequency, the code actually finishes sooner.
This is generally true for the normal "compute" kind of instructions, but
if you have an instruction that goes to memory (and misses caches), that is not the
case because memory itself does not go faster or slower with the CPU frequency.

so depending of the mix of compute and memory instructions, different tradeoffs
might be needed.

(for an example of this, AMD exposes a CPU counter for this as of recently and added
patches to "ondemand" to use it)

2013-07-15 21:04:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Mon, Jul 15, 2013 at 01:37:44PM -0700, Arjan van de Ven wrote:
> On 7/15/2013 12:59 PM, Peter Zijlstra wrote:
>
> >>this is where it gets complicated ;-( the race-to-idle depends on the type of
> >>code that is running, if things are memory bound it's outright not true, but
> >>for compute bound it often is.
> >
> >So you didn't actually answer the question about when you'd program a less than
> >max P state. Your recommended interface also glaringly lacks the
> >arch_please_go_slower_noaw() function.
>
> an arch_you_may_go_slower_now() might make sense, sure.
> (I am not aware of anything DEMANDING to go slower, unlike the go faster side of things)
> I can see that being useful when you stop running that realtime task
> or similar conditions.

Well, if you ever want to go faster there must've been a moment to slow down.
Without means and reason to slow down the entire 'can I go fast noaw pls?'
thing simply doesn't make sense.

> >So you can program any P state; but the hardware is free to do as it pleases but
> >not slower than the lowest P state. So clearly the hardware is 'smart'.
>
> any device on the market has some level of smarts there, just by virtue of
> dual core and on board graphics. Even the ARM world has various smarts there
> (and will get more no doubt over time)
>
> >Going by your interface there's also not much influence as to where the 'power'
> >goes; can we for example force the GPU to clock lower in order to 'free' up
> >power for cores?
>
> I would love that to be the case. And the GPU driver certainly has some
> knobs/influence there. That being separate from CPU PM is one of the huge
> holes we have today (much more so than the whole scheduler-vs-power thing)

OK, so drag them gfx people into this. I suppose the 'big' issue is going to be
how to figure out what is more important than the other :-)

But just leaving them do their thing clearly isn't an option.

> >If we can, we should very much include that in the entire discussion.
>
> absolutely. Note that it's not an easy topic, as in... very much unsolved
> anywhere and everywhere, and not for lack of trying.

Right, well, I'm not aware of people trying, so it might be good to 'educate'
those of us who do not know on what didn't work and why.

> >>What I would like to see is
> >>
> >>1) Move the idle predictor logic into the scheduler, or at least a library
> >> (I'm not sure the scheduler can do better than the current code, but it might,
> >> and what menu does today is at least worth putting in some generic library)
> >
> >Right, so the idea is that these days we have much better task runtime
> >behaviour tracking than we used to have and this might help. I also realize the
> >idle guestimator uses more than just task activity, interrupt activity is also
> >very important.
>
> when I wrote that part of the menu governor, it was ALL about interrupts.
> the task side is well known, at least in the short term, since we know
> that that will come via a timer.
> (I'm counting IPI's as interrupts here)
>
> Now, the other half of this is the "how performance sensitive are we", and I sure
> hope the scheduler has a better idea than the menu governor....
>
>
> >Not sure calling it a generic library would be wise; that has such an optional
> >sound to it. The thing we want to avoid is people brewing their own etc..
>
> well, if it works well, people will use it.
> if it sucks horribly, people won't and make something else...
> ... after which we turn that into the library function.
> If the concepts and interfaces are at the right level, that can be done.

I think we might be talking about the same thing here, but I'd rather there
ever only lives one instance of this logic in the entire kernel, and that when
people find it doesn't work for them they fix it for everybody, not hack their
own little world.

> Especially for things like "when do we expect the next event to pull us out of idle",
> that's a very generic concept that is not hardware dependent....

Clean concepts can help but are not required; the entire kernel is open source
and if you need something do a tree wide fix-up. That never stopped anybody.

> >> int arch_can_you_go_faster(void); /* if the scheduler would like to know this instead of load balancing .. unsure */
> >
> >You said Intel could not say if it were at the max P state; so how could it
> >possibly answer this one?
>
> we do know if we asked for max... since it was us asking.

Sure, but you can't tell if programming a higher P state will actually make you
go faster. Which is what the function asks for, can we go faster, you don't
know. You could program a higher P state, but it might not actually go any
faster simply because you're already at your thermal limits.

> well, right now for various scheduler priorities we use "time" as a metric for
> timeslicing/etc without regard for the cpu performance at the time.
> There likely is room for a different measure for "system capacity used"
> that is a bit more finegrained than just time. Time is not bad,
> and if there's no cheap special HW, it'll do... but I can see value for
> doing something more advanced. Surely the big.little guys want this
> (more than I'd want it)

Ah, I see what you mean. I think this issue will get sorted when we 'fix' the
runtime vs cpufreq issue. Using actual instructions executed might be one
solution; another would be to simply scale the measured time by the frequency
at which we ran.

I suppose it depends on what's cheapest etc. on the specific platforms and/or
makes most sense.

> >The entire scheme seems to disregard everybody who doesn't have a 'smart'
> >micro controller doing the P state management. Some people will have to
> >actually control the cpufreq.
>
> that is ok, but the whole point is to make that control part of the hardware
> specific driver side. The interface from the scheduler should be generic
> enough that you can plug in various hardware specific parts on the other side.
> Most certainly different CPU chips will use different algorithms over time.
> (and of course there will be a library of such algorithms so that not every
> cpu vendor/implementation has to reinvent the wheel from scratch).
>
> heck, Linus waaay back insisted on this for cpufreq, since the Transmeta
> cpus at the time did most of this purely in "hardware".

Hmm, okay, but I feel I'm still missing something. Notably the entire
go-faster thing. That simply cannot live without a matching go-slower side.


>
>
> >>3) an interface from the C state hardware driver to the scheduler to say "oh
> >>btw, the LLC got flushed, forget about past cache affinity". The C state
> >>driver can sometimes know this.. and linux today tries to keep affinity
> >>anyway while we could get more optimal by being allowed to balance more
> >>freely
> >
> >This shouldn't be hard to implement at all.
>
> great!
> Do you think it's worth having on the scheduler side? E.g. does it give you
> more freedom in placement?
> It's not completely free to get (think "an MSR read") and
> there's the interesting question if this would be a per cpu
> or a global statement... but we can get this
>
> And at least for client systems (read: relatively low core counts) the cache
> will get flushed quite a lot on Intel.
> (and then refilled quickly of course)

No idea, give it a go -- completely untested and such ;-)

----
kernel/sched/fair.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f77f9c5..ef83361 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3895,6 +3895,20 @@ static void move_task(struct task_struct *p, struct lb_env *env)
 	check_preempt_curr(env->dst_rq, p, 0);
 }
 
+DEFINE_PER_CPU(u64, llc_wipe_stamp);
+
+void arch_sched_wipe_llc(int cpu)
+{
+	struct sched_domain *sd;
+	u64 now = sched_clock_cpu(cpu);
+
+	rcu_read_lock();
+	sd = rcu_dereference(per_cpu(sd_llc, cpu));
+	if (sd) for_each_cpu(cpu, sched_domain_span(sd))
+		per_cpu(llc_wipe_stamp, cpu) = now;
+	rcu_read_unlock();
+}
+
 /*
  * Is this task likely cache-hot:
  */
@@ -3910,6 +3925,12 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 		return 0;
 
 	/*
+	 * Can't be hot if the LLC got wiped since we ran last.
+	 */
+	if (p->se.exec_start < this_cpu_read(llc_wipe_stamp))
+		return 0;
+
+	/*
 	 * Buddy candidates are cache hot:
 	 */
 	if (sched_feat(CACHE_HOT_BUDDY) && this_rq()->nr_running &&
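
A hypothetical call site for the hook above, only to show the intent; the
CPUIDLE_FLAG_LLC_FLUSH flag is invented here, a real driver would use its own
knowledge of which idle state flushes the LLC:

static void enter_llc_flushing_state(struct cpuidle_device *dev,
				     struct cpuidle_state *state)
{
	/* tell the scheduler that past cache affinity is now meaningless */
	if (state->flags & CPUIDLE_FLAG_LLC_FLUSH)	/* made-up flag */
		arch_sched_wipe_llc(dev->cpu);
}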

2013-07-15 21:07:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Mon, Jul 15, 2013 at 01:41:47PM -0700, Arjan van de Ven wrote:
> On 7/15/2013 12:59 PM, Peter Zijlstra wrote:
> >On Sat, Jul 13, 2013 at 07:40:08AM -0700, Arjan van de Ven wrote:
> >>On 7/12/2013 11:49 PM, Peter Zijlstra wrote:
> >>>
> >>>Arjan; from reading your emails you're mostly busy explaining what cannot be
> >>>done. Please explain what _can_ be done and what Intel wants. From what I can
> >>>see you basically promote a max P state max concurrency race to idle FTW.
> >>
> >>>
> >>>Since you can't say what the max P state is; and I think I understand the
> >>>reasons for that, and the hardware might not even respect the P state you tell
> >>>it to run at, does it even make sense to talk about Intel P states? When would
> >>>you not program the max P state?
> >>
> >>this is where it gets complicated ;-( the race-to-idle depends on the type of
> >>code that is running, if things are memory bound it's outright not true, but
> >>for compute bound it often is.
> >
> >So you didn't actually answer the question about when you'd program a less than
> >max P state.
> (oops missed this part in my previous reply)
>
> so race to halt is all great, but it has a core limitation, it is fundamentally
> assuming that if you go at a higher clock frequency, the code actually finishes sooner.
> This is generally true for the normal "compute" kind of instructions, but
> if you have an instruction that goes to memory (and misses caches), that is not the
> case because memory itself does not go faster or slower with the CPU frequency.
>
> so depending of the mix of compute and memory instructions, different tradeoffs
> might be needed.
>
> (for an example of this, AMD exposes a CPU counter for this as of recently and added
> patches to "ondemand" to use it)

OK, but isn't that part of why the micro controller might not make you go
faster even if you do program a higher P state?

But yes, I understand this issue in the 'traditional' cpufreq sense. There's no
point in ramping the speed if all you do is stall more.

But I was under the impression the 'hardware' was doing this. If not then we
need the whole go-faster and go-slower thing and places to call them and means
to determine to call them etc.

2013-07-15 21:13:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Mon, Jul 15, 2013 at 11:06:50PM +0200, Peter Zijlstra wrote:
> OK, but isn't that part of why the micro controller might not make you go
> faster even if you do program a higher P state?
>
> But yes, I understand this issue in the 'traditional' cpufreq sense. There's no
> point in ramping the speed if all you do is stall more.
>
> But I was under the impression the 'hardware' was doing this. If not then we
> need the whole go-faster and go-slower thing and places to call them and means
> to determine to call them etc.


So with the scheduler measuring cpu utilization we could say to go-faster when
u>0.8 and go-slower when u<0.75 or so. Lacking any better metrics like the
stall stuff etc.
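
Something like the following trivial hysteresis, purely as a sketch of that idea
(the arch_go_*() hooks are placeholders, not an existing interface):

static void sched_freq_hint(int cpu, unsigned int util_pct)
{
	if (util_pct > 80)
		arch_go_faster(cpu);	/* placeholder hook */
	else if (util_pct < 75)
		arch_go_slower(cpu);	/* placeholder hook */
	/* between 75% and 80%: leave the current frequency alone */
}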

So I understand that ondemand spends quite a lot of time 'sampling' what the
system does while the scheduler mostly already knows this. It also has problems
because of the whole sampling thing: either it samples too often and becomes
too expensive/disruptive, or it samples too little and misses world+dog.

I was hoping we could do better.

2013-07-15 22:46:11

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On 7/15/2013 2:03 PM, Peter Zijlstra wrote:
> Well, if you ever want to go faster there must've been a moment to slow down.
> Without means and reason to slow down the entire 'can I go fast noaw pls?'
> thing simply doesn't make sense.

I kind of tried to hint at this

there's either

go_fastest_now()

with the contract that the policy drivers can override this after some time (few ms)

or you have to treat it as a lease:

go_fastest()

and then

no_need_to_go_fastest_anymore_so_forget_I_asked()

this is NOT the same as

go_slow_now()

the former has a specific request, and then an end to that specific request,
the latter is just a new unbounded command

if you have requests (that either time out or get canceled), you can have
requests from multiple parts of the kernel (and potentially even from
hardware in the thermal case), and some arbiter
that resolves the case where multiple requests exist.

if you only have unbounded commands, you cannot really have such an arbiter.
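
A minimal sketch of that lease/arbiter idea (the arch_set_*() calls are made up
for illustration):

static atomic_t fastest_leases = ATOMIC_INIT(0);

void go_fastest(void)
{
	/* first requester switches the hardware policy to max performance */
	if (atomic_inc_return(&fastest_leases) == 1)
		arch_set_max_perf();		/* hypothetical driver call */
}

void no_need_to_go_fastest_anymore(void)
{
	/* last requester lets the driver fall back to its normal policy */
	if (atomic_dec_and_test(&fastest_leases))
		arch_set_default_perf();	/* hypothetical driver call */
}

A real implementation would additionally expire stale leases after a few ms, as
described above.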

2013-07-15 22:47:01

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal


>> so depending of the mix of compute and memory instructions, different tradeoffs
>> might be needed.
>>
>> (for an example of this, AMD exposes a CPU counter for this as of recently and added
>> patches to "ondemand" to use it)
>
> OK, but isn't that part of why the micro controller might not make you go
> faster even if you do program a higher P state?
>
> But yes, I understand this issue in the 'traditional' cpufreq sense. There's no
> point in ramping the speed if all you do is stall more.
>
> But I was under the impression the 'hardware' was doing this. If not then we
> need the whole go-faster and go-slower thing and places to call them and means
> to determine to call them etc.

so the answer is "somewhat" and "on some cpus";
not all generations of Intel cpus are the same in this regard ;-(

2013-07-15 22:52:36

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On 7/15/2013 2:12 PM, Peter Zijlstra wrote:
> On Mon, Jul 15, 2013 at 11:06:50PM +0200, Peter Zijlstra wrote:
>> OK, but isn't that part of why the micro controller might not make you go
>> faster even if you do program a higher P state?
>>
>> But yes, I understand this issue in the 'traditional' cpufreq sense. There's no
>> point in ramping the speed if all you do is stall more.
>>
>> But I was under the impression the 'hardware' was doing this. If not then we
>> need the whole go-faster and go-slower thing and places to call them and means
>> to determine to call them etc.
>
>
> So with the scheduler measuring cpu utilization we could say to go-faster when
> u>0.8 and go-slower when u<0.75 or so. Lacking any better metrics like the
> stall stuff etc.
>
> So I understand that ondemand spends quite a lot of time 'sampling' what the
> system does while the scheduler mostly already knows this.

yeah ondemand does this, but ondemand is actually a pretty bad governor.
not because of the sampling, but because of its algorithm.

if you look at what the ondemand algorithm tries to do, it's trying to
manage the cpu "frequency" primarily for when the system is idle.
Ten to twelve years ago, this was actually important and it does a decent
job on that.

HOWEVER, on modern CPUs, even many of the ARM ones, the frequency
when you're idle is zero anyway regardless of what you as OS ask for.

And when Linux went tickless, ondemand went to deferred timers, which make it
even worse.

btw technically ondemand does not sample things, you may (or may not) understand
what it does.
Every 10 (or 100) milliseconds, ondemand makes a new P state decision.
It does this by asking the scheduler the time used, does a delta and
ends up at a utilization %age which then goes into a formula.
It's not that ondemand samples in between decision moments to see if the system
is busy or not; the microaccounting that the scheduler does is used instead,
and only at decision moments.
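
Roughly, the per-decision computation described here boils down to something like
this (simplified sketch, not the actual governor code):

static unsigned int decision_point_load(u64 busy_prev, u64 busy_now,
					u64 wall_prev, u64 wall_now)
{
	u64 busy = busy_now - busy_prev;	/* scheduler-accounted busy time */
	u64 wall = wall_now - wall_prev;	/* length of the decision interval */

	if (!wall)
		return 0;

	return div64_u64(100 * busy, wall);	/* utilization %-age for the formula */
}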

2013-07-16 12:44:08

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Mon, Jul 15, 2013 at 09:39:22PM +0100, Peter Zijlstra wrote:
> On Sat, Jul 13, 2013 at 11:23:51AM +0100, Catalin Marinas wrote:
> > > This looks like a userspace hotplug daemon approach lifted to kernel space :/
> >
> > The difference is that this is faster. We even had hotplug in mind some
> > years ago for big.LITTLE but it wouldn't give the performance we need
> > (hotplug is incredibly slow even if driven from the kernel).
>
> faster, slower, still horrid :-)

Hotplug for power management is horrid, I agree, but it depends on how
you look at the problem. What we need (at least for ARM) is to leave a
socket/cluster idle when the tasks are few enough to all run on the
other. The old power saving scheduling used to have some hierarchy with
different balancing policies per level of hierarchy. IIRC this was too
complex with 9 possible states and some chance of going up to 27. As a
simpler replacement, just left-packing of tasks does not work either,
so you need some power topology information in the scheduler.

I can see two approaches with regards to task placement:

1. Get the load balancer to pack tasks in a way to optimise performance
within a socket but let other sockets idle.
2. Have another entity (power scheduler as per Morten's patches) decide
which sockets to be used and let the main scheduler do its best
within those constraints.

With (2) you have few changes to the main load balancer and a reduced
state space (basically it only cares about CPU capacities rather than
balancing policies at different levels). We then keep the power
topology, feedback from the low-level driver (like what can/cannot be
done) into the separate power scheduler entity. I would say the load
balancer state space from a power awareness perspective is linearised.

> > That's what we've been pushing for. From a big.LITTLE perspective, I
> > would probably vote for Vincent's patches but I guess we could probably
> > adapt any of the other options.
> >
> > But then we got Ingo NAK'ing all these approaches. Taking the best bits
> > from the current load balancing patches would create yet another set of
> > patches which don't fall under Ingo's requirements (at least as I
> > understand them).
>
> Right, so Ingo is currently away as well -- should be back 'today' or tomorrow.
> But I suspect he mostly fell over the presentation.
>
> I've never known Ingo to object to doing incremental development; in fact he
> often suggests doing so.
>
> So don't present the packing thing as a power aware scheduler; that
> presentation suggests its the complete deal. Give instead a complete
> description of the problem; and tell how the current patch set fits into that
> and which aspect it solves; and that further patches will follow to sort the
> other issues.

Thanks for the clarification ;).

> > > Then worry about power thingies.
> >
> > To quote Ingo: "To create a new low level idle driver mechanism the
> > scheduler could use and integrate proper power saving / idle policy into
> > the scheduler."
> >
> > That's unless we all agree (including Ingo) that the above requirement
> > is orthogonal to task packing and, as a *separate* project, we look at
> > better integrating the cpufreq/cpuidle with the scheduler, possibly with
> > a new driver model and governors as libraries used by such drivers. In
> > which case the current packing patches shouldn't be NAK'ed but reviewed
> > so that they can be improved further or rewritten.
>
> Right, so first thing would be to list all the thing that need doing:
>
> - integrate idle guestimator
> - integrate cpufreq stats
> - fix per entity runtime vs cpufreq
> - integrate/redo cpufreq
> - add packing features
> - {all the stuff I forgot}
>
> Then see what is orthogonal and what is most important and get people to agree
> to an order. Then go..

It sounds fine, not different from what we've thought. A problem is that
task packing on its own doesn't give any clear view of what the overall
solution will look like, so I assume you/Ingo would like to see the
bigger picture (though probably not the complete implementation but
close enough).

Morten's power scheduler tries to address the above and it will grow
into controlling a new model of power driver (and taking into account
Arjan's and others' comments regarding the API). At the same time, we
need some form of task packing. The power scheduler can drive this
(currently via cpu_power) or can simply turn a knob if there are better
options that will be accepted in the scheduler.

> > I agree in general but there is the intel_pstate.c driver which has its
> > own separate statistics that the scheduler does not track.
>
> Right, question is how much of that will survive Arjan's next-gen effort.

I think all Arjan cares about is a simple go_fastest() API ;).

> > We could move
> > to invariant task load tracking which uses aperf/mperf (and could do
> > similar things with perf counters on ARM). As I understand from Arjan,
> > the new pstate driver will be different, so we don't know exactly what
> > it requires.
>
> Right, so part of the effort should be understanding what the various parties
> want/need. As far as I understand the Intel stuff, P states are basically
> useless and the only useful state to ever program is the max one -- although
> I'm sure Arjan will eventually explain how that is wrong :-)
>
> We could do optional things; I'm not much for 'requiring' stuff that other
> arch simply cannot support, or only support at great effort/cost.
>
> Stealing PMU counters for sched work would be crossing the line for me, that
> must be optional.

I agree, it should be optional.

--
Catalin

2013-07-16 15:23:15

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On 7/16/2013 5:42 AM, Catalin Marinas wrote:
> Morten's power scheduler tries to address the above and it will grow
> into controlling a new model of power driver (and taking into account
> Arjan's and others' comments regarding the API). At the same time, we
> need some form of task packing. The power scheduler can drive this
> (currently via cpu_power) or can simply turn a knob if there are better
> options that will be accepted in the scheduler.

how much would you be helped if there was a simple switch

sort left versus sort right

(assuming the big cores are all either low or high numbers)

the sorting is mostly statistical, but that's good enough in practice..
each time a task wakes up, you get a bias towards either low or high
numbered idle cpus

very quickly all tasks will be on one side, unless your system is so
loaded that all cpus are full.
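
A sketch of what that bias could look like at wake-up (illustrative only;
scanning from the lowest-numbered allowed cpu gives the 'sort left' case,
scanning the other way gives 'sort right'):

static int select_idle_cpu_sorted(struct task_struct *p)
{
	int cpu;

	/* lowest-numbered idle cpu wins, so over time tasks statistically
	 * pile up on one side of the system */
	for_each_cpu_and(cpu, cpu_online_mask, tsk_cpus_allowed(p)) {
		if (idle_cpu(cpu))
			return cpu;
	}

	return -1;	/* no idle cpu; fall back to the normal selection */
}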

2013-07-16 17:40:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Mon, Jul 15, 2013 at 03:52:32PM -0700, Arjan van de Ven wrote:

> yeah ondemand does this, but ondemand is actually a pretty bad governor.
> not because of the sampling, but because of its algorithm.

Is it good for any class of hardware still out there? Or should the thing be
shot in the head?

You saying AMD patched the thing makes me confused; why would they patch a
piece of crap?

> HOWEVER, on modern CPUs, even many of the ARM ones, the frequency
> when you're idle is zero anyway regardless of what you as OS ask for.

Right, entire cores are power gated.

So power wise the voltage you run at is important; so for hardware where lower
frequencies allow lower voltage, does it still make sense to run the lowest
possible voltage such that there is still some idle time?

Or is the fact that you're running so much longer negating the power save from
the lower voltage?

> Every 10 (or 100) milliseconds, ondemand makes a new P state decision.
> It does this by asking the scheduler the time used, does a delta and
> ends up at a utilization %age which then goes into a formula.
> It's not that ondemand samples inbetween decision moments to see if the system
> is busy or not; the microaccounting that the scheduler does is used instead,
> and only at decision moments.

OK.. So up to now you've mostly said what you want of the scheduler to make a
better governor for the new Intel chips.

However a power aware scheduler/balancer needs to interact with the policy as a
whole; and I got confused by the fact that you never talked about
raising/lowering speeds. As said there's already a very 'fine' problem where
the cpufreq interacts with the utilization/runnable accounting we now do.

So we very much need to consider the entire stack; not just new hooks you want
to make it go fastest.

2013-07-16 18:44:18

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On 7/16/2013 10:38 AM, Peter Zijlstra wrote:
> On Mon, Jul 15, 2013 at 03:52:32PM -0700, Arjan van de Ven wrote:
>
>> yeah ondemand does this, but ondemand is actually a pretty bad governor.
>> not because of the sampling, but because of its algorithm.
>
> Is it good for any class of hardware still out there? Or should the thing be
> shot in the head?

for Intel, it's not too bad for anything predating Nehalem.


> You saying AMD patched the thing makes me confused; why would they patch a
> piece of crap?

it's still an improvement over something that's in use ;-)

>
>> HOWEVER, on modern CPUs, even many of the ARM ones, the frequency
>> when you're idle is zero anyway regardless of what you as OS ask for.
>
> Right, entire cores are power gated.
>
> So power wise the voltage you run at is important; so for hardware where lower
> frequencies allow lower voltage, does it still make sense to run the lowest
> possible voltage such that there is still some idle time?
>
> Or is the fact that you're running so much longer negating the power save from
> the lower voltage?

the race-to-idle argument again ;-)

>
>> Every 10 (or 100) milliseconds, ondemand makes a new P state decision.
>> It does this by asking the scheduler the time used, does a delta and
>> ends up at a utilization %age which then goes into a formula.
>> It's not that ondemand samples inbetween decision moments to see if the system
>> is busy or not; the microaccounting that the scheduler does is used instead,
>> and only at decision moments.
>
> OK.. So up to now you've mostly said what you want of the scheduler to make a
> better governor for the new Intel chips.
>
> However a power aware scheduler/balancer needs to interact with the policy as a
> whole; and I got confused by the fact that you never talked about
> raising/lowering speeds. As said there's already a very 'fine' problem where
> the cpufreq interacts with the utilization/runnable accounting we now do.

the interaction is "using the scheduler data using the scheduler provided function".

So I don't just want something that makes sense for today's Intel ;-)
We need something that has an interface that makes sense, where the things
that vary between chip generations/vendors are on the driver side
of the interface, and the things that are generic concepts or generic
enough to be useful are on the core side of the interface. Hardware has changed,
and hardware will be changing for all vendors for as far as we can even see
into the future, since power matters in the market a lot.
This means we need a level of interface that has some chance of being useful
for at least a while.

What frequency to run at is for me clearly a driver side thing since what
goes into choosing a P state that may translate into a frequency is a hardware
specific choice; the translation from "I need at least this much performance
and be power efficient at that" to a hardware register write is very hardware specific.

Things like "I need more compute capacity" or "This is very performance critical" or
"This is very latency critical" are a generic concepts.
As is "behavior is now changed a lot in <this direction>" as a callback kind of thing.
(just as "I no longer need it" is a generic concept to complement the first one)

The scheduler already has the utilization interfaces that are high enough level
for those who want to use utilization on the driver side to guide their hw decisions
(ondemand does not keep its own utilization, it uses straight scheduler data
for that); the very thin layer that ondemand and co add on top is the
percentage = (usage_at_time_b - usage_at_time_a) / (elapsed time) * 100%
formula so that they can do this over the interval of their choosing.
You can argue that the scheduler can do this; that's for me a small detail that we could
do either way; it's not anything relevant in the big picture.
With intervals being quite variable it might make sense to keep it on the driver side
just because it's hard to put this one formula into a nice interface.
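
Spelled out, that thin layer is no more than the below (a sketch with
made-up names, not the actual ondemand code; the busy time comes from the
scheduler's accounting):

	#include <linux/math64.h>	/* div64_u64() */

	struct util_sample {
		u64 busy_time;	/* cumulative non-idle time from the scheduler */
		u64 wall_time;	/* cumulative wall-clock time */
	};

	/* busy %-age between two samples, over an interval of the driver's choosing */
	static unsigned int busy_pct(struct util_sample *a, struct util_sample *b)
	{
		u64 busy = b->busy_time - a->busy_time;
		u64 wall = b->wall_time - a->wall_time;

		if (!wall)
			return 0;

		return div64_u64(100 * busy, wall);
	}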

2013-07-16 19:22:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Tue, Jul 16, 2013 at 11:44:15AM -0700, Arjan van de Ven wrote:
> the interaction is "using the scheduler data using the scheduler provided function".

I'm so not following.

> So I don't just want something that makes sense for todays Intel ;-)
> We need something that has an interface that makes sense, where the things
> that vary between chip generations/vendors are on the driver side
> of the interface, and the things that are generic concepts or generically
> enough useful are on the core side of the interface. Hardware has changed,
> and hardware will be changing for all vendors for as far as we can even see
> into the future, since power matters in the market a lot.
> This means we need a level of interface that has some chance of being useful
> for at least a while.
>
> What frequency to run at is for me clearly a driver side thing since what
> goes into choosing a P state that may translate into a frequency is a hardware
> specific choice; the translation from "I need at least this much performance
> and be power efficient at that" to a hardware register write is very hardware specific.

Be that as it may, we still need to consider the ramifications of these
'mysterious arch specific actions'.

> Things like "I need more compute capacity" or "This is very performance critical" or
> "This is very latency critical" are a generic concepts.
> As is "behavior is now changed a lot in <this direction>" as a callback kind of thing.
> (just as "I no longer need it" is a generic concept to complement the first one)

That is what cpufreq would like of the scheduler; but isn't at all
sufficient to solve the problems the scheduler has with cpufreq. You
still only seem to see things one way.

Suppose a 2 cpu system, one cpu is running 3/4 throttle, the other is
running at half speed. Both cpus are equally utilized. A new task
comes on.

Where do we run it?

We need to know that there's head-room on the 1/2 speed cpu and should
crank its pace and place the task there.

Even without the new task, it's not a 'balanced' situation, but it
appears that way because the cpu's are nearly equally utilized. Maybe if
we crank one cpu to the max it could run all tasks and have the other
cpu power gated. Or maybe they could both drop to 60% and run equal
loads.
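
To make it concrete, the kind of decision I'd like the balancer to be able
to make is roughly the sketch below; all the helpers in it are made up,
not existing interfaces:

	/* made-up sketch: pick the cpu with the most head-room once cranked
	 * to its max P-state; cpu_utilization() and cpu_max_capacity() are
	 * hypothetical helpers */
	static int find_headroom_cpu(void)
	{
		int cpu, best_cpu = -1;
		unsigned long best_room = 0;

		for_each_online_cpu(cpu) {
			unsigned long cap = cpu_max_capacity(cpu);	/* hypothetical */
			unsigned long util = cpu_utilization(cpu);	/* hypothetical */
			unsigned long room = cap > util ? cap - util : 0;

			if (room > best_room) {
				best_room = room;
				best_cpu = cpu;
			}
		}

		/* in the example above the 1/2 speed cpu wins: crank its pace
		 * and place the task there */
		return best_cpu;
	}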

We need feedback for these problems; but you're telling us new Intel
stuff can't really tell us much of anything :/

What I'm saying is: sure, the cpufreq driver might have chip specific
magic, but it very much needs to tell us things too; we can't have it do
its own thing and not care.

2013-07-16 19:57:47

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On 7/16/2013 12:21 PM, Peter Zijlstra wrote:

> Suppose a 2 cpu system, one cpu is running 3/4 throttle, the other is
> running at half speed. Both cpus are equally utilized. A new task
> comes on.
>
> Where do we run it?
>
> We need to know that there's head-room on the 1/2 speed cpu and should
> crank its pace and place the task there.

ok so you are interested in past "real" utilization of the hardware resources;
that is available generally (and tends to come from hardware counters, on ARM
as well).

you may not get it as a percentage, but in some absolute term, so you
can know which of the two is least loaded... that might be enough

Today cpufreq uses a library to get these counters; moving that library to the scheduler
or some similar place sounds like a great idea.
There is an argument about what to do on systems where such counters are either
absent or very expensive, and that's a good question; maybe one of the ARM folks
can say how expensive these counters are for them to see if there really is such
a problem?
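
On the Intel side such a counter pair is APERF/MPERF; a rough sketch of
using them (local cpu only, not the actual cpufreq library code):

	#include <linux/math64.h>
	#include <asm/msr.h>

	struct aperfmperf_sample {
		u64 aperf;	/* counts at the actual clock, only while not idle */
		u64 mperf;	/* counts at the reference clock, only while not idle */
	};

	/* must be called on the cpu being measured */
	static void read_aperfmperf(struct aperfmperf_sample *s)
	{
		rdmsrl(MSR_IA32_APERF, s->aperf);
		rdmsrl(MSR_IA32_MPERF, s->mperf);
	}

	/* ratio of actual to reference clock while not idle, in % */
	static unsigned int delivered_perf(struct aperfmperf_sample *prev,
					   struct aperfmperf_sample *cur)
	{
		u64 aperf = cur->aperf - prev->aperf;
		u64 mperf = cur->mperf - prev->mperf;

		return mperf ? div64_u64(100 * aperf, mperf) : 0;
	}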

> Even without the new task; its not a 'balanced' situation, but it
> appears that way because the cpu's are nearly equally utilized. Maybe if
> we crank one cpu to the max it could run all tasks and have the other
> cpu power gated. Or maybe they could both drop to 60% and run equal
> loads.

which way is better for energy consumption is likely a per arch question,
and having the architecture provide some runtime configuration about how
valuable it is to spread out sounds sensible to me.

then the question of how much remaining capacity; this is a hard one, and not just
for Intel. Almost all mobile devices today are thermally constrained, ARM and Intel
alike (at least the higher performance ones)... the curse of wanting very thin and light
phones that are made of thermally isolating plastic (so that radio waves can go through)
and have a nice and bright screen...

With thermals as a whole you tend to not know you're hitting the wall until you try;
you may think you can go another gigahertz on a core, but when you go there you near instantly
hit a thermal limit that whacks you waaaay back down again.

(that reminds me, I'd love to investigate having the scheduler look at core temperature as one of the
factors in its decisions... that might actually be one of the more interesting inputs to
scheduler decisions, both in terms of capacity planning and efficiency)


> We need feedback for these problems; but you're telling us new Intel
> stuff can't really tell us much of anything :/

s/new/existing/ to be honest; chips we've been selling in the last 4+ years.

> What I'm saying is; sure the cpufreq driver might have chip specific
> magic but it very much needs to tell us things too we can't have it do
> its own thing and not care.

some of these things may come from parts other than the P state selection;
a lot of what you're asking for will tend to come from counters, I suspect.

2013-07-16 20:17:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Tue, Jul 16, 2013 at 12:57:34PM -0700, Arjan van de Ven wrote:
> then the question of how much remaining capacity; this is a hard one, and not just
> for Intel. Almost all mobile devices today are thermally constrained, ARM and Intel
> alike (at least the higher performance ones)... the curse of wanting very thin and light
> phones that are made of thermally isolating plastic (so that radio waves can go through)
> and have a nice and bright screen...

Right, so we might need to track a !idle avg over the thermal domain to
guesstimate the head-room and inter-cpu relations.
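
Something like the below, as a sketch (the thermal_domain struct, the
per-cpu helper and the decay weight are all made up):

	/* made-up sketch of a decaying !idle average per thermal domain */
	struct thermal_domain {
		struct cpumask cpus;
		unsigned long busy_avg;
	};

	static void update_domain_busy(struct thermal_domain *td)
	{
		unsigned long busy = 0;
		int cpu;

		for_each_cpu(cpu, &td->cpus)
			busy += cpu_busy_time(cpu);	/* hypothetical per-cpu !idle time */

		/* crude exponential decay; the 3/4 weight is a guess, not tuned */
		td->busy_avg = (3 * td->busy_avg + busy) / 4;
	}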

But yeah, I suppose what I've been saying is that we need to drag
cpufreq into the SMP era. Only considering a single cpu isn't going to
work anymore.

2013-07-16 20:21:07

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On 7/16/2013 1:17 PM, Peter Zijlstra wrote:
> On Tue, Jul 16, 2013 at 12:57:34PM -0700, Arjan van de Ven wrote:
>> then the question of how much remaining capacity; this is a hard one, and not just
>> for Intel. Almost all mobile devices today are thermally constrained, ARM and Intel
>> alike (at least the higher performance ones)... the curse of wanting very thin and light
>> phones that are made of thermally isolating plastic (so that radio waves can go through)
>> and have a nice and bright screen...
>
> Right, so we might need to track a !idle avg over the thermal domain to
> guestimate the head-room and inter-cpu relations.

mostly the thermal domain will be the cpu half of the SOC if not the whole SOC, in the mobile space.
For big servers, sure.
But why count idle average when most chips have thermal sensors built in?
(needed for the same thermal limiting reasons)

2013-07-16 20:32:09

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On 7/16/2013 1:17 PM, Peter Zijlstra wrote:
> On Tue, Jul 16, 2013 at 12:57:34PM -0700, Arjan van de Ven wrote:
>> then the question of how much remaining capacity; this is a hard one, and not just
>> for Intel. Almost all mobile devices today are thermally constrained, ARM and Intel
>> alike (at least the higher performance ones)... the curse of wanting very thin and light
>> phones that are made of thermally isolating plastic (so that radio waves can go through)
>> and have a nice and bright screen...
>
> Right, so we might need to track a !idle avg over the thermal domain to
> guestimate the head-room and inter-cpu relations.


btw one thing to realize is that for many of the thermal limits in mobile devices the CPU cores
aren't the primary cause.
Other components on the board (screens, modems etc) as well as the GPU likely impact
this at least as much as actual cpu usage does.
It's usually just that the CPU is the easiest to control down,
so that tends to be the first one impacted.

2013-07-16 20:47:20

by David Lang

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Mon, 15 Jul 2013, Arjan van de Ven wrote:

> On 7/15/2013 2:03 PM, Peter Zijlstra wrote:
>> Well, if you ever want to go faster there must've been a moment to slow
>> down.
>> Without means and reason to slow down the entire 'can I go fast noaw pls?'
>> thing simply doesn't make sense.
>
> I kind of tried to hint at this
>
> there's either
>
> go_fastest_now()
>
> with the contract that the policy drivers can override this after some time
> (few ms)
>
> or you have to treat it as a lease:
>
> go_fastest()
>
> and then
>
> no_need_to_go_fastest_anymore_so_forget_I_asked()
>
> this is NOT the same as
>
> go_slow_now()
>
> the former has a specific request, and then an end to that specific request,
> the later is just a new unbounded command
>
> if you have requests (that either time out or get canceled), you can have
> requests from multiple parts of the kernel (and potentially even from
> hardware in the thermal case), and some arbiter
> who resolves multiple requests existing.
>
> if you only have unbounded commands, you cannot really have such an arbiter.

Sometimes the user has something running that they want to keep running, but
they don't need it to be going fast.

An example is if you are monitoring something.

Unless you can go to sleep between monitoring polls, it makes more sense to run
slowly than to run at full speed with lots of idle cycles.

David Lang

2013-07-17 14:16:14

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Tue, Jul 16, 2013 at 04:23:08PM +0100, Arjan van de Ven wrote:
> On 7/16/2013 5:42 AM, Catalin Marinas wrote:
> > Morten's power scheduler tries to address the above and it will grow
> > into controlling a new model of power driver (and taking into account
> > Arjan's and others' comments regarding the API). At the same time, we
> > need some form of task packing. The power scheduler can drive this
> > (currently via cpu_power) or can simply turn a knob if there are better
> > options that will be accepted in the scheduler.
>
> how much would you be helped if there was a simple switch
>
> sort left versus sort right
>
> (assuming the big cores are all either low or high numbers)

It helps a bit compared to the current behaviour but there is a lot of
room for improvement.

> the sorting is mostly statistical, but that's good enough in practice..
> each time a task wakes up, you get a bias towards either low or high
> numbered idle cpus

If cores within a cluster (socket) are not power-gated individually
(implementation dependent), it makes more sense to spread the tasks
among the cores to either get a lower frequency or just get to idle
quicker. For little cores, even when they are individually power-gated,
they don't consume much so we would rather spread the tasks equally.

> very quickly all tasks will be on one side, unless your system is so
> loaded that all cpus are full.

It should be more like left socket vs both sockets with the possibility
of different balancing within a socket. But then we get back to the
sched_smt/sched_mc power aware scheduling that was removed from the
kernel.

It's also important when to make this decision to sort left vs right and
we want to avoid migrating threads unnecessarily. There could be small
threads (e.g. an mp3 decoding thread) that should stay on the little
core.

Power aware scheduling should not affect performance (read: benchmarks),
but the scheduler could take power implications into account. The hard
part is formalising this with differences between architectures and
SoCs. Maybe a low-level driver or arch hook like "get me the most power
efficient CPU that can run a task" could help, but it's not clear how
this would work (we can't easily predict what the future load will be).

Our proposal is to split the balancing into two problems: equal
balancing vs. CPU capacity (the latter can be improved to address arch
concerns). These two problems can later be unified once we have a better
understanding of their implications across architectures.

For big.LITTLE we could work around the scheduler (in a very hacky way)
with a combination of a pstate/powerclamp driver that forces idle on the
big cores when they are not needed, but I would rather get the scheduler
to make such decisions.

--
Catalin

2013-07-24 13:16:44

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Sat, Jul 13, 2013 at 07:49:09AM +0100, Peter Zijlstra wrote:
> On Tue, Jul 09, 2013 at 04:55:29PM +0100, Morten Rasmussen wrote:
> > Hi,
> >
> > This patch set is an initial prototype aiming at the overall power-aware
> > scheduler design proposal that I previously described
> > <http://permalink.gmane.org/gmane.linux.kernel/1508480>.
> >
> > The patch set introduces a cpu capacity managing 'power scheduler' which lives
> > by the side of the existing (process) scheduler. Its role is to monitor the
> > system load and decide which cpus that should be available to the process
> > scheduler.
>
> Hmm...
>
> This looks like a userspace hotplug deamon approach lifted to kernel space :/

I know I'm arriving a bit late to the party...

I do see what you mean, but I think comparing it to a userspace hotplug
daemon is a bit harsh :) As Catalin has already pointed out, the
intention behind the design is to separate cpu capacity management from
load-balancing and runqueue management to avoid adding further
complexity to the main load balancer.

> How about instead of layering over the load-balancer to constrain its behaviour
> you change the behaviour to not need constraint? Fix it so it does the right
> thing, instead of limiting it.
>
> I don't think its _that_ hard to make the balancer do packing over spreading.
> The power balance code removed in 8e7fbcbc had things like that (although it
> was broken). And I'm sure I've seen patches over the years that did similar
> things. Didn't Vincent and Alex also do things like that?
>
> We should take the good bits from all that and make something of it. And I
> think its easier now that we have the per task and per rq utilization numbers
> [1].

IMHO proper packing (capacity management) is quite a complex problem
that will require major modifications to the load-balance logic if we
want to integrate it there. Essentially, it means getting rid of all the
implicit assumptions that only made sense when task load weight was
static and we didn't have a clue about the true cpu load.

I don't think a load-balance code clean-up can be avoided even if we go
with the power scheduler design. For example, the scaling of
load weight by priority makes packing based on task load weight so
conservative that it is not usable. Any tiny high priority task may
completely take over a cpu if it happens to be on the runqueue during
load balance. Vincent and Alex don't use task load weight in their
packing patches but use their own metrics instead.

I agree that we should take the good bits of those patches, but they are
far from the complete solution we are looking for in their current form.

The proposed design would let us deal with the complexity of interacting
power drivers and capacity management outside the main scheduler and use
it more or less unmodified, at least to begin with. Down the line, we
will have to have a look at the load balance logic. But hopefully it
will be simpler or at least not more complex than it is now.

>
> Just start by changing the balancer to pack instead of spread. Once that works,
> see where the two modes diverge and put a knob in.
>
> Then worry about power thingies.

I don't think packing and the power stuff can be considered completely
orthogonal. Packing should take power stuff like frequency domains
and cluster/package C-states into account.


>
> I'm not entirely sold on differentiating between short running and other tasks
> either. Although I suppose I see where that comes from. A task that would run
> 50% on a big core would unlikely be qualified as small, however if it would
> require 85% of a small core and there's room on the small cores its a good move
> to run it there.
>
> So where's the limit for being small? It seems like an artificial limit and
> such should be avoided where possible.

I agree. But packing enough small tasks onto a single cpu to get it to 90%
(or whatever we consider to be full) is not ideal either, as the tasks
may wait a very long time to run compared to their actual running time.

Vincent's patches actually try to address this problem by reducing the
'full' threshold as the number of tasks on the cpu
increases. If I remember correctly, Vincent has removed the small task
limit in his latest patches.

For packing, I don't think we need a strict limit for when a task is
small. Just pack until the cpu is full or the running/runnable ratio of
the tasks on the runqueue gets too low.
There is no small task limit in the very simplistic packing done in this
patch set either.
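
For reference, the 'full' check in this patch set boils down to the busy
%-age computed from the per-rq tracked load, roughly as below (simplified
from update_cpu_load() in kernel/sched/power.c; cpu_is_full() itself is a
made-up helper that does not exist in the patches):

	static bool cpu_is_full(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);
		u32 sum = rq->avg.runnable_avg_sum;
		u32 period = rq->avg.runnable_avg_period;

		/* busy %-age over the tracked window vs. the CPU_FULL threshold */
		return (100 * sum) / (period + 1) >= CPU_FULL;
	}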

Part of the reason for trying to identify small tasks is that these are
often not performance sensitive. This is related to the 'which task is
important/this task is performance sensitive' discussion.

Morten

2013-07-24 13:50:16

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Wed, Jul 17, 2013 at 03:14:26PM +0100, Catalin Marinas wrote:
> On Tue, Jul 16, 2013 at 04:23:08PM +0100, Arjan van de Ven wrote:
> > On 7/16/2013 5:42 AM, Catalin Marinas wrote:
> > > Morten's power scheduler tries to address the above and it will grow
> > > into controlling a new model of power driver (and taking into account
> > > Arjan's and others' comments regarding the API). At the same time, we
> > > need some form of task packing. The power scheduler can drive this
> > > (currently via cpu_power) or can simply turn a knob if there are better
> > > options that will be accepted in the scheduler.
> >
> > how much would you be helped if there was a simple switch
> >
> > sort left versus sort right
> >
> > (assuming the big cores are all either low or high numbers)
>
> It helps a bit compared to the current behaviour but there is a lot of
> room for improvement.
>
> > the sorting is mostly statistical, but that's good enough in practice..
> > each time a task wakes up, you get a bias towards either low or high
> > numbered idle cpus
>
> If cores within a cluster (socket) are not power-gated individually
> (implementation dependent), it makes more sense to spread the tasks
> among the cores to either get a lower frequency or just get to idle
> quicker. For little cores, even when they are individually power-gated,
> they don't consume much so we would rather spread the tasks equally.
>
> > very quickly all tasks will be on one side, unless your system is so
> > loaded that all cpus are full.
>
> It should be more like left socket vs both sockets with the possibility
> of different balancing within a socket. But then we get back to the
> sched_smt/sched_mc power aware scheduling that was removed from the
> kernel.
>
> It's also important when to make this decision to sort left vs right and
> we want to avoid migrating threads unnecessarily. There could be small
> threads (e.g. an mp3 decoding thread) that should stay on the little
> core.

Given that the power topology is taken into account, a sort
left/right-like mechanism would only help performance insensitive tasks
on big.LITTLE. Performance sensitive tasks that each can use more than
a little cpu should move in the opposite direction. Well, directly to a
big cpu, even if some little cpus are idle.

It can be discussed whether smaller performance sensitive tasks that
would fit on a little cpu should be put on a little or big cpu. That
would depend on the nature of the task and if other tasks depend on it.

2013-07-24 15:16:42

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal


> Given that the power topology is taken into account, a sort
> left/right-like mechanism would only help performance insensitive tasks
> on big.LITTLE. Performance sensitive tasks that each can use more than
> a little cpu should move in the opposite direction. Well, directly to a
> big cpu, even if some little cpus are idle.
>
> It can be discussed whether smaller performance sensitive tasks that
> would fit on a little cpu should be put on a little or big cpu. That
> would depend on the nature of the task and if other tasks depend on it.

yeah that makes it fun

just a question for my education; is there overlap between big and little?
meaning, is the "highest speed of little" as fast, or faster than "lowest speed of big"
or are those strictly disjoint?

(if there's overlap that gives some room for the scheduler to experiment)

2013-07-24 16:46:12

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Wed, Jul 24, 2013 at 04:16:36PM +0100, Arjan van de Ven wrote:
>
> > Given that the power topology is taken into account, a sort
> > left/right-like mechanism would only help performance insensitive tasks
> > on big.LITTLE. Performance sensitive tasks that each can use more than
> > a little cpu should move in the opposite direction. Well, directly to a
> > big cpu, even if some little cpus are idle.
> >
> > It can be discussed whether smaller performance sensitive tasks that
> > would fit on a little cpu should be put on a little or big cpu. That
> > would depend on the nature of the task and if other tasks depend on it.
>
> yeah that makes it fun
>
> just a question for my education; is there overlap between big and little?
> meaning, is the "highest speed of little" as fast, or faster than "lowest speed of big"
> or are those strictly disjoint?
>
> (if there's overlap that gives some room for the scheduler to experiment)
>

It is implementation dependent. And it depends on how you define
performance :-)

That is hardly an answer to your question.

The big and little uarchs are quite different and typically support
different frequencies. For memory bound tasks there is more likely to be
an overlap than for cpu intensive tasks.

I would expect performance to be disjoint for most tasks. If there was
an overlap, the big would probably be less power efficient (as in
energy/instruction) than the little so you would prefer to run on the
little anyway.

In what way would you use the overlap?

2013-07-24 16:48:46

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

>
> I would expect performance to be disjoint for most tasks. If there was
> an overlap, the big would probably be less power efficient (as in
> energy/instruction) than the little so you would prefer to run on the
> little anyway.
>
> In what way would you use the overlap?

if the scheduler thinks a task would be better off on the other side
than where it is now, it could first move it into the "overlap area" on the
same side by means of experiment, and if the task behaves as expected there,
THEN move it over.

2013-07-25 08:00:35

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

On Wed, Jul 24, 2013 at 05:48:42PM +0100, Arjan van de Ven wrote:
> >
> > I would expect performance to be disjoint for most tasks. If there was
> > an overlap, the big would probably be less power efficient (as in
> > energy/instruction) than the little so you would prefer to run on the
> > little anyway.
> >
> > In what way would you use the overlap?
>
> if the scheduler thinks a task would be better off on the other side
> than where it is now, it could first move it into the "overlap area" on the
> same side by means of experiment, and if the task behaves as expected there,
> THEN move it over.

You could do that, but due to the different uarchs you wouldn't really
know how a cpu bound task would behave on the other side. It would
probably work for memory bound tasks.

Also, for interactive applications (smartphones and such) intermediate
steps will increase latency when going little to big. Going the other
way would be fine.