2015-05-04 22:11:35

by Mike Turquette

Subject: [PATCH 0/4] scheduler-based cpu frequency scaling

This series implements an event-driven cpufreq governor that scales cpu
frequency as a function of cfs runqueue utilization. The intent of this RFC is
to get some discussion going about how the scheduler can become the policy
engine for selecting cpu frequency, what limitations exist, and what design
we want to pursue to get to a solution.

This work is a different take on the patches I posted in November:
http://lkml.kernel.org/r/<[email protected]>

This series depends on having frequency-invariant representations for load.
This requires Vincent's recently merged cpu capacity rework patches, as well as
a new patch from Morten included here. Morten's patch will likely make an
appearance in his energy aware scheduling v4 series.

Thanks to Juri Lelli <[email protected]> for contributing to the development
of the governor.

A git branch with these patches can be pulled from here:
https://git.linaro.org/people/mike.turquette/linux.git sched-freq

Smoke testing has been done on an OMAP4 Pandaboard and an Exynos 5800
Chromebook2. Extensive benchmarking and regression testing has not yet been
done. Before sinking too much time into extensive testing I'd like to get
feedback on the general design.

Michael Turquette (3):
sched: sched feature for cpu frequency selection
sched: export get_cpu_usage & capacity_orig_of
sched: cpufreq_sched_cfs: PELT-based cpu frequency scaling

Morten Rasmussen (1):
arm: Frequency invariant scheduler load-tracking support

arch/arm/include/asm/topology.h | 7 +
arch/arm/kernel/smp.c | 53 ++++++-
arch/arm/kernel/topology.c | 17 +++
drivers/cpufreq/Kconfig | 24 ++++
include/linux/cpufreq.h | 3 +
kernel/sched/Makefile | 1 +
kernel/sched/cpufreq_cfs.c | 311 ++++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 48 +++----
kernel/sched/features.h | 6 +
kernel/sched/sched.h | 39 +++++
10 files changed, 475 insertions(+), 34 deletions(-)
create mode 100644 kernel/sched/cpufreq_cfs.c

--
1.9.1


2015-05-04 22:11:58

by Mike Turquette

Subject: [PATCH 1/4] arm: Frequency invariant scheduler load-tracking support

From: Morten Rasmussen <[email protected]>

Implements arch-specific function to provide the scheduler with a
frequency scaling correction factor for more accurate load-tracking. The
factor is:

(current_freq(cpu) << SCHED_CAPACITY_SHIFT) / max_freq(cpu)

This implementation only provides frequency invariance. No
micro-architecture invariance yet.

Cc: Russell King <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
Morten supplied this patch to me in "preview" form. He hasn't posted it
yet that I am aware of and I expect it to show up in his EAS v4 patch
series. The patch may look different at that point. It replaces the
following two patches from his EAS v3 series:

"cpufreq: Architecture specific callback for frequency changes"
"arm: Frequency invariant scheduler load-tracking support"

arch/arm/include/asm/topology.h | 7 ++++++
arch/arm/kernel/smp.c | 53 +++++++++++++++++++++++++++++++++++++++--
arch/arm/kernel/topology.c | 17 +++++++++++++
3 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 2fe85ff..4b985dc 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -24,6 +24,13 @@ void init_cpu_topology(void);
void store_cpu_topology(unsigned int cpuid);
const struct cpumask *cpu_coregroup_mask(int cpu);

+#define arch_scale_freq_capacity arm_arch_scale_freq_capacity
+struct sched_domain;
+extern
+unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
+DECLARE_PER_CPU(atomic_long_t, cpu_freq_capacity);
+
#else

static inline void init_cpu_topology(void) { }
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 86ef244..297ce1b 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -672,12 +672,34 @@ static DEFINE_PER_CPU(unsigned long, l_p_j_ref);
static DEFINE_PER_CPU(unsigned long, l_p_j_ref_freq);
static unsigned long global_l_p_j_ref;
static unsigned long global_l_p_j_ref_freq;
+static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
+DEFINE_PER_CPU(atomic_long_t, cpu_freq_capacity);
+
+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling through arch_scale_freq_capacity()
+ * (implemented in topology.c).
+ */
+static inline
+void scale_freq_capacity(int cpu, unsigned long curr, unsigned long max)
+{
+ unsigned long capacity;
+
+ if (!max)
+ return;
+
+ capacity = (curr << SCHED_CAPACITY_SHIFT) / max;
+ atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), capacity);
+}

static int cpufreq_callback(struct notifier_block *nb,
unsigned long val, void *data)
{
struct cpufreq_freqs *freq = data;
int cpu = freq->cpu;
+ unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));

if (freq->flags & CPUFREQ_CONST_LOOPS)
return NOTIFY_OK;
@@ -702,6 +724,9 @@ static int cpufreq_callback(struct notifier_block *nb,
per_cpu(l_p_j_ref_freq, cpu),
freq->new);
}
+
+ scale_freq_capacity(cpu, freq->new, max);
+
return NOTIFY_OK;
}

@@ -709,11 +734,35 @@ static struct notifier_block cpufreq_notifier = {
.notifier_call = cpufreq_callback,
};

+static int cpufreq_policy_callback(struct notifier_block *nb,
+ unsigned long val, void *data)
+{
+ struct cpufreq_policy *policy = data;
+ int i;
+
+ for_each_cpu(i, policy->cpus) {
+ scale_freq_capacity(i, policy->cur, policy->max);
+ atomic_long_set(&per_cpu(cpu_max_freq, i), policy->max);
+ }
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block cpufreq_policy_notifier = {
+ .notifier_call = cpufreq_policy_callback,
+};
+
static int __init register_cpufreq_notifier(void)
{
- return cpufreq_register_notifier(&cpufreq_notifier,
+ int ret;
+
+ ret = cpufreq_register_notifier(&cpufreq_notifier,
CPUFREQ_TRANSITION_NOTIFIER);
+ if (ret)
+ return ret;
+
+ return cpufreq_register_notifier(&cpufreq_policy_notifier,
+ CPUFREQ_POLICY_NOTIFIER);
}
core_initcall(register_cpufreq_notifier);
-
#endif
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 08b7847..9c09e6e 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -169,6 +169,23 @@ static void update_cpu_capacity(unsigned int cpu)
cpu, arch_scale_cpu_capacity(NULL, cpu));
}

+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling (arch_scale_freq_capacity()). The scaling
+ * factor is updated in smp.c
+ */
+unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+{
+ unsigned long curr = atomic_long_read(&per_cpu(cpu_freq_capacity, cpu));
+
+ if (!curr)
+ return SCHED_CAPACITY_SCALE;
+
+ return curr;
+}
+
#else
static inline void parse_dt_topology(void) {}
static inline void update_cpu_capacity(unsigned int cpuid) {}
--
1.9.1

2015-05-04 22:11:48

by Mike Turquette

Subject: [PATCH 2/4] sched: sched feature for cpu frequency selection

This patch introduces the SCHED_ENERGY_FREQ sched feature. When SCHED_DEBUG
is defined it is implemented using jump labels; when SCHED_DEBUG is not
defined it is statically false. In both cases it is disabled by default.

Signed-off-by: Michael Turquette <[email protected]>
---
kernel/sched/fair.c | 5 +++++
kernel/sched/features.h | 6 ++++++
2 files changed, 11 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 46855d0..75aec8d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4207,6 +4207,11 @@ static inline void hrtick_update(struct rq *rq)
}
#endif

+static inline bool sched_energy_freq(void)
+{
+ return sched_feat(SCHED_ENERGY_FREQ);
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 91e33cd..77381cf 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -96,3 +96,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
*/
SCHED_FEAT(NUMA_RESIST_LOWER, false)
#endif
+
+/*
+ * Scheduler-driven CPU frequency selection aimed to save energy based on
+ * load tracking
+ */
+SCHED_FEAT(SCHED_ENERGY_FREQ, false)
--
1.9.1

2015-05-04 22:13:13

by Mike Turquette

Subject: [PATCH 3/4] sched: export get_cpu_usage & capacity_orig_of

get_cpu_usage and capacity_orig_of are useful for a cpu frequency
scaling policy which is based on cfs load tracking and cpu capacity
metrics. Expose these calls in sched.h so that they can be used in such
a policy.

Signed-off-by: Michael Turquette <[email protected]>
---
kernel/sched/fair.c | 32 --------------------------------
kernel/sched/sched.h | 33 +++++++++++++++++++++++++++++++++
2 files changed, 33 insertions(+), 32 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 75aec8d..9e37d49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4366,11 +4366,6 @@ static unsigned long capacity_of(int cpu)
return cpu_rq(cpu)->cpu_capacity;
}

-static unsigned long capacity_orig_of(int cpu)
-{
- return cpu_rq(cpu)->cpu_capacity_orig;
-}
-
static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -4784,33 +4779,6 @@ next:
done:
return target;
}
-/*
- * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
- * tasks. The unit of the return value must be the one of capacity so we can
- * compare the usage with the capacity of the CPU that is available for CFS
- * task (ie cpu_capacity).
- * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
- * CPU. It represents the amount of utilization of a CPU in the range
- * [0..SCHED_LOAD_SCALE]. The usage of a CPU can't be higher than the full
- * capacity of the CPU because it's about the running time on this CPU.
- * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
- * because of unfortunate rounding in avg_period and running_load_avg or just
- * after migrating tasks until the average stabilizes with the new running
- * time. So we need to check that the usage stays into the range
- * [0..cpu_capacity_orig] and cap if necessary.
- * Without capping the usage, a group could be seen as overloaded (CPU0 usage
- * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity
- */
-static int get_cpu_usage(int cpu)
-{
- unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
- unsigned long capacity = capacity_orig_of(cpu);
-
- if (usage >= SCHED_LOAD_SCALE)
- return capacity;
-
- return (usage * capacity) >> SCHED_LOAD_SHIFT;
-}

/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e0e1299..8bf35d3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1396,6 +1396,39 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
}
#endif

+static inline unsigned long capacity_orig_of(int cpu)
+{
+ return cpu_rq(cpu)->cpu_capacity_orig;
+}
+
+/*
+ * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
+ * tasks. The unit of the return value must be the one of capacity so we can
+ * compare the usage with the capacity of the CPU that is available for CFS
+ * task (ie cpu_capacity).
+ * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
+ * CPU. It represents the amount of utilization of a CPU in the range
+ * [0..SCHED_LOAD_SCALE]. The usage of a CPU can't be higher than the full
+ * capacity of the CPU because it's about the running time on this CPU.
+ * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
+ * because of unfortunate rounding in avg_period and running_load_avg or just
+ * after migrating tasks until the average stabilizes with the new running
+ * time. So we need to check that the usage stays into the range
+ * [0..cpu_capacity_orig] and cap if necessary.
+ * Without capping the usage, a group could be seen as overloaded (CPU0 usage
+ * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity
+ */
+static inline int get_cpu_usage(int cpu)
+{
+ unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+ unsigned long capacity = capacity_orig_of(cpu);
+
+ if (usage >= SCHED_LOAD_SCALE)
+ return capacity;
+
+ return (usage * capacity) >> SCHED_LOAD_SHIFT;
+}
+
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
--
1.9.1

2015-05-04 22:12:10

by Mike Turquette

Subject: [PATCH 4/4] sched: cpufreq_cfs: pelt-based cpu frequency scaling

Scheduler-driven cpu frequency selection is desirable as part of the
on-going effort to make the scheduler better aware of energy
consumption. No piece of the Linux kernel has a better view of the
factors that affect a cpu frequency selection policy than the
scheduler[0], and this patch is an attempt to get that discussion going
again.

This patch implements a cpufreq governor that directly accesses
scheduler statistics, in particular the pelt data from cfs via the
get_cpu_usage() function.

Put plainly, this governor selects the lowest cpu frequency that will
prevent a runqueue from being over-utilized (until we hit the highest
frequency of course). This is accomplished by requesting a frequency
that matches the current capacity utilization, plus a margin.

Unlike the previous posting from 2014[1] this governor implements a
"follow the utilization" method, where utilization is defined as the
frequency-invariant product of cfs.utilization_load_avg and
cpu_capacity_orig.

This governor is event-driven. There is no polling loop to check cpu
idle time nor any other method which is unsynchronized with the
scheduler. The entry points for this policy are in fair.c:
enqueue_task_fair, dequeue_task_fair and task_tick_fair.

This policy is implemented using the cpufreq governor interface for two
main reasons:

1) re-using the cpufreq machine drivers without using the governor
interface is hard.

2) using the cpufreq interface allows us to switch between the
scheduler-driven policy and legacy cpufreq governors such as ondemand at
run-time. This is very useful for comparative testing and tuning.

Finally, it is worth mentioning that this approach neglects all
scheduling classes except for cfs. It is possible to add support for
deadline and other classes here, but I also wonder if a
multi-governor approach would be a more maintainable solution, where the
cpufreq core aggregates the constraints set by multiple governors.
Supporting such an approach in the cpufreq core would also allow
peripheral devices to place constraints on cpu frequency without having
to hack such behavior in at the governor level.

Thanks to Juri Lelli <[email protected]> for doing a good bit of
testing, bug fixing and contributing towards the design. I've included
his sign-off with his permission after stealing lots of his code.

[0] http://article.gmane.org/gmane.linux.kernel/1499836
[1] https://lkml.org/lkml/2014/10/22/22

Signed-off-by: Michael Turquette <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
For those that are very curious, there were recently two previous
postings of these patches to the public eas-dev mailing list. Of
interest in those threads is the discussion around using a utilization
threshold, versus purely matching frequency to capacity utilization,
versus using a margin (as this series does):

https://lists.linaro.org/pipermail/eas-dev/2015-April/000074.html
https://lists.linaro.org/pipermail/eas-dev/2015-April/000115.html

drivers/cpufreq/Kconfig | 24 ++++
include/linux/cpufreq.h | 3 +
kernel/sched/Makefile | 1 +
kernel/sched/cpufreq_cfs.c | 311 +++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 11 ++
kernel/sched/sched.h | 6 +
6 files changed, 356 insertions(+)
create mode 100644 kernel/sched/cpufreq_cfs.c

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index a171fef..83d51b4 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
Be aware that not all cpufreq drivers support the conservative
governor. If unsure have a look at the help section of the
driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_CFS
+ bool "cfs"
+ select CPU_FREQ_GOV_CFS
+ select CPU_FREQ_GOV_PERFORMANCE
+ help
+ Use the CPUfreq governor 'cfs' as default. This scales
+ cpu frequency from the scheduler as per-entity load tracking
+ statistics are updated.
endchoice

config CPU_FREQ_GOV_PERFORMANCE
@@ -183,6 +192,21 @@ config CPU_FREQ_GOV_CONSERVATIVE

If in doubt, say N.

+config CPU_FREQ_GOV_CFS
+ tristate "'cfs' cpufreq governor"
+ depends on CPU_FREQ
+ select CPU_FREQ_GOV_COMMON
+ help
+ 'cfs' - this governor scales cpu frequency from the
+ scheduler as a function of cpu capacity utilization. It does
+ not evaluate utilization on a periodic basis (as ondemand
+ does) but instead is invoked from the completely fair
+ scheduler when updating per-entity load tracking statistics.
+ Latency to respond to changes in load is improved over polling
+ governors due to its event-driven design.
+
+ If in doubt, say N.
+
comment "CPU frequency scaling drivers"

config CPUFREQ_DT
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 2ee4888..62e8152 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -485,6 +485,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
extern struct cpufreq_governor cpufreq_gov_conservative;
#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CFS)
+extern struct cpufreq_governor cpufreq_cfs;
+#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_cfs)
#endif

/*********************************************************************
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 46be870..466960d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_CFS) += cpufreq_cfs.o
diff --git a/kernel/sched/cpufreq_cfs.c b/kernel/sched/cpufreq_cfs.c
new file mode 100644
index 0000000..49edc81
--- /dev/null
+++ b/kernel/sched/cpufreq_cfs.c
@@ -0,0 +1,311 @@
+/*
+ * Copyright (C) 2015 Michael Turquette <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/percpu.h>
+#include <linux/irq_work.h>
+
+#include "sched.h"
+
+#define MARGIN_PCT 125 /* taken from imbalance_pct = 125 */
+#define THROTTLE_NSEC 50000000 /* 50ms default */
+
+/**
+ * gov_data - per-policy data internal to the governor
+ * @throttle: next throttling period expiry. Derived from throttle_nsec
+ * @throttle_nsec: throttle period length in nanoseconds
+ * @task: worker thread for dvfs transition that may block/sleep
+ * @irq_work: callback used to wake up worker thread
+ *
+ * struct gov_data is the per-policy cpufreq_cfs-specific data structure. A
+ * per-policy instance of it is created when the cpufreq_cfs governor receives
+ * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
+ * member of struct cpufreq_policy.
+ *
+ * Readers of this data must call down_read(policy->rwsem). Writers must
+ * call down_write(policy->rwsem).
+ */
+struct gov_data {
+ ktime_t throttle;
+ unsigned int throttle_nsec;
+ struct task_struct *task;
+ struct irq_work irq_work;
+ struct cpufreq_policy *policy;
+};
+
+/**
+ * cpufreq_cfs_select_freq - pick the next frequency for a cpu
+ * @policy: the cpufreq policy whose frequency may be changed
+ *
+ * cpufreq_cfs_select_freq selects a frequency based on pelt load statistics
+ * tracked by cfs. First it finds the most utilized cpu in the policy and then
+ * maps that utilization value onto a cpu frequency and returns it.
+ *
+ * Additionally, cpufreq_cfs_select_freq adds a margin to the cpu utilization value
+ * before converting it to a frequency. The margin is derived from MARGIN_PCT,
+ * which itself is inspired by imbalance_pct in cfs. This is needed to
+ * proactively increase frequency in the case of increasing load.
+ *
+ * This approach attempts to maintain headroom of 25% unutilized cpu capacity.
+ * A traditional way of doing this is to take 75% of the current capacity and
+ * check if current utilization crosses that threshold. The only problem with
+ * that approach is determining the next cpu frequency target if that threshold
+ * is crossed.
+ *
+ * Instead of using the 75% threshold, cpufreq_cfs_select_freq adds a 25%
+ * utilization margin to the utilization and converts that to a frequency. This
+ * removes conditional logic around checking thresholds and better supports
+ * drivers that use non-discretized frequency ranges (i.e. no pre-defined
+ * frequency tables or operating points).
+ *
+ * Returns frequency selected.
+ */
+static unsigned long cpufreq_cfs_select_freq(struct cpufreq_policy *policy)
+{
+ int cpu = 0;
+ struct gov_data *gd;
+ unsigned int max_usage = 0, usage = 0;
+
+ if (!policy->governor_data)
+ return 0;
+
+ gd = policy->governor_data;
+
+ /*
+ * get_cpu_usage is called without locking the runqueues. This is the
+ * same behavior used by find_busiest_cpu in load_balance. We are
+ * willing to accept occasionally stale data here in exchange for
+ * lockless behavior.
+ */
+ for_each_cpu(cpu, policy->cpus) {
+ usage = get_cpu_usage(cpu);
+ if (usage > max_usage)
+ max_usage = usage;
+ }
+
+ /* add margin to max_usage based on imbalance_pct */
+ max_usage = max_usage * MARGIN_PCT / 100;
+
+ cpu = cpumask_first(policy->cpus);
+
+ /* freq is current utilization + 25% */
+ return max_usage * policy->max / capacity_orig_of(cpu);
+}
+
+/*
+ * we pass in struct cpufreq_policy. This is safe because changing out the
+ * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
+ * which tears down all of the data structures and __cpufreq_governor(policy,
+ * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
+ * new policy pointer
+ */
+static int cpufreq_cfs_thread(void *data)
+{
+ struct sched_param param;
+ struct cpufreq_policy *policy;
+ struct gov_data *gd;
+ unsigned long freq;
+ int ret;
+
+ policy = (struct cpufreq_policy *) data;
+ if (!policy) {
+ pr_warn("%s: missing policy\n", __func__);
+ do_exit(-EINVAL);
+ }
+
+ gd = policy->governor_data;
+ if (!gd) {
+ pr_warn("%s: missing governor data\n", __func__);
+ do_exit(-EINVAL);
+ }
+
+ param.sched_priority = 50;
+ ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);
+ if (ret) {
+ pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
+ do_exit(-EINVAL);
+ } else {
+ pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
+ __func__, gd->task->pid);
+ }
+
+ ret = set_cpus_allowed_ptr(gd->task, policy->related_cpus);
+ if (ret) {
+ pr_warn("%s: failed to set allowed ptr\n", __func__);
+ do_exit(-EINVAL);
+ }
+
+ /* main loop of the per-policy kthread */
+ do {
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule();
+ if (kthread_should_stop())
+ break;
+
+ /* avoid race with cpufreq_cfs_stop */
+ if (!down_write_trylock(&policy->rwsem))
+ continue;
+
+ freq = cpufreq_cfs_select_freq(policy);
+
+ ret = __cpufreq_driver_target(policy, freq,
+ CPUFREQ_RELATION_L);
+ if (ret)
+ pr_debug("%s: __cpufreq_driver_target returned %d\n",
+ __func__, ret);
+
+ gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
+ up_write(&policy->rwsem);
+ } while (!kthread_should_stop());
+
+ do_exit(0);
+}
+
+static void cpufreq_cfs_irq_work(struct irq_work *irq_work)
+{
+ struct gov_data *gd;
+
+ gd = container_of(irq_work, struct gov_data, irq_work);
+ if (!gd) {
+ return;
+ }
+
+ wake_up_process(gd->task);
+}
+
+/**
+ * cpufreq_cfs_update_cpu - interface to scheduler for changing capacity values
+ * @cpu: cpu whose capacity utilization has recently changed
+ *
+ * cpufreq_cfs_update_cpu is an interface exposed to the scheduler so that the
+ * scheduler may inform the governor of updates to capacity utilization and
+ * make changes to cpu frequency. Currently this interface is designed around
+ * PELT values in CFS. It can be expanded to other scheduling classes in the
+ * future if needed.
+ *
+ * cpufreq_cfs_update_cpu raises an IPI. The irq_work handler for that IPI wakes up
+ * the thread that does the actual work, cpufreq_cfs_thread.
+ */
+void cpufreq_cfs_update_cpu(int cpu)
+{
+ struct cpufreq_policy *policy;
+ struct gov_data *gd;
+
+ /* XXX put policy pointer in per-cpu data? */
+ policy = cpufreq_cpu_get(cpu);
+ if (IS_ERR_OR_NULL(policy)) {
+ return;
+ }
+
+ if (!policy->governor_data) {
+ goto out;
+ }
+
+ gd = policy->governor_data;
+
+ /* bail early if we are throttled */
+ if (ktime_before(ktime_get(), gd->throttle)) {
+ goto out;
+ }
+
+ irq_work_queue_on(&gd->irq_work, cpu);
+
+out:
+ cpufreq_cpu_put(policy);
+ return;
+}
+
+static void cpufreq_cfs_start(struct cpufreq_policy *policy)
+{
+ struct gov_data *gd;
+
+ /* prepare per-policy private data */
+ gd = kzalloc(sizeof(*gd), GFP_KERNEL);
+ if (!gd) {
+ pr_debug("%s: failed to allocate private data\n", __func__);
+ return;
+ }
+
+ /*
+ * Don't ask for freq changes at a higher rate than what
+ * the driver advertises as transition latency.
+ */
+ gd->throttle_nsec = policy->cpuinfo.transition_latency ?
+ policy->cpuinfo.transition_latency :
+ THROTTLE_NSEC;
+ pr_debug("%s: throttle threshold = %u [ns]\n",
+ __func__, gd->throttle_nsec);
+
+ /* init per-policy kthread */
+ gd->task = kthread_run(cpufreq_cfs_thread, policy, "kcpufreq_cfs_task");
+ if (IS_ERR_OR_NULL(gd->task))
+ pr_err("%s: failed to create kcpufreq_cfs_task thread\n", __func__);
+
+ init_irq_work(&gd->irq_work, cpufreq_cfs_irq_work);
+ policy->governor_data = gd;
+ gd->policy = policy;
+}
+
+static void cpufreq_cfs_stop(struct cpufreq_policy *policy)
+{
+ struct gov_data *gd;
+
+ gd = policy->governor_data;
+ kthread_stop(gd->task);
+
+ policy->governor_data = NULL;
+
+ /* FIXME replace with devm counterparts? */
+ kfree(gd);
+}
+
+static int cpufreq_cfs_setup(struct cpufreq_policy *policy, unsigned int event)
+{
+ switch (event) {
+ case CPUFREQ_GOV_START:
+ /* Start managing the frequency */
+ cpufreq_cfs_start(policy);
+ return 0;
+
+ case CPUFREQ_GOV_STOP:
+ cpufreq_cfs_stop(policy);
+ return 0;
+
+ case CPUFREQ_GOV_LIMITS: /* unused */
+ case CPUFREQ_GOV_POLICY_INIT: /* unused */
+ case CPUFREQ_GOV_POLICY_EXIT: /* unused */
+ break;
+ }
+ return 0;
+}
+
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_CFS
+static
+#endif
+struct cpufreq_governor cpufreq_cfs = {
+ .name = "cfs",
+ .governor = cpufreq_cfs_setup,
+ .owner = THIS_MODULE,
+};
+
+static int __init cpufreq_cfs_init(void)
+{
+ return cpufreq_register_governor(&cpufreq_cfs);
+}
+
+static void __exit cpufreq_cfs_exit(void)
+{
+ cpufreq_unregister_governor(&cpufreq_cfs);
+}
+
+/* Try to make this the default governor */
+fs_initcall(cpufreq_cfs_init);
+
+MODULE_LICENSE("GPL");
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9e37d49..3cf3024 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4257,6 +4257,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_rq_runnable_avg(rq, rq->nr_running);
add_nr_running(rq, 1);
}
+
+ if (sched_energy_freq())
+ cpufreq_cfs_update_cpu(cpu_of(rq));
+
hrtick_update(rq);
}

@@ -4318,6 +4322,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
sub_nr_running(rq, 1);
update_rq_runnable_avg(rq, 1);
}
+
+ if (sched_energy_freq())
+ cpufreq_cfs_update_cpu(cpu_of(rq));
+
hrtick_update(rq);
}

@@ -7789,6 +7797,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
task_tick_numa(rq, curr);

update_rq_runnable_avg(rq, 1);
+
+ if (sched_energy_freq())
+ cpufreq_cfs_update_cpu(cpu_of(rq));
}

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8bf35d3..457d98b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1429,6 +1429,12 @@ static inline int get_cpu_usage(int cpu)
return (usage * capacity) >> SCHED_LOAD_SHIFT;
}

+#ifdef CONFIG_CPU_FREQ_GOV_CFS
+void cpufreq_cfs_update_cpu(int cpu);
+#else
+static inline void cpufreq_cfs_update_cpu(int cpu) {}
+#endif
+
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
--
1.9.1

2015-05-04 23:02:21

by Rafael J. Wysocki

Subject: Re: [PATCH 0/4] scheduler-based cpu frequency scaling

On Monday, May 04, 2015 03:10:37 PM Michael Turquette wrote:
> This series implements an event-driven cpufreq governor that scales cpu
> frequency as a function of cfs runqueue utilization. The intent of this RFC is
> to get some discussion going about how the scheduler can become the policy
> engine for selecting cpu frequency, what limitations exist and what design do
> we want to take to get to a solution.
>
> This work is a different take on the patches I posted in November:
> http://lkml.kernel.org/r/<[email protected]>
>
> This series depends on having frequency-invariant representations for load.
> This requires Vincent's recently merged cpu capacity rework patches, as well as
> a new patch from Morten included here. Morten's patch will likely make an
> appearance in his energy aware scheduling v4 series.
>
> Thanks to Juri Lelli <[email protected]> for contributing to the development
> of the governor.
>
> A git branch with these patches can be pulled from here:
> https://git.linaro.org/people/mike.turquette/linux.git sched-freq
>
> Smoke testing has been done on an OMAP4 Pandaboard and an Exynos 5800
> Chromebook2. Extensive benchmarking and regression testing has not yet been
> done. Before sinking too much time into extensive testing I'd like to get
> feedback on the general design.
>
> Michael Turquette (3):
> sched: sched feature for cpu frequency selection
> sched: export get_cpu_usage & capacity_orig_of
> sched: cpufreq_sched_cfs: PELT-based cpu frequency scaling
>
> Morten Rasmussen (1):
> arm: Frequency invariant scheduler load-tracking support
>
> arch/arm/include/asm/topology.h | 7 +
> arch/arm/kernel/smp.c | 53 ++++++-
> arch/arm/kernel/topology.c | 17 +++
> drivers/cpufreq/Kconfig | 24 ++++
> include/linux/cpufreq.h | 3 +
> kernel/sched/Makefile | 1 +
> kernel/sched/cpufreq_cfs.c | 311 ++++++++++++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 48 +++----
> kernel/sched/features.h | 6 +
> kernel/sched/sched.h | 39 +++++
> 10 files changed, 475 insertions(+), 34 deletions(-)
> create mode 100644 kernel/sched/cpufreq_cfs.c

Can you *please* always CC PM-related patches to linux-pm?


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2015-05-05 09:01:06

by Peter Zijlstra

Subject: Re: [PATCH 4/4] sched: cpufreq_cfs: pelt-based cpu frequency scaling

On Mon, May 04, 2015 at 03:10:41PM -0700, Michael Turquette wrote:
> This policy is implemented using the cpufreq governor interface for two
> main reasons:
>
> 1) re-using the cpufreq machine drivers without using the governor
> interface is hard.
>
> 2) using the cpufreq interface allows us to switch between the
> scheduler-driven policy and legacy cpufreq governors such as ondemand at
> run-time. This is very useful for comparative testing and tuning.

Urgh,. so I don't really like that. It adds a lot of noise to the
system. You do the irq work thing to kick the cpufreq threads which do
their little thing -- and their wakeup will influence the cfs
accounting, which in turn will start the whole thing anew.

I would really prefer you did a whole new system with directly invoked
drivers that avoid the silly dance. Your 'new' ARM systems should be
well capable of that.

You can still do 2 if you create a cpufreq off switch. You can then
either enable the sched one or the legacy cpufreq -- or both if you want
a trainwreck ;-)

As to the drivers, they're mostly fairly small and self contained, it
should not be too hard to hack them up to work without cpufreq.

2015-05-05 09:01:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/4] sched: cpufreq_cfs: pelt-based cpu frequency scaling

On Mon, May 04, 2015 at 03:10:41PM -0700, Michael Turquette wrote:
> For those that are very curious, there were recently two previous
> postings of these patches to the public eas-dev mailing list. Of
> interest in those threads is the discussion around using a utilization
> threshold versus purely matching frequency to capacity utilization
> versus using a margin (as this series does):
>
> https://lists.linaro.org/pipermail/eas-dev/2015-April/000074.html
> https://lists.linaro.org/pipermail/eas-dev/2015-April/000115.html
>

So why wasn't that done here? (and on linux-pm of course)

2015-05-05 09:04:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/4] sched: cpufreq_cfs: pelt-based cpu frequency scaling



*groan* do _NOT_ cross-post with moderated lists, like this eas drivel.

Also, you typoed 'linaro-kernel@..' getting me even more noise for every
reply.

2015-05-05 12:16:01

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH 4/4] sched: cpufreq_cfs: pelt-based cpu frequency scaling

Hi Peter,

thanks a lot for the fast reply! :)

On 05/05/15 10:00, Peter Zijlstra wrote:
> On Mon, May 04, 2015 at 03:10:41PM -0700, Michael Turquette wrote:
>> This policy is implemented using the cpufreq governor interface for two
>> main reasons:
>>
>> 1) re-using the cpufreq machine drivers without using the governor
>> interface is hard.
>>
>> 2) using the cpufreq interface allows us to switch between the
>> scheduler-driven policy and legacy cpufreq governors such as ondemand at
>> run-time. This is very useful for comparative testing and tuning.
>
> Urgh,. so I don't really like that. It adds a lot of noise to the
> system. You do the irq work thing to kick the cpufreq threads which do
> their little thing -- and their wakeup will influence the cfs
> accounting, which in turn will start the whole thing anew.
>

Right, we introduce some overhead, but in the end it should be less
than, or at least comparable to, what we already have today with
ondemand, for example. The idea here is that we should trigger this
kthread wrapper only when it is really needed, and maybe reduce the
things it needs to do. The irq_work mechanism is one way to trigger it
from wherever we want to.

Regarding cfs accounting, the awkward part is that we run these
kthreads as SCHED_FIFO. One reason is that we expect them to run
quickly and we usually want them to run as soon as they are woken up
(possibly preempting the task for which they are adapting the
frequency). Of course we'll have the same accounting problem within RT,
but maybe we could associate some special flag with them and treat them
differently. Or else we could just accept that we need this kind of
small wrapper task, whose behaviour we know, and live with that.

Anyway, I'm currently experimenting with driving this thing a bit
differently from what we have in this patchset, trying to reduce
triggering of the whole machinery to a minimum. Do you think it is
still valuable to give it a look?

> I would really prefer you did a whole new system with directly invoked
> drivers that avoid the silly dance. Your 'new' ARM systems should be
> well capable of that.
>

Right, this thing is maybe not the cleanest solution we could come up
with, and knowing how 'new' ARM systems will work may help us design a
better one, but I'm not sure what we can really do about today's
systems. We of course need to support them (for a few years, I guess)
and, at the same time, we would also like an event-driven solution for
driving OPP selection from the scheduler.

From what I can tell, this "non-cpufreq" new system will probably have
to re-implement what current drivers are doing, and it will still have
to sleep during freq changes (at least on ARM). That will require some
asynchronous way of doing the freq changes, which is what the kthread
solution already provides.

Best,

- Juri

> You can still do 2 if you create a cpufreq off switch. You can then
> either enable the sched one or the legacy cpufreq -- or both if you want
> a trainwreck ;-)
>
> As to the drivers, they're mostly fairly small and self contained, it
> should not be too hard to hack them up to work without cpufreq.
>

2015-05-05 18:29:06

by Mike Turquette

[permalink] [raw]
Subject: Re: [PATCH 0/4] scheduler-based cpu frequency scaling

On Mon, May 4, 2015 at 4:27 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Monday, May 04, 2015 03:10:37 PM Michael Turquette wrote:
>> This series implements an event-driven cpufreq governor that scales cpu
>> frequency as a function of cfs runqueue utilization. The intent of this RFC is
>> to get some discussion going about how the scheduler can become the policy
>> engine for selecting cpu frequency, what limitations exist and what design do
>> we want to take to get to a solution.
>>
>> This work is a different take on the patches I posted in November:
>> http://lkml.kernel.org/r/<[email protected]>
>>
>> This series depends on having frequency-invariant representations for load.
>> This requires Vincent's recently merged cpu capacity rework patches, as well as
>> a new patch from Morten included here. Morten's patch will likely make an
>> appearance in his energy aware scheduling v4 series.
>>
>> Thanks to Juri Lelli <[email protected]> for contributing to the development
>> of the governor.
>>
>> A git branch with these patches can be pulled from here:
>> https://git.linaro.org/people/mike.turquette/linux.git sched-freq
>>
>> Smoke testing has been done on an OMAP4 Pandaboard and an Exynos 5800
>> Chromebook2. Extensive benchmarking and regression testing has not yet been
>> done. Before sinking too much time into extensive testing I'd like to get
>> feedback on the general design.
>>
>> Michael Turquette (3):
>> sched: sched feature for cpu frequency selection
>> sched: export get_cpu_usage & capacity_orig_of
>> sched: cpufreq_sched_cfs: PELT-based cpu frequency scaling
>>
>> Morten Rasmussen (1):
>> arm: Frequency invariant scheduler load-tracking support
>>
>> arch/arm/include/asm/topology.h | 7 +
>> arch/arm/kernel/smp.c | 53 ++++++-
>> arch/arm/kernel/topology.c | 17 +++
>> drivers/cpufreq/Kconfig | 24 ++++
>> include/linux/cpufreq.h | 3 +
>> kernel/sched/Makefile | 1 +
>> kernel/sched/cpufreq_cfs.c | 311 ++++++++++++++++++++++++++++++++++++++++
>> kernel/sched/fair.c | 48 +++----
>> kernel/sched/features.h | 6 +
>> kernel/sched/sched.h | 39 +++++
>> 10 files changed, 475 insertions(+), 34 deletions(-)
>> create mode 100644 kernel/sched/cpufreq_cfs.c
>
> Can you *please* always CC PM-related patches to linux-pm?

Will do. Apologies for the oversight.

Regards,
Mike

>
>
> --
> I speak only for myself.
> Rafael J. Wysocki, Intel Open Source Technology Center.

2015-05-06 12:23:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/4] sched: cpufreq_cfs: pelt-based cpu frequency scaling

On Tue, May 05, 2015 at 11:23:47AM -0700, Michael Turquette wrote:
> Quoting Peter Zijlstra (2015-05-05 02:00:42)
> > On Mon, May 04, 2015 at 03:10:41PM -0700, Michael Turquette wrote:
> > > This policy is implemented using the cpufreq governor interface for two
> > > main reasons:
> > >
> > > 1) re-using the cpufreq machine drivers without using the governor
> > > interface is hard.
> > >
> > > 2) using the cpufreq interface allows us to switch between the
> > > scheduler-driven policy and legacy cpufreq governors such as ondemand at
> > > run-time. This is very useful for comparative testing and tuning.
> >
> > Urgh,. so I don't really like that. It adds a lot of noise to the
> > system. You do the irq work thing to kick the cpufreq threads which do
> > their little thing -- and their wakeup will influence the cfs
> > accounting, which in turn will start the whole thing anew.
> >
> > I would really prefer you did a whole new system with directly invoked
> > drivers that avoid the silly dance. Your 'new' ARM systems should be
> > well capable of that.
>
> Thanks for the review Peter.

Well, I didn't actually get beyond the Changelog; although I have
rectified this now. A few more comments below.

> We'll need something in process context for the many cpufreq drivers
> that might sleep during their cpu frequency transition, no? This is due
> to calls into the regulator framework, the clock framework and sometimes
> other things such as conversing with a power management IC or voting
> with some system controller.

Yes, we need _something_. I just spoke to a bunch of people on IRC and
it does indeed seem that I was mistaken in my assumption that modern ARM
systems were 'easy' in this regard.

> > As to the drivers, they're mostly fairly small and self contained, it
> > should not be too hard to hack them up to work without cpufreq.
>
> The drivers are not the only thing. I want to leverage the existing
> cpufreq core infrastructure:
>
> * rate change notifiers
> * cpu hotplug awareness
> * methods to fetch frequency tables from firmware (acpi, devicetree)
> * other stuff I can't think of now
>
> So I do not think we should throw out the baby with the bath water. The
> thing that people really don't like about cpufreq are the governors IMO.
> Let's fix that by creating a list of requirements that we really want
> for scheduler-driven cpu frequency selection:
>
> 0) do something smart with scheduler statistics to create an improved
> frequency selection algorithm over existing cpufreq governors
>
> 1) support upcoming and legacy hardware, within reason
>
> 2) if a system exports a fast, async frequency selection interface to
> Linux, then provide a solution that doesn't do any irq_work or kthread
> stuff. Do it all in the fast path
>
> 3) if a system has a slow, synchronous interface for frequency
> selection, then provide an acceptable solution for kicking this work to
> process context. Limit time in the fast path
>
> The patch set tries to tackle 0, 1 and 3. Would the inclusion of #2 make
> you feel better about supporting "newer" hardware with a fire-and-forget
> frequency selection interface?

I should probably look at doing a x86 support patch to try some of this
out, I'll try and find some time to play with that.

So two comments on the actual code:

1) Ideally these hooks would be called from places where we've just
computed the cpu utilization already. I understand this is currently not
the case and we need to do it in-situ.

That said; it would be good to have the interface such that we pass the
utilization in; this would of course mean we'd have to compute it at the
call sites, this should be fine I think.


2) I dislike how policy->cpus is handled; it feels upside down.
If, per 1), each CPU already computed its utilization and provides it
in the call, we should not have to recompute it or its scale factor
(which, btw, seems done slightly oddly too; I know humans like base 10,
but computers suck at it).

Why not something like:

usage = ((1024 + 256) * usage) >> 10; /* +25% */

old_usage = __this_cpu_read(gd->usage);
__this_cpu_write(gd->usage, usage);

max_usage = 0;
for_each_cpu(cpu, policy->cpus)
	max_usage = max(max_usage, per_cpu(gd->usage, cpu));

if (max_usage < old_usage ||	/* dropped */
    (max_usage == usage && max_usage != old_usage))	/* raised */
	request_change(max_usage);

2015-05-06 16:50:22

by Abel Vesa

[permalink] [raw]
Subject: [PATCH] sched/core: Add empty 'gov_cfs_update_cpu' function definition for NON-SMP systems

If CONFIG_SMP is not defined, the build fails because the definition
of 'gov_cfs_update_cpu' is missing. Add an empty static inline
definition for non-SMP systems.

This patch applies to:
https://git.linaro.org/people/mike.turquette/linux.git sched-freq

Signed-off-by: Abel Vesa <[email protected]>
---
kernel/sched/sched.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec23523..3d0996e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1413,6 +1413,7 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
#else
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
static inline void sched_avg_update(struct rq *rq) { }
+static inline void gov_cfs_update_cpu(int cpu) {}
#endif

extern void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period);
--
1.9.1

2015-05-07 04:18:28

by Mike Turquette

[permalink] [raw]
Subject: Re: [PATCH] sched/core: Add empty 'gov_cfs_update_cpu' function definition for NON-SMP systems

Quoting Abel Vesa (2015-05-06 09:50:40)
> If CONFIG_SMP is not defined the build will fail due to
> function 'gov_cfs_update_cpu' definition missing.
> Added empty static inline definition for NON-SMP systems.
>
> This patch applies to:
> https://git.linaro.org/people/mike.turquette/linux.git sched-freq
>
> Signed-off-by: Abel Vesa <[email protected]>

Thanks Abel. I'll fold this change in.

Regards,
Mike

> ---
> kernel/sched/sched.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ec23523..3d0996e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1413,6 +1413,7 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
> #else
> static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
> static inline void sched_avg_update(struct rq *rq) { }
> +static inline void gov_cfs_update_cpu(int cpu) {}
> #endif
>
> extern void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period);
> --
> 1.9.1
>

2015-05-07 06:24:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/4] sched: cpufreq_cfs: pelt-based cpu frequency scaling

On Wed, May 06, 2015 at 09:17:25PM -0700, Michael Turquette wrote:
> Are you thinking of placing the hook somewhere such as
> update_entity_load_avg?

Nah, I was more thinking of Morten's series where we already compute the
utilization for the energy aware scheduling reasons.

2015-05-07 10:48:27

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH 4/4] sched: cpufreq_cfs: pelt-based cpu frequency scaling

Hi Mike,

On 07/05/15 05:17, Michael Turquette wrote:
> Quoting Peter Zijlstra (2015-05-06 05:22:40)
>> On Tue, May 05, 2015 at 11:23:47AM -0700, Michael Turquette wrote:
>>> Quoting Peter Zijlstra (2015-05-05 02:00:42)
>>>> On Mon, May 04, 2015 at 03:10:41PM -0700, Michael Turquette wrote:
>>>>> This policy is implemented using the cpufreq governor interface for two
>>>>> main reasons:
>>>>>
>>>>> 1) re-using the cpufreq machine drivers without using the governor
>>>>> interface is hard.
>>>>>
>>>>> 2) using the cpufreq interface allows us to switch between the
>>>>> scheduler-driven policy and legacy cpufreq governors such as ondemand at
>>>>> run-time. This is very useful for comparative testing and tuning.
>>>>
>>>> Urgh,. so I don't really like that. It adds a lot of noise to the
>>>> system. You do the irq work thing to kick the cpufreq threads which do
>>>> their little thing -- and their wakeup will influence the cfs
>>>> accounting, which in turn will start the whole thing anew.
>>>>
>>>> I would really prefer you did a whole new system with directly invoked
>>>> drivers that avoid the silly dance. Your 'new' ARM systems should be
>>>> well capable of that.
>>>
>>> Thanks for the review Peter.
>>
>> Well, I didn't actually get beyond the Changelog; although I have
>> rectified this now. A few more comments below.
>
> Thanks for the Real Deal review Peter.
>
>>
>>> We'll need something in process context for the many cpufreq drivers
>>> that might sleep during their cpu frequency transition, no? This is due
>>> to calls into the regulator framework, the clock framework and sometimes
>>> other things such as conversing with a power management IC or voting
>>> with some system controller.
>>
>> Yes, we need _something_. I just spoke to a bunch of people on IRC and
>> it does indeed seem that I was mistaken in my assumption that modern ARM
>> systems were 'easy' in this regard.
>>
>>>> As to the drivers, they're mostly fairly small and self contained, it
>>>> should not be too hard to hack them up to work without cpufreq.
>>>
>>> The drivers are not the only thing. I want to leverage the existing
>>> cpufreq core infrastructure:
>>>
>>> * rate change notifiers
>>> * cpu hotplug awareness
>>> * methods to fetch frequency tables from firmware (acpi, devicetree)
>>> * other stuff I can't think of now
>>>
>>> So I do not think we should throw out the baby with the bath water. The
>>> thing that people really don't like about cpufreq are the governors IMO.
>>> Let's fix that by creating a list of requirements that we really want
>>> for scheduler-driven cpu frequency selection:
>>>
>>> 0) do something smart with scheduler statistics to create an improved
>>> frequency selection algorithm over existing cpufreq governors
>>>
>>> 1) support upcoming and legacy hardware, within reason
>>>
>>> 2) if a system exports a fast, async frequency selection interface to
>>> Linux, then provide a solution that doesn't do any irq_work or kthread
>>> stuff. Do it all in the fast path
>>>
>>> 3) if a system has a slow, synchronous interface for frequency
>>> selection, then provide an acceptable solution for kicking this work to
>>> process context. Limit time in the fast path
>>>
>>> The patch set tries to tackle 0, 1 and 3. Would the inclusion of #2 make
>>> you feel better about supporting "newer" hardware with a fire-and-forget
>>> frequency selection interface?
>>
>> I should probably look at doing a x86 support patch to try some of this
>> out, I'll try and find some time to play with that.
>>
>> So two comments on the actual code:
>>
>> 1) Ideally these hooks would be called from places where we've just
>> computed the cpu utilization already. I understand this is currently not
>> the case and we need to do it in-situ.
>
> Are you thinking of placing the hook somewhere such as
> update_entity_load_avg?
>
> I did not choose that as a call site since I was worried that it would
> be too noisy; we can enter that function multiple times in the same
> pass through the scheduler.
>
> On the other hand if dvfs transitions are cheap for a platform then it
> might not be so bad to call it multiple times. For platforms where dvfs
> transitions are expensive we could store per-cpu new_capacity data
> multiple times and make sure that we only try to program the hardware as
> deferred work later on.
>
>>
>> That said; it would be good to have the interface such that we pass the
>> utilization in; this would of course mean we'd have to compute it at the
>> call sites, this should be fine I think.
>
> I changed the code to do this but didn't have time to test it. Will send
> the patch with these changes tomorrow.
>
>>
>>
>> 2) I dislike how policy->cpus is handled; it feels upside down.
>
> I agree. It feels better to push the utilization from cfs instead of
> pull it from the frequency selection policy.
>
> I chose to do it this way since there isn't an obvious way to pass some
> private data to an irq_work callback, but I've overcome this with some
> per-cpu data internal to the governor.
>

This new interface looks like what I was also proposing in the
discussion we had [1]. Are you going to send out something along these
lines?

I think it would be good to have something like that. As Peter says,
together with Morten's/Dietmar's series, we could try to make wiser
decisions about when to trigger the whole machinery.

Thanks,

- Juri

[1] https://lists.linaro.org/pipermail/eas-dev/2015-April/000129.html

>> If per 1 each CPU already computed its utilization and provides it in
>> the call, we should not have to recompute it and its scale factor (which
>> btw seems done slightly odd too, I know humans like 10 base, but
>> computers suck at it).
>>
>> Why not something like:
>>
>> usage = ((1024 + 256) * usage) >> 10; /* +25% */
>
> This works when the max capacity is always 1024, but that is not always
> the case for some new ARM systems. For cpu's whose max capacity is less
> than 1024 we would still need to normalize against capacity_orig_of().
>
> Doing the SCHED_LOAD_SCALE thing is fine for me now, but at some point
> it will probably need to be changed by Morten/Juri/Dietmar for their EAS
> patch series, which deals with some of these cpu capacity scale issues.
>
>>
>> old_usage = __this_cpu_read(gd->usage);
>> __this_cpu_write(gd->usage, usage);
>>
>> max_usage = 0;
>> for_each_cpu(cpu, policy->cpus)
>> max_usage = max(max_usage, per_cpu(gd->usage, cpu));
>>
>> if (max_usage < old_usage || /* dropped */
>> (max_usage == usage && max_usage != old_usage)) /* raised */
>> request_change(max_usage);
>
> Yes this should work fine. I've pulled it into the patch that I'll
> test and publish tomorrow.
>
> Thanks,
> Mike
>
>>
>>
>

2015-05-07 14:25:26

by Mike Turquette

[permalink] [raw]
Subject: Re: [PATCH 4/4] sched: cpufreq_cfs: pelt-based cpu frequency scaling

Quoting Juri Lelli (2015-05-07 03:49:27)
> Hi Mike,
>
> On 07/05/15 05:17, Michael Turquette wrote:
> > Quoting Peter Zijlstra (2015-05-06 05:22:40)
> >> On Tue, May 05, 2015 at 11:23:47AM -0700, Michael Turquette wrote:
> >>> Quoting Peter Zijlstra (2015-05-05 02:00:42)
> >>>> On Mon, May 04, 2015 at 03:10:41PM -0700, Michael Turquette wrote:
> >>>>> This policy is implemented using the cpufreq governor interface for two
> >>>>> main reasons:
> >>>>>
> >>>>> 1) re-using the cpufreq machine drivers without using the governor
> >>>>> interface is hard.
> >>>>>
> >>>>> 2) using the cpufreq interface allows us to switch between the
> >>>>> scheduler-driven policy and legacy cpufreq governors such as ondemand at
> >>>>> run-time. This is very useful for comparative testing and tuning.
> >>>>
> >>>> Urgh,. so I don't really like that. It adds a lot of noise to the
> >>>> system. You do the irq work thing to kick the cpufreq threads which do
> >>>> their little thing -- and their wakeup will influence the cfs
> >>>> accounting, which in turn will start the whole thing anew.
> >>>>
> >>>> I would really prefer you did a whole new system with directly invoked
> >>>> drivers that avoid the silly dance. Your 'new' ARM systems should be
> >>>> well capable of that.
> >>>
> >>> Thanks for the review Peter.
> >>
> >> Well, I didn't actually get beyond the Changelog; although I have
> >> rectified this now. A few more comments below.
> >
> > Thanks for the Real Deal review Peter.
> >
> >>
> >>> We'll need something in process context for the many cpufreq drivers
> >>> that might sleep during their cpu frequency transition, no? This is due
> >>> to calls into the regulator framework, the clock framework and sometimes
> >>> other things such as conversing with a power management IC or voting
> >>> with some system controller.
> >>
> >> Yes, we need _something_. I just spoke to a bunch of people on IRC and
> >> it does indeed seem that I was mistaken in my assumption that modern ARM
> >> systems were 'easy' in this regard.
> >>
> >>>> As to the drivers, they're mostly fairly small and self contained, it
> >>>> should not be too hard to hack them up to work without cpufreq.
> >>>
> >>> The drivers are not the only thing. I want to leverage the existing
> >>> cpufreq core infrastructure:
> >>>
> >>> * rate change notifiers
> >>> * cpu hotplug awareness
> >>> * methods to fetch frequency tables from firmware (acpi, devicetree)
> >>> * other stuff I can't think of now
> >>>
> >>> So I do not think we should throw out the baby with the bath water. The
> >>> thing that people really don't like about cpufreq are the governors IMO.
> >>> Let's fix that by creating a list of requirements that we really want
> >>> for scheduler-driven cpu frequency selection:
> >>>
> >>> 0) do something smart with scheduler statistics to create an improved
> >>> frequency selection algorithm over existing cpufreq governors
> >>>
> >>> 1) support upcoming and legacy hardware, within reason
> >>>
> >>> 2) if a system exports a fast, async frequency selection interface to
> >>> Linux, then provide a solution that doesn't do any irq_work or kthread
> >>> stuff. Do it all in the fast path
> >>>
> >>> 3) if a system has a slow, synchronous interface for frequency
> >>> selection, then provide an acceptable solution for kicking this work to
> >>> process context. Limit time in the fast path
> >>>
> >>> The patch set tries to tackle 0, 1 and 3. Would the inclusion of #2 make
> >>> you feel better about supporting "newer" hardware with a fire-and-forget
> >>> frequency selection interface?
> >>
> >> I should probably look at doing a x86 support patch to try some of this
> >> out, I'll try and find some time to play with that.
> >>
> >> So two comments on the actual code:
> >>
> >> 1) Ideally these hooks would be called from places where we've just
> >> computed the cpu utilization already. I understand this is currently not
> >> the case and we need to do it in-situ.
> >
> > Are you thinking of placing the hook somewhere such as
> > update_entity_load_avg?
> >
> > I did not choose that as a call site since I was worried that it would
> > be too noisy; we can enter that function multiple times in the same
> > pass through the scheduler.
> >
> > On the other hand if dvfs transitions are cheap for a platform then it
> > might not be so bad to call it multiple times. For platforms where dvfs
> > transitions are expensive we could store per-cpu new_capacity data
> > multiple times and make sure that we only try to program the hardware as
> > deferred work later on.
> >
> >>
> >> That said; it would be good to have the interface such that we pass the
> >> utilization in; this would of course mean we'd have to compute it at the
> >> call sites, this should be fine I think.
> >
> > I changed the code to do this but didn't have time to test it. Will send
> > the patch with these changes tomorrow.
> >
> >>
> >>
> >> 2) I dislike how policy->cpus is handled; it feels upside down.
> >
> > I agree. It feels better to push the utilization from cfs instead of
> > pull it from the frequency selection policy.
> >
> > I chose to do it this way since there isn't an obvious way to pass some
> > private data to an irq_work callback, but I've overcome this with some
> > per-cpu data internal to the governor.
> >
>
> This new interface looks like what I was also proposing in the
> discussion we had [1], are you going to send out something along this
> line?

Yes, it will look something like your patch, but different as well. I'll
send it out shortly.

I'm not reducing how often we kick the whole machinery, but I'm adding
extra points to bail out early if:

1) capacity utilization did not change
2) frequency won't be changed as a result of capacity utilization change

Regards,
Mike

>
> I think it would be good to have something like that. As Peter is
> saying, together with Morten's/Dietmar's series, we could try to do
> wiser decisions on when to trigger the whole machinery.
>
> Thanks,
>
> - Juri
>
> [1] https://lists.linaro.org/pipermail/eas-dev/2015-April/000129.html
>
> >> If per 1 each CPU already computed its utilization and provides it in
> >> the call, we should not have to recompute it and its scale factor (which
> >> btw seems done slightly odd too, I know humans like 10 base, but
> >> computers suck at it).
> >>
> >> Why not something like:
> >>
> >> usage = ((1024 + 256) * usage) >> 10; /* +25% */
> >
> > This works when the max capacity is always 1024, but that is not always
> > the case for some new ARM systems. For cpu's whose max capacity is less
> > than 1024 we would still need to normalize against capacity_orig_of().
> >
> > Doing the SCHED_LOAD_SCALE thing is fine for me now, but at some point
> > it will probably need to be changed by Morten/Juri/Dietmar for their EAS
> > patch series, which deals with some of these cpu capacity scale issues.
> >
> >>
> >> old_usage = __this_cpu_read(gd->usage);
> >> __this_cpu_write(gd->usage, usage);
> >>
> >> max_usage = 0;
> >> for_each_cpu(cpu, policy->cpus)
> >> max_usage = max(max_usage, per_cpu(gd->usage, cpu));
> >>
> >> if (max_usage < old_usage || /* dropped */
> >> (max_usage == usage && max_usage != old_usage)) /* raised */
> >> request_change(max_usage);
> >
> > Yes this should work fine. I've pulled it into the patch that I'll
> > test and publish tomorrow.
> >
> > Thanks,
> > Mike
> >
> >>
> >>
> >
>