2015-05-12 02:14:30

by Mike Turquette

Subject: [PATCH RFC v2 0/4] scheduler-driven cpu frequency selection

This series implements an event-driven cpufreq governor that scales cpu
frequency as a function of cfs runqueue utilization. The intent of this RFC is
to get some discussion going about how the scheduler can become the policy
engine for selecting cpu frequency, what limitations exist, and what design we
want to take to get to a solution.

V2 changes the interface exposed from the governor to cfs. Instead of a
"pull" model, where get_cpu_usage is used to fetch the utilization, that
information is now pushed into the governor. After making this change it becomes
clear that selecting a new capacity target for a cpu can be done entirely
within fair.c without any knowledge of cpufreq or the hardware. I didn't go
that far in this version of the series, but it is something to consider. Such a
change would mean that we do not pass in a utilization value but instead a
capacity target.
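
To make the two variants concrete, a rough sketch of the call site in
fair.c (the first form is what this series does; the cap_target form is
hypothetical):

    /* v2 as posted: fair.c pushes raw cfs utilization into the governor */
    cpufreq_cfs_update_cpu(cpu_of(rq), rq->cfs.utilization_load_avg);

    /*
     * Possible follow-up: fair.c picks the capacity target itself (e.g.
     * utilization plus headroom) and the governor only maps capacity to
     * frequency. Hypothetical, not implemented in this series.
     */
    cpufreq_cfs_update_cpu(cpu_of(rq), cap_target);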

RFC v1 from May 4, 2015:
http://lkml.kernel.org/r/<[email protected]>

Old, original idea from October/November of 2014:
http://lkml.kernel.org/r/<[email protected]>

This series depends on having frequency-invariant representations for load.
This requires Vincent's recently merged cpu capacity rework patches, as well as
a new patch from Morten included here. Morten's patch will likely make an
appearance in his energy aware scheduling v4 series.

Thanks to Juri Lelli <[email protected]> for contributing to the development
of the governor.

A git branch with these patches can be pulled from here:
https://git.linaro.org/people/mike.turquette/linux.git sched-freq

Smoke testing has been done on an OMAP4 Pandaboard and an Exynos 5800
Chromebook2. Extensive benchmarking and regression testing has not yet been
done.

Michael Turquette (3):
sched: sched feature for cpu frequency selection
sched: expose capacity_of in sched.h
sched: cpufreq_cfs: pelt-based cpu frequency scaling

Morten Rasmussen (1):
arm: Frequency invariant scheduler load-tracking support

arch/arm/include/asm/topology.h | 7 +
arch/arm/kernel/smp.c | 53 ++++++-
arch/arm/kernel/topology.c | 17 ++
drivers/cpufreq/Kconfig | 24 +++
include/linux/cpufreq.h | 3 +
kernel/sched/Makefile | 1 +
kernel/sched/cpufreq_cfs.c | 343 ++++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 24 ++-
kernel/sched/features.h | 6 +
kernel/sched/sched.h | 13 ++
10 files changed, 484 insertions(+), 7 deletions(-)
create mode 100644 kernel/sched/cpufreq_cfs.c

--
1.9.1


2015-05-12 02:14:39

by Mike Turquette

Subject: [PATCH RFC v2 1/4] arm: Frequency invariant scheduler load-tracking support

From: Morten Rasmussen <[email protected]>

Implements an arch-specific function to provide the scheduler with a
frequency scaling correction factor for more accurate load-tracking. The
factor is:

(current_freq(cpu) << SCHED_CAPACITY_SHIFT) / max_freq(cpu)

This implementation only provides frequency invariance. No
micro-architecture invariance yet.
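
As a rough, userspace-runnable illustration of the arithmetic (the kHz
values are made up; the real PELT scaling in fair.c is more involved):

    #include <stdio.h>

    #define SCHED_CAPACITY_SHIFT 10

    int main(void)
    {
            unsigned long curr = 1000000, max = 2000000; /* kHz, made up */

            /* the correction factor computed by scale_freq_capacity() */
            unsigned long factor = (curr << SCHED_CAPACITY_SHIFT) / max;

            /* load tracked at half speed accrues at half rate */
            unsigned long delta = 100;

            printf("factor=%lu scaled_delta=%lu\n",
                   factor, delta * factor >> SCHED_CAPACITY_SHIFT);
            /* prints: factor=512 scaled_delta=50 */
            return 0;
    }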

Cc: Russell King <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
Changes in v2:
none

arch/arm/include/asm/topology.h | 7 ++++++
arch/arm/kernel/smp.c | 53 +++++++++++++++++++++++++++++++++++++++--
arch/arm/kernel/topology.c | 17 +++++++++++++
3 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 2fe85ff..4b985dc 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -24,6 +24,13 @@ void init_cpu_topology(void);
void store_cpu_topology(unsigned int cpuid);
const struct cpumask *cpu_coregroup_mask(int cpu);

+#define arch_scale_freq_capacity arm_arch_scale_freq_capacity
+struct sched_domain;
+extern
+unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
+DECLARE_PER_CPU(atomic_long_t, cpu_freq_capacity);
+
#else

static inline void init_cpu_topology(void) { }
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 86ef244..297ce1b 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -672,12 +672,34 @@ static DEFINE_PER_CPU(unsigned long, l_p_j_ref);
static DEFINE_PER_CPU(unsigned long, l_p_j_ref_freq);
static unsigned long global_l_p_j_ref;
static unsigned long global_l_p_j_ref_freq;
+static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
+DEFINE_PER_CPU(atomic_long_t, cpu_freq_capacity);
+
+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling through arch_scale_freq_capacity()
+ * (implemented in topology.c).
+ */
+static inline
+void scale_freq_capacity(int cpu, unsigned long curr, unsigned long max)
+{
+ unsigned long capacity;
+
+ if (!max)
+ return;
+
+ capacity = (curr << SCHED_CAPACITY_SHIFT) / max;
+ atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), capacity);
+}

static int cpufreq_callback(struct notifier_block *nb,
unsigned long val, void *data)
{
struct cpufreq_freqs *freq = data;
int cpu = freq->cpu;
+ unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));

if (freq->flags & CPUFREQ_CONST_LOOPS)
return NOTIFY_OK;
@@ -702,6 +724,9 @@ static int cpufreq_callback(struct notifier_block *nb,
per_cpu(l_p_j_ref_freq, cpu),
freq->new);
}
+
+ scale_freq_capacity(cpu, freq->new, max);
+
return NOTIFY_OK;
}

@@ -709,11 +734,35 @@ static struct notifier_block cpufreq_notifier = {
.notifier_call = cpufreq_callback,
};

+static int cpufreq_policy_callback(struct notifier_block *nb,
+ unsigned long val, void *data)
+{
+ struct cpufreq_policy *policy = data;
+ int i;
+
+ for_each_cpu(i, policy->cpus) {
+ scale_freq_capacity(i, policy->cur, policy->max);
+ atomic_long_set(&per_cpu(cpu_max_freq, i), policy->max);
+ }
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block cpufreq_policy_notifier = {
+ .notifier_call = cpufreq_policy_callback,
+};
+
static int __init register_cpufreq_notifier(void)
{
- return cpufreq_register_notifier(&cpufreq_notifier,
+ int ret;
+
+ ret = cpufreq_register_notifier(&cpufreq_notifier,
CPUFREQ_TRANSITION_NOTIFIER);
+ if (ret)
+ return ret;
+
+ return cpufreq_register_notifier(&cpufreq_policy_notifier,
+ CPUFREQ_POLICY_NOTIFIER);
}
core_initcall(register_cpufreq_notifier);
-
#endif
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 08b7847..9c09e6e 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -169,6 +169,23 @@ static void update_cpu_capacity(unsigned int cpu)
cpu, arch_scale_cpu_capacity(NULL, cpu));
}

+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling (arch_scale_freq_capacity()). The scaling
+ * factor is updated in smp.c
+ */
+unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+{
+ unsigned long curr = atomic_long_read(&per_cpu(cpu_freq_capacity, cpu));
+
+ if (!curr)
+ return SCHED_CAPACITY_SCALE;
+
+ return curr;
+}
+
#else
static inline void parse_dt_topology(void) {}
static inline void update_cpu_capacity(unsigned int cpuid) {}
--
1.9.1

2015-05-12 02:14:44

by Mike Turquette

Subject: [PATCH RFC v2 2/4] sched: sched feature for cpu frequency selection

This patch introduces the SCHED_ENERGY_FREQ sched feature, which is
implemented using jump labels when SCHED_DEBUG is defined. It is
statically set to false when SCHED_DEBUG is not defined and thus
disabled by default.
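
For context, the feature gates the governor hooks added later in this
series, and with SCHED_DEBUG it can be flipped at run time through the
usual sched_features debugfs interface. A sketch:

    /* call sites in fair.c test the feature before poking the governor */
    if (sched_energy_freq())
            cpufreq_cfs_update_cpu(cpu_of(rq), rq->cfs.utilization_load_avg);

    /*
     * With CONFIG_SCHED_DEBUG, toggle at run time:
     *   echo SCHED_ENERGY_FREQ > /sys/kernel/debug/sched_features
     *   echo NO_SCHED_ENERGY_FREQ > /sys/kernel/debug/sched_features
     */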

Signed-off-by: Michael Turquette <[email protected]>
---
Changes in v2:
none

kernel/sched/fair.c | 5 +++++
kernel/sched/features.h | 6 ++++++
2 files changed, 11 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 46855d0..75aec8d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4207,6 +4207,11 @@ static inline void hrtick_update(struct rq *rq)
}
#endif

+static inline bool sched_energy_freq(void)
+{
+ return sched_feat(SCHED_ENERGY_FREQ);
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 91e33cd..77381cf 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -96,3 +96,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
*/
SCHED_FEAT(NUMA_RESIST_LOWER, false)
#endif
+
+/*
+ * Scheduler-driven CPU frequency selection aimed to save energy based on
+ * load tracking
+ */
+SCHED_FEAT(SCHED_ENERGY_FREQ, false)
--
1.9.1

2015-05-12 02:15:18

by Mike Turquette

Subject: [PATCH RFC v2 3/4] sched: expose capacity_of in sched.h

capacity_of is of use to a cpu frequency scaling policy based on cfs
load tracking and cpu capacity utilization metrics. Expose this call in
sched.h so it can be used in such a policy.

Signed-off-by: Michael Turquette <[email protected]>
---
Changes in v2:
Do not expose get_cpu_usage or capacity_orig_of in sched.h
Expose capacity_of instead

kernel/sched/fair.c | 5 -----
kernel/sched/sched.h | 5 +++++
2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 75aec8d..d27ded9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4361,11 +4361,6 @@ static unsigned long target_load(int cpu, int type)
return max(rq->cpu_load[type-1], total);
}

-static unsigned long capacity_of(int cpu)
-{
- return cpu_rq(cpu)->cpu_capacity;
-}
-
static unsigned long capacity_orig_of(int cpu)
{
return cpu_rq(cpu)->cpu_capacity_orig;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e0e1299..4925bc4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1396,6 +1396,11 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
}
#endif

+static inline unsigned long capacity_of(int cpu)
+{
+ return cpu_rq(cpu)->cpu_capacity;
+}
+
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
--
1.9.1

2015-05-12 02:14:47

by Mike Turquette

Subject: [PATCH RFC v2 4/4] sched: cpufreq_cfs: pelt-based cpu frequency scaling

Scheduler-driven cpu frequency selection is desirable as part of the
on-going effort to make the scheduler better aware of energy
consumption. No piece of the Linux kernel has a better view of the
factors that affect a cpu frequency selection policy than the
scheduler[0], and this patch is an attempt to converge on an initial
solution.

This patch implements a cpufreq governor that directly accesses
scheduler statistics, in particular per-runqueue capacity utilization
data from cfs via cfs.utilization_load_avg.

Put plainly, this governor selects the lowest cpu frequency that will
prevent a runqueue from being over-utilized (until we hit the highest
frequency of course). This is accomplished by requesting a frequency
that matches the current capacity utilization, plus a margin.
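
For example, with the 25% headroom used later in this patch the
conversion works out as follows (standalone sketch; the 2 GHz
policy->max is made up):

    #include <stdio.h>

    #define SCHED_CAPACITY_SHIFT 10
    #define SCHED_CAPACITY_SCALE (1UL << SCHED_CAPACITY_SHIFT)

    int main(void)
    {
            unsigned long util = 410;     /* ~40% of SCHED_CAPACITY_SCALE */
            unsigned long max = 2000000;  /* policy->max in kHz, made up */

            /* request utilization plus 25% of full scale as headroom */
            unsigned long capacity = util + (SCHED_CAPACITY_SCALE >> 2);
            unsigned long freq = capacity * max >> SCHED_CAPACITY_SHIFT;

            printf("%lu kHz\n", freq);    /* 1300781 kHz, ~1.3 GHz */
            return 0;
    }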

Unlike the previous posting from 2014[1] this governor implements a
"follow the utilization" method, where utilization is defined as the
frequency-invariant product of cfs.utilization_load_avg and
cpu_capacity_orig.

This governor is event-driven. There is no polling loop to check cpu
idle time nor any other method which is unsynchronized with the
scheduler. The entry points for this policy are in fair.c:
enqueue_task_fair, dequeue_task_fair and task_tick_fair.

This policy is implemented using the cpufreq governor interface for two
main reasons:

1) re-using the cpufreq machine drivers without using the governor
interface is hard.

2) using the cpufreq interface allows us to switch between the
scheduler-driven policy and legacy cpufreq governors such as ondemand at
run-time. This is very useful for comparative testing and tuning.

Finally, it is worth mentioning that this approach neglects all
scheduling classes except for cfs. It is possible to add support for
deadline and other classes here, but I also wonder if a
multi-governor approach would be a more maintainable solution, where the
cpufreq core aggregates the constraints set by multiple governors.
Supporting such an approach in the cpufreq core would also allow for
peripheral devices to place constraints on cpu frequency without having
to hack such behavior in at the governor level.
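
A cpufreq core doing that aggregation could, for instance, take the max
of per-client minimum-frequency requests. A purely hypothetical sketch;
struct freq_request and the policy->requests list do not exist in
cpufreq today:

    /* hypothetical request type */
    struct freq_request {
            struct list_head node;
            unsigned int min_freq;
    };

    static unsigned int aggregate_freq_requests(struct cpufreq_policy *policy)
    {
            struct freq_request *req;
            unsigned int freq = policy->min;

            /* honor the most demanding client, clamped to policy->max */
            list_for_each_entry(req, &policy->requests, node)
                    freq = max(freq, req->min_freq);

            return min(freq, policy->max);
    }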

Thanks to Juri Lelli <[email protected]> for contributing design ideas,
code and test results.

[0] http://article.gmane.org/gmane.linux.kernel/1499836
[1] https://lkml.org/lkml/2014/10/22/22

Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Michael Turquette <[email protected]>
---
Changes in v2:
Folded in Abel's patch to fix builds for non-SMP. Thanks!
Dropped use of get_cpu_usage. Instead pass in
cfs.utilization_load_avg from fair.c
Added two additional conditions to quickly bail from _update_cpu
Return requested capacity from cpufreq_cfs_update_cpu
Handle frequency-table based systems more gooder
Internal data structures and the way data is shared with the
thread are changed considerably

Food for thought: in cpufreq_cfs_update_cpu we could break out
all of the code preceding the call to cpufreq_cpu_get into
fair.c. The interface would change from,
unsigned long cpufreq_cfs_update_cpu(int cpu, unsigned long util);
to,
unsigned long cpufreq_cfs_update_cpu(int cpu, unsigned long cap_target);
This would give fair.c more control over the capacity it wants
to target, and makes the governor interface a bit more flexible
and useful.

drivers/cpufreq/Kconfig | 24 ++++
include/linux/cpufreq.h | 3 +
kernel/sched/Makefile | 1 +
kernel/sched/cpufreq_cfs.c | 343 +++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 14 ++
kernel/sched/sched.h | 8 ++
6 files changed, 393 insertions(+)
create mode 100644 kernel/sched/cpufreq_cfs.c

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index a171fef..83d51b4 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
Be aware that not all cpufreq drivers support the conservative
governor. If unsure have a look at the help section of the
driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_CFS
+ bool "cfs"
+ select CPU_FREQ_GOV_CFS
+ select CPU_FREQ_GOV_PERFORMANCE
+ help
+ Use the CPUfreq governor 'cfs' as default. This scales
+ cpu frequency from the scheduler as per-entity load tracking
+ statistics are updated.
endchoice

config CPU_FREQ_GOV_PERFORMANCE
@@ -183,6 +192,21 @@ config CPU_FREQ_GOV_CONSERVATIVE

If in doubt, say N.

+config CPU_FREQ_GOV_CFS
+ tristate "'cfs' cpufreq governor"
+ depends on CPU_FREQ
+ select CPU_FREQ_GOV_COMMON
+ help
+ 'cfs' - this governor scales cpu frequency from the
+ scheduler as a function of cpu capacity utilization. It does
+ not evaluate utilization on a periodic basis (as ondemand
+ does) but instead is invoked from the completely fair
+ scheduler when updating per-entity load tracking statistics.
+ Latency to respond to changes in load is improved over polling
+ governors due to its event-driven design.
+
+ If in doubt, say N.
+
comment "CPU frequency scaling drivers"

config CPUFREQ_DT
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 2ee4888..62e8152 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -485,6 +485,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
extern struct cpufreq_governor cpufreq_gov_conservative;
#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV)
+extern struct cpufreq_governor cpufreq_gov_cap_gov;
+#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_cap_gov)
#endif

/*********************************************************************
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 46be870..466960d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_CFS) += cpufreq_cfs.o
diff --git a/kernel/sched/cpufreq_cfs.c b/kernel/sched/cpufreq_cfs.c
new file mode 100644
index 0000000..bcb63b6
--- /dev/null
+++ b/kernel/sched/cpufreq_cfs.c
@@ -0,0 +1,343 @@
+/*
+ * Copyright (C) 2015 Michael Turquette <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/percpu.h>
+#include <linux/irq_work.h>
+
+#include "sched.h"
+
+#define MARGIN_PCT 125 /* taken from imbalance_pct = 125 */
+#define THROTTLE_NSEC 50000000 /* 50ms default */
+
+static DEFINE_PER_CPU(unsigned long, pcpu_util);
+static DEFINE_PER_CPU(struct cpufreq_policy *, pcpu_policy);
+
+/**
+ * gov_data - per-policy data internal to the governor
+ * @throttle: next throttling period expiry. Derived from throttle_nsec
+ * @throttle_nsec: throttle period length in nanoseconds
+ * @task: worker thread for dvfs transition that may block/sleep
+ * @irq_work: callback used to wake up worker thread
+ * @freq: new frequency stored in *_cfs_update_cpu and used in *_cfs_thread
+ *
+ * struct gov_data is the per-policy cpufreq_cfs-specific data structure. A
+ * per-policy instance of it is created when the cpufreq_cfs governor receives
+ * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
+ * member of struct cpufreq_policy.
+ *
+ * Readers of this data must call down_read(policy->rwsem). Writers must
+ * call down_write(policy->rwsem).
+ */
+struct gov_data {
+ ktime_t throttle;
+ unsigned int throttle_nsec;
+ struct task_struct *task;
+ struct irq_work irq_work;
+ struct cpufreq_policy *policy;
+ unsigned int freq;
+};
+
+/*
+ * we pass in struct cpufreq_policy. This is safe because changing out the
+ * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
+ * which tears down all of the data structures and __cpufreq_governor(policy,
+ * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
+ * new policy pointer
+ */
+static int cpufreq_cfs_thread(void *data)
+{
+ struct sched_param param;
+ struct cpufreq_policy *policy;
+ struct gov_data *gd;
+ int ret;
+
+ policy = (struct cpufreq_policy *) data;
+ if (!policy) {
+ pr_warn("%s: missing policy\n", __func__);
+ do_exit(-EINVAL);
+ }
+
+ gd = policy->governor_data;
+ if (!gd) {
+ pr_warn("%s: missing governor data\n", __func__);
+ do_exit(-EINVAL);
+ }
+
+ param.sched_priority = 50;
+ ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);
+ if (ret) {
+ pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
+ do_exit(-EINVAL);
+ } else {
+ pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
+ __func__, gd->task->pid);
+ }
+
+ ret = set_cpus_allowed_ptr(gd->task, policy->related_cpus);
+ if (ret) {
+ pr_warn("%s: failed to set allowed ptr\n", __func__);
+ do_exit(-EINVAL);
+ }
+
+ /* main loop of the per-policy kthread */
+ do {
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule();
+ if (kthread_should_stop())
+ break;
+
+ /* avoid race with cpufreq_cfs_stop */
+ if (!down_write_trylock(&policy->rwsem))
+ continue;
+
+ ret = __cpufreq_driver_target(policy, gd->freq,
+ CPUFREQ_RELATION_L);
+ if (ret)
+ pr_debug("%s: __cpufreq_driver_target returned %d\n",
+ __func__, ret);
+
+ gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
+ up_write(&policy->rwsem);
+ } while (!kthread_should_stop());
+
+ do_exit(0);
+}
+
+static void cpufreq_cfs_irq_work(struct irq_work *irq_work)
+{
+ struct gov_data *gd;
+
+ gd = container_of(irq_work, struct gov_data, irq_work);
+ if (!gd) {
+ return;
+ }
+
+ wake_up_process(gd->task);
+}
+
+/**
+ * cpufreq_cfs_update_cpu - interface to scheduler for changing capacity values
+ * @cpu: cpu whose capacity utilization has recently changed
+ *
+ * cpufreq_cfs_update_cpu is an interface exposed to the scheduler so that the
+ * scheduler may inform the governor of updates to capacity utilization and
+ * make changes to cpu frequency. Currently this interface is designed around
+ * PELT values in CFS. It can be expanded to other scheduling classes in the
+ * future if needed.
+ *
+ * cpufreq_cfs_update_cpu raises an IPI. The irq_work handler for that IPI wakes up
+ * the thread that does the actual work, cpufreq_cfs_thread.
+ *
+ * This function bails out early if either condition is true:
+ * 1) this cpu is not the new maximum utilization for its frequency domain
+ * 2) no change in cpu frequency is necessary to meet the new capacity request
+ *
+ * Returns the newly chosen capacity. Note that this may not reflect reality if
+ * the hardware fails to transition to this new capacity state.
+ */
+unsigned long cpufreq_cfs_update_cpu(int cpu, unsigned long util)
+{
+ unsigned long util_new, util_old, util_max, capacity_new;
+ unsigned int freq_new, freq_tmp, cpu_tmp;
+ struct cpufreq_policy *policy;
+ struct gov_data *gd;
+ struct cpufreq_frequency_table *pos;
+
+ /* handle rounding errors */
+ util_new = util > SCHED_LOAD_SCALE ? SCHED_LOAD_SCALE : util;
+
+ /* update per-cpu utilization */
+ util_old = __this_cpu_read(pcpu_util);
+ __this_cpu_write(pcpu_util, util_new);
+
+ /* avoid locking policy for now; accessing .cpus only */
+ policy = per_cpu(pcpu_policy, cpu);
+
+ /* find max utilization of cpus in this policy */
+ util_max = 0;
+ for_each_cpu(cpu_tmp, policy->cpus)
+ util_max = max(util_max, per_cpu(pcpu_util, cpu_tmp));
+
+ /*
+ * We only change frequency if this cpu's utilization represents a new
+ * max. If another cpu has increased its utilization beyond the
+ * previous max then we rely on that cpu to hit this code path and make
+ * the change. IOW, the cpu with the new max utilization is responsible
+ * for setting the new capacity/frequency.
+ *
+ * If this cpu is not the new maximum then bail, returning the current
+ * capacity.
+ */
+ if (util_max > util_new)
+ return capacity_of(cpu);
+
+ /*
+ * We are going to request a new capacity, which might result in a new
+ * cpu frequency. From here on we need to serialize access to the
+ * policy and the governor private data.
+ */
+ policy = cpufreq_cpu_get(cpu);
+ if (IS_ERR_OR_NULL(policy)) {
+ return capacity_of(cpu);
+ }
+
+ capacity_new = capacity_of(cpu);
+ if (!policy->governor_data) {
+ goto out;
+ }
+
+ gd = policy->governor_data;
+
+ /* bail early if we are throttled */
+ if (ktime_before(ktime_get(), gd->throttle)) {
+ goto out;
+ }
+
+ /*
+ * Convert the new maximum capacity utilization into a cpu frequency
+ *
+ * It is possible to convert capacity utilization directly into a
+ * frequency, but that implies that we would be 100% utilized. Instead,
+ * first add a margin (default 25% capacity increase) to the new
+ * capacity request. This provides some head room if load increases.
+ */
+ capacity_new = util_new + (SCHED_CAPACITY_SCALE >> 2);
+ freq_new = capacity_new * policy->max >> SCHED_CAPACITY_SHIFT;
+
+ /*
+ * If a frequency table is available then find the frequency
+ * corresponding to freq_new.
+ *
+ * For cpufreq drivers without a frequency table, use the frequency
+ * directly computed from capacity_new + 25% margin.
+ */
+ if (policy->freq_table) {
+ freq_tmp = policy->max;
+ cpufreq_for_each_entry(pos, policy->freq_table) {
+ if (pos->frequency >= freq_new &&
+ pos->frequency < freq_tmp)
+ freq_tmp = pos->frequency;
+ }
+ freq_new = freq_tmp;
+ capacity_new = (freq_new << SCHED_CAPACITY_SHIFT) / policy->max;
+ }
+
+ /* No change in frequency? Bail and return current capacity. */
+ if (freq_new == policy->cur) {
+ capacity_new = capacity_of(cpu);
+ goto out;
+ }
+
+ /* store the new frequency and kick the thread */
+ gd->freq = freq_new;
+
+ /* XXX can we use something like try_to_wake_up_local here instead? */
+ irq_work_queue_on(&gd->irq_work, cpu);
+
+out:
+ cpufreq_cpu_put(policy);
+ return capacity_new;
+}
+
+static void cpufreq_cfs_start(struct cpufreq_policy *policy)
+{
+ struct gov_data *gd;
+ int cpu;
+
+ /* prepare per-policy private data */
+ gd = kzalloc(sizeof(*gd), GFP_KERNEL);
+ if (!gd) {
+ pr_debug("%s: failed to allocate private data\n", __func__);
+ return;
+ }
+
+ /* initialize per-cpu data */
+ for_each_cpu(cpu, policy->cpus) {
+ per_cpu(pcpu_util, cpu) = 0;
+ per_cpu(pcpu_policy, cpu) = policy;
+ }
+
+ /*
+ * Don't ask for freq changes at a higher rate than what
+ * the driver advertises as transition latency.
+ */
+ gd->throttle_nsec = policy->cpuinfo.transition_latency ?
+ policy->cpuinfo.transition_latency :
+ THROTTLE_NSEC;
+ pr_debug("%s: throttle threshold = %u [ns]\n",
+ __func__, gd->throttle_nsec);
+
+ /* init per-policy kthread */
+ gd->task = kthread_run(cpufreq_cfs_thread, policy, "kcpufreq_cfs_task");
+ if (IS_ERR_OR_NULL(gd->task))
+ pr_err("%s: failed to create kcpufreq_cfs_task thread\n", __func__);
+
+ init_irq_work(&gd->irq_work, cpufreq_cfs_irq_work);
+ policy->governor_data = gd;
+ gd->policy = policy;
+}
+
+static void cpufreq_cfs_stop(struct cpufreq_policy *policy)
+{
+ struct gov_data *gd;
+
+ gd = policy->governor_data;
+ kthread_stop(gd->task);
+
+ policy->governor_data = NULL;
+
+ /* FIXME replace with devm counterparts? */
+ kfree(gd);
+}
+
+static int cpufreq_cfs_setup(struct cpufreq_policy *policy, unsigned int event)
+{
+ switch (event) {
+ case CPUFREQ_GOV_START:
+ /* Start managing the frequency */
+ cpufreq_cfs_start(policy);
+ return 0;
+
+ case CPUFREQ_GOV_STOP:
+ cpufreq_cfs_stop(policy);
+ return 0;
+
+ case CPUFREQ_GOV_LIMITS: /* unused */
+ case CPUFREQ_GOV_POLICY_INIT: /* unused */
+ case CPUFREQ_GOV_POLICY_EXIT: /* unused */
+ break;
+ }
+ return 0;
+}
+
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED_CFS
+static
+#endif
+struct cpufreq_governor cpufreq_cfs = {
+ .name = "cfs",
+ .governor = cpufreq_cfs_setup,
+ .owner = THIS_MODULE,
+};
+
+static int __init cpufreq_cfs_init(void)
+{
+ return cpufreq_register_governor(&cpufreq_cfs);
+}
+
+static void __exit cpufreq_cfs_exit(void)
+{
+ cpufreq_unregister_governor(&cpufreq_cfs);
+}
+
+/* Try to make this the default governor */
+fs_initcall(cpufreq_cfs_init);
+
+MODULE_LICENSE("GPL");
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d27ded9..f3c93b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4257,6 +4257,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_rq_runnable_avg(rq, rq->nr_running);
add_nr_running(rq, 1);
}
+
+ if (sched_energy_freq())
+ cpufreq_cfs_update_cpu(cpu_of(rq),
+ rq->cfs.utilization_load_avg);
+
hrtick_update(rq);
}

@@ -4318,6 +4323,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
sub_nr_running(rq, 1);
update_rq_runnable_avg(rq, 1);
}
+
+ if (sched_energy_freq())
+ cpufreq_cfs_update_cpu(cpu_of(rq),
+ rq->cfs.utilization_load_avg);
+
hrtick_update(rq);
}

@@ -7816,6 +7826,10 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
task_tick_numa(rq, curr);

update_rq_runnable_avg(rq, 1);
+
+ if (sched_energy_freq())
+ cpufreq_cfs_update_cpu(cpu_of(rq),
+ rq->cfs.utilization_load_avg);
}

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4925bc4..a8585e1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1401,6 +1401,13 @@ static inline unsigned long capacity_of(int cpu)
return cpu_rq(cpu)->cpu_capacity;
}

+#ifdef CONFIG_CPU_FREQ_GOV_CFS
+unsigned long cpufreq_cfs_update_cpu(int cpu, unsigned long util);
+#else
+static inline unsigned long cpufreq_cfs_update_cpu(int cpu, unsigned long util)
+{ return 0; }
+#endif
+
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
@@ -1409,6 +1416,7 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
#else
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
static inline void sched_avg_update(struct rq *rq) { }
+static inline unsigned long cpufreq_cfs_update_cpu(int cpu, unsigned long util) { return 0; }
#endif

extern void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period);
--
1.9.1

2015-05-13 09:19:44

by Paul Bolle

Subject: Re: [PATCH RFC v2 4/4] sched: cpufreq_cfs: pelt-based cpu frequency scaling

One nit and one question follow.

>--- a/drivers/cpufreq/Kconfig
> +++ b/drivers/cpufreq/Kconfig

> +config CPU_FREQ_DEFAULT_GOV_CFS
> + bool "cfs"
> + select CPU_FREQ_GOV_CFS
> + select CPU_FREQ_GOV_PERFORMANCE
> + help
> + Use the CPUfreq governor 'cfs' as default. This scales
> + cpu frequency from the scheduler as per-entity load tracking
> + statistics are updated.

> +config CPU_FREQ_GOV_CFS
> + tristate "'cfs' cpufreq governor"
> + depends on CPU_FREQ
> + select CPU_FREQ_GOV_COMMON
> + help
> + 'cfs' - this governor scales cpu frequency from the
> + scheduler as a function of cpu capacity utilization. It does
> + not evaluate utilization on a periodic basis (as ondemand
> + does) but instead is invoked from the completely fair
> + scheduler when updating per-entity load tracking statistics.
> + Latency to respond to changes in load is improved over polling
> + governors due to its event-driven design.
> +
> + If in doubt, say N.

> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -485,6 +485,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
> #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
> extern struct cpufreq_governor cpufreq_gov_conservative;
> #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative)
> +#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV)
> +extern struct cpufreq_governor cpufreq_gov_cap_gov;
> +#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_cap_gov)
> #endif

Question: grepping for CPU_FREQ_DEFAULT_GOV_CAP_GOV and
cpufreq_gov_cap_gov didn't give me any hits in next-20150513. And this
series doesn't add them, does it? So where do they come from?

(Since you add CPU_FREQ_DEFAULT_GOV_CFS and cpufreq_cfs in this patch it
seems you intended to add those. I briefly looked into the way
CPUFREQ_DEFAULT_GOVERNOR is set in cpufreq.h when v1 came along, found
it too complicated for me, and promptly forgot the details, so please
double check.)

> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile

> +obj-$(CONFIG_CPU_FREQ_GOV_CFS) += cpufreq_cfs.o

> --- /dev/null
> +++ b/kernel/sched/cpufreq_cfs.c

> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.

This states the license is GPL v2.

> +MODULE_LICENSE("GPL");

And, according to include/linux/module.h, this states the license is GPL
v2 or later. So I think that either the comment at the top of this file
or the ident used in the MODULE_LICENSE() macro needs to be changed.
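
For reference, the MODULE_LICENSE() ident that matches the "version 2"
header comment would be:

    MODULE_LICENSE("GPL v2");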

Thanks,


Paul Bolle

2015-05-18 16:42:19

by Juri Lelli

Subject: Re: [PATCH RFC v2 4/4] sched: cpufreq_cfs: pelt-based cpu frequency scaling

Hi Mike,

On 12/05/15 03:13, Michael Turquette wrote:
> Scheduler-driven cpu frequency selection is desirable as part of the
> on-going effort to make the scheduler better aware of energy
> consumption. No piece of the Linux kernel has a better view of the
> factors that affect a cpu frequency selection policy than the
> scheduler[0], and this patch is an attempt to converge on an initial
> solution.
>
> This patch implements a cpufreq governor that directly accesses
> scheduler statistics, in particular per-runqueue capacity utilization
> data from cfs via cfs.utilization_load_avg.
>
> Put plainly, this governor selects the lowest cpu frequency that will
> prevent a runqueue from being over-utilized (until we hit the highest
> frequency of course). This is accomplished by requesting a frequency
> that matches the current capacity utilization, plus a margin.
>
> Unlike the previous posting from 2014[1] this governor implements a
> "follow the utilization" method, where utilization is defined as the
> frequency-invariant product of cfs.utilization_load_avg and
> cpu_capacity_orig.
>
> This governor is event-driven. There is no polling loop to check cpu
> idle time nor any other method which is unsynchronized with the
> scheduler. The entry points for this policy are in fair.c:
> enqueue_task_fair, dequeue_task_fair and task_tick_fair.
>
> This policy is implemented using the cpufreq governor interface for two
> main reasons:
>
> 1) re-using the cpufreq machine drivers without using the governor
> interface is hard.
>
> 2) using the cpufreq interface allows us to switch between the
> scheduler-driven policy and legacy cpufreq governors such as ondemand at
> run-time. This is very useful for comparative testing and tuning.
>
> Finally, it is worth mentioning that this approach neglects all
> scheduling classes except for cfs. It is possible to add support for
> deadline and other other classes here, but I also wonder if a
> multi-governor approach would be a more maintainable solution, where the
> cpufreq core aggregates the constraints set by multiple governors.
> Supporting such an approach in the cpufreq core would also allow for
> peripheral devices to place constraint on cpu frequency without having
> to hack such behavior in at the governor level.
>
> Thanks to Juri Lelli <[email protected]> for contributing design ideas,
> code and test results.
>
> [0] http://article.gmane.org/gmane.linux.kernel/1499836
> [1] https://lkml.org/lkml/2014/10/22/22
>
> Signed-off-by: Juri Lelli <[email protected]>
> Signed-off-by: Michael Turquette <[email protected]>
> ---
> Changes in v2:
> Folded in Abel's patch to fix builds for non-SMP. Thanks!
> Dropped use of get_cpu_usage. Instead pass in
> cfs.utilization_load_avg from fair.c
> Added two additional conditions to quickly bail from _update_cpu
> Return requested capacity from cpufreq_cfs_update_cpu
> Handle frequency-table based systems more gooder
> Internal data structures and the way data is shared with the
> thread are changed considerably
>
> Food for thought: in cpufreq_cfs_update_cpu we could break out
> all of the code preceeding the call to cpufreq_cpu_get into
> fair.c. The interface would change from,
> unsigned long cpufreq_cfs_update_cpu(int cpu, unsigned long util);
> to,
> unsigned long cpufreq_cfs_update_cpu(int cpu, unsigned long cap_target);
> This would give fair.c more control over the capacity it wants
> to target, and makes the governor interface a bit more flexible
> and useful.
>
> drivers/cpufreq/Kconfig | 24 ++++
> include/linux/cpufreq.h | 3 +
> kernel/sched/Makefile | 1 +
> kernel/sched/cpufreq_cfs.c | 343 +++++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 14 ++
> kernel/sched/sched.h | 8 ++
> 6 files changed, 393 insertions(+)
> create mode 100644 kernel/sched/cpufreq_cfs.c
>
> diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> index a171fef..83d51b4 100644
> --- a/drivers/cpufreq/Kconfig
> +++ b/drivers/cpufreq/Kconfig
> @@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
> Be aware that not all cpufreq drivers support the conservative
> governor. If unsure have a look at the help section of the
> driver. Fallback governor will be the performance governor.
> +
> +config CPU_FREQ_DEFAULT_GOV_CFS
> + bool "cfs"
> + select CPU_FREQ_GOV_CFS
> + select CPU_FREQ_GOV_PERFORMANCE
> + help
> + Use the CPUfreq governor 'cfs' as default. This scales
> + cpu frequency from the scheduler as per-entity load tracking
> + statistics are updated.
> endchoice
>
> config CPU_FREQ_GOV_PERFORMANCE
> @@ -183,6 +192,21 @@ config CPU_FREQ_GOV_CONSERVATIVE
>
> If in doubt, say N.
>
> +config CPU_FREQ_GOV_CFS
> + tristate "'cfs' cpufreq governor"
> + depends on CPU_FREQ
> + select CPU_FREQ_GOV_COMMON
> + help
> + 'cfs' - this governor scales cpu frequency from the
> + scheduler as a function of cpu capacity utilization. It does
> + not evaluate utilization on a periodic basis (as ondemand
> + does) but instead is invoked from the completely fair
> + scheduler when updating per-entity load tracking statistics.
> + Latency to respond to changes in load is improved over polling
> + governors due to its event-driven design.
> +
> + If in doubt, say N.
> +
> comment "CPU frequency scaling drivers"
>
> config CPUFREQ_DT
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index 2ee4888..62e8152 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -485,6 +485,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
> #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
> extern struct cpufreq_governor cpufreq_gov_conservative;
> #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative)
> +#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV)
> +extern struct cpufreq_governor cpufreq_gov_cap_gov;
> +#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_cap_gov)
> #endif
>
> /*********************************************************************
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 46be870..466960d 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
> obj-$(CONFIG_SCHEDSTATS) += stats.o
> obj-$(CONFIG_SCHED_DEBUG) += debug.o
> obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
> +obj-$(CONFIG_CPU_FREQ_GOV_CFS) += cpufreq_cfs.o
> diff --git a/kernel/sched/cpufreq_cfs.c b/kernel/sched/cpufreq_cfs.c
> new file mode 100644
> index 0000000..bcb63b6
> --- /dev/null
> +++ b/kernel/sched/cpufreq_cfs.c
> @@ -0,0 +1,343 @@
> +/*
> + * Copyright (C) 2015 Michael Turquette <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/cpufreq.h>
> +#include <linux/module.h>
> +#include <linux/kthread.h>
> +#include <linux/percpu.h>
> +#include <linux/irq_work.h>
> +
> +#include "sched.h"
> +
> +#define MARGIN_PCT 125 /* taken from imbalance_pct = 125 */

You don't use this anymore, right? But see also my comment below
on this.

> +#define THROTTLE_NSEC 50000000 /* 50ms default */
> +
> +static DEFINE_PER_CPU(unsigned long, pcpu_util);
> +static DEFINE_PER_CPU(struct cpufreq_policy *, pcpu_policy);
> +
> +/**
> + * gov_data - per-policy data internal to the governor
> + * @throttle: next throttling period expiry. Derived from throttle_nsec
> + * @throttle_nsec: throttle period length in nanoseconds
> + * @task: worker thread for dvfs transition that may block/sleep
> + * @irq_work: callback used to wake up worker thread
> + * @freq: new frequency stored in *_cfs_update_cpu and used in *_cfs_thread
> + *
> + * struct gov_data is the per-policy cpufreq_cfs-specific data structure. A
> + * per-policy instance of it is created when the cpufreq_cfs governor receives
> + * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
> + * member of struct cpufreq_policy.
> + *
> + * Readers of this data must call down_read(policy->rwsem). Writers must
> + * call down_write(policy->rwsem).
> + */
> +struct gov_data {
> + ktime_t throttle;
> + unsigned int throttle_nsec;
> + struct task_struct *task;
> + struct irq_work irq_work;
> + struct cpufreq_policy *policy;
> + unsigned int freq;
> +};
> +
> +/*
> + * we pass in struct cpufreq_policy. This is safe because changing out the
> + * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
> + * which tears down all of the data structures and __cpufreq_governor(policy,
> + * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
> + * new policy pointer
> + */
> +static int cpufreq_cfs_thread(void *data)
> +{
> + struct sched_param param;
> + struct cpufreq_policy *policy;
> + struct gov_data *gd;
> + int ret;
> +
> + policy = (struct cpufreq_policy *) data;
> + if (!policy) {
> + pr_warn("%s: missing policy\n", __func__);
> + do_exit(-EINVAL);
> + }
> +
> + gd = policy->governor_data;
> + if (!gd) {
> + pr_warn("%s: missing governor data\n", __func__);
> + do_exit(-EINVAL);
> + }
> +
> + param.sched_priority = 50;
> + ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);
> + if (ret) {
> + pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
> + do_exit(-EINVAL);
> + } else {
> + pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
> + __func__, gd->task->pid);
> + }
> +
> + ret = set_cpus_allowed_ptr(gd->task, policy->related_cpus);
> + if (ret) {
> + pr_warn("%s: failed to set allowed ptr\n", __func__);
> + do_exit(-EINVAL);
> + }
> +
> + /* main loop of the per-policy kthread */
> + do {
> + set_current_state(TASK_INTERRUPTIBLE);
> + schedule();
> + if (kthread_should_stop())
> + break;
> +
> + /* avoid race with cpufreq_cfs_stop */
> + if (!down_write_trylock(&policy->rwsem))
> + continue;
> +
> + ret = __cpufreq_driver_target(policy, gd->freq,
> + CPUFREQ_RELATION_L);
> + if (ret)
> + pr_debug("%s: __cpufreq_driver_target returned %d\n",
> + __func__, ret);
> +
> + gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
> + up_write(&policy->rwsem);
> + } while (!kthread_should_stop());
> +
> + do_exit(0);
> +}
> +
> +static void cpufreq_cfs_irq_work(struct irq_work *irq_work)
> +{
> + struct gov_data *gd;
> +
> + gd = container_of(irq_work, struct gov_data, irq_work);
> + if (!gd) {
> + return;
> + }
> +
> + wake_up_process(gd->task);
> +}
> +
> +/**
> + * cpufreq_cfs_update_cpu - interface to scheduler for changing capacity values
> + * @cpu: cpu whose capacity utilization has recently changed
> + *
> + * cpufreq_cfs_update_cpu is an interface exposed to the scheduler so that the
> + * scheduler may inform the governor of updates to capacity utilization and
> + * make changes to cpu frequency. Currently this interface is designed around
> + * PELT values in CFS. It can be expanded to other scheduling classes in the
> + * future if needed.
> + *
> + * cpufreq_cfs_update_cpu raises an IPI. The irq_work handler for that IPI wakes up
> + * the thread that does the actual work, cpufreq_cfs_thread.
> + *
> + * This functions bails out early if either condition is true:
> + * 1) this cpu is not the new maximum utilization for its frequency domain
> + * 2) no change in cpu frequency is necessary to meet the new capacity request
> + *
> + * Returns the newly chosen capacity. Note that this may not reflect reality if
> + * the hardware fails to transition to this new capacity state.
> + */
> +unsigned long cpufreq_cfs_update_cpu(int cpu, unsigned long util)

Is anybody consuming the return value? Did you have in mind some
possible usage of it?

> +{
> + unsigned long util_new, util_old, util_max, capacity_new;
> + unsigned int freq_new, freq_tmp, cpu_tmp;
> + struct cpufreq_policy *policy;
> + struct gov_data *gd;
> + struct cpufreq_frequency_table *pos;
> +
> + /* handle rounding errors */
> + util_new = util > SCHED_LOAD_SCALE ? SCHED_LOAD_SCALE : util;
> +
> + /* update per-cpu utilization */
> + util_old = __this_cpu_read(pcpu_util);
> + __this_cpu_write(pcpu_util, util_new);
> +
> + /* avoid locking policy for now; accessing .cpus only */
> + policy = per_cpu(pcpu_policy, cpu);
> +
> + /* find max utilization of cpus in this policy */
> + util_max = 0;
> + for_each_cpu(cpu_tmp, policy->cpus)
> + util_max = max(util_max, per_cpu(pcpu_util, cpu_tmp));
> +
> + /*
> + * We only change frequency if this cpu's utilization represents a new
> + * max. If another cpu has increased its utilization beyond the
> + * previous max then we rely on that cpu to hit this code path and make
> + * the change. IOW, the cpu with the new max utilization is responsible
> + * for setting the new capacity/frequency.
> + *
> + * If this cpu is not the new maximum then bail, returning the current
> + * capacity.
> + */
> + if (util_max > util_new)
> + return capacity_of(cpu);

Here and below you probably want to return arch_scale_freq_capacity(NULL, cpu),
as capacity_of() returns the remaining capacity (w.r.t. capacity_orig) for CFS
tasks after RT tasks contribution is removed.

> +
> + /*
> + * We are going to request a new capacity, which might result in a new
> + * cpu frequency. From here on we need to serialize access to the
> + * policy and the governor private data.
> + */
> + policy = cpufreq_cpu_get(cpu);
> + if (IS_ERR_OR_NULL(policy)) {
> + return capacity_of(cpu);
> + }

Shouldn't this be removed now that we have pcpu_policy?
Also the cpufreq_cpu_put() below.

> +
> + capacity_new = capacity_of(cpu);
> + if (!policy->governor_data) {
> + goto out;
> + }
> +
> + gd = policy->governor_data;
> +
> + /* bail early if we are throttled */
> + if (ktime_before(ktime_get(), gd->throttle)) {
> + goto out;
> + }
> +
> + /*
> + * Convert the new maximum capacity utilization into a cpu frequency
> + *
> + * It is possible to convert capacity utilization directly into a
> + * frequency, but that implies that we would be 100% utilized. Instead,
> + * first add a margin (default 25% capacity increase) to the new
> + * capacity request. This provides some head room if load increases.
> + */
> + capacity_new = util_new + (SCHED_CAPACITY_SCALE >> 2);

Here you introduce this 25% margin w.r.t. SCHED_CAPACITY_SCALE.
Shouldn't the margin be related to util_new instead (using MARGIN_PCT
maybe)?
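
Something like the following is what I have in mind, reusing the
otherwise unused MARGIN_PCT (just a sketch):

    capacity_new = util_new * MARGIN_PCT / 100;  /* +25% of util itself */

That way a cpu at 10% utilization would request ~12.5% capacity instead
of 10% plus a quarter of full scale.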

> + freq_new = capacity_new * policy->max >> SCHED_CAPACITY_SHIFT;
> +
> + /*
> + * If a frequency table is available then find the frequency
> + * corresponding to freq_new.
> + *
> + * For cpufreq drivers without a frequency table, use the frequency
> + * directly computed from capacity_new + 25% margin.
> + */
> + if (policy->freq_table) {
> + freq_tmp = policy->max;
> + cpufreq_for_each_entry(pos, policy->freq_table) {
> + if (pos->frequency >= freq_new &&
> + pos->frequency < freq_tmp)
> + freq_tmp = pos->frequency;
> + }
> + freq_new = freq_tmp;
> + capacity_new = (freq_new << SCHED_CAPACITY_SHIFT) / policy->max;
> + }

Do we really need to do this here? Doesn't __cpufreq_driver_target()
do the same for us?

Best,

- Juri

> +
> + /* No change in frequency? Bail and return current capacity. */
> + if (freq_new == policy->cur) {
> + capacity_new = capacity_of(cpu);
> + goto out;
> + }
> +
> + /* store the new frequency and kick the thread */
> + gd->freq = freq_new;
> +
> + /* XXX can we use something like try_to_wake_up_local here instead? */
> + irq_work_queue_on(&gd->irq_work, cpu);
> +
> +out:
> + cpufreq_cpu_put(policy);
> + return capacity_new;
> +}
> +
> +static void cpufreq_cfs_start(struct cpufreq_policy *policy)
> +{
> + struct gov_data *gd;
> + int cpu;
> +
> + /* prepare per-policy private data */
> + gd = kzalloc(sizeof(*gd), GFP_KERNEL);
> + if (!gd) {
> + pr_debug("%s: failed to allocate private data\n", __func__);
> + return;
> + }
> +
> + /* initialize per-cpu data */
> + for_each_cpu(cpu, policy->cpus) {
> + per_cpu(pcpu_util, cpu) = 0;
> + per_cpu(pcpu_policy, cpu) = policy;
> + }
> +
> + /*
> + * Don't ask for freq changes at an higher rate than what
> + * the driver advertises as transition latency.
> + */
> + gd->throttle_nsec = policy->cpuinfo.transition_latency ?
> + policy->cpuinfo.transition_latency :
> + THROTTLE_NSEC;
> + pr_debug("%s: throttle threshold = %u [ns]\n",
> + __func__, gd->throttle_nsec);
> +
> + /* init per-policy kthread */
> + gd->task = kthread_run(cpufreq_cfs_thread, policy, "kcpufreq_cfs_task");
> + if (IS_ERR_OR_NULL(gd->task))
> + pr_err("%s: failed to create kcpufreq_cfs_task thread\n", __func__);
> +
> + init_irq_work(&gd->irq_work, cpufreq_cfs_irq_work);
> + policy->governor_data = gd;
> + gd->policy = policy;
> +}
> +
> +static void cpufreq_cfs_stop(struct cpufreq_policy *policy)
> +{
> + struct gov_data *gd;
> +
> + gd = policy->governor_data;
> + kthread_stop(gd->task);
> +
> + policy->governor_data = NULL;
> +
> + /* FIXME replace with devm counterparts? */
> + kfree(gd);
> +}
> +
> +static int cpufreq_cfs_setup(struct cpufreq_policy *policy, unsigned int event)
> +{
> + switch (event) {
> + case CPUFREQ_GOV_START:
> + /* Start managing the frequency */
> + cpufreq_cfs_start(policy);
> + return 0;
> +
> + case CPUFREQ_GOV_STOP:
> + cpufreq_cfs_stop(policy);
> + return 0;
> +
> + case CPUFREQ_GOV_LIMITS: /* unused */
> + case CPUFREQ_GOV_POLICY_INIT: /* unused */
> + case CPUFREQ_GOV_POLICY_EXIT: /* unused */
> + break;
> + }
> + return 0;
> +}
> +
> +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED_CFS
> +static
> +#endif
> +struct cpufreq_governor cpufreq_cfs = {
> + .name = "cfs",
> + .governor = cpufreq_cfs_setup,
> + .owner = THIS_MODULE,
> +};
> +
> +static int __init cpufreq_cfs_init(void)
> +{
> + return cpufreq_register_governor(&cpufreq_cfs);
> +}
> +
> +static void __exit cpufreq_cfs_exit(void)
> +{
> + cpufreq_unregister_governor(&cpufreq_cfs);
> +}
> +
> +/* Try to make this the default governor */
> +fs_initcall(cpufreq_cfs_init);
> +
> +MODULE_LICENSE("GPL");
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d27ded9..f3c93b9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4257,6 +4257,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> update_rq_runnable_avg(rq, rq->nr_running);
> add_nr_running(rq, 1);
> }
> +
> + if(sched_energy_freq())
> + cpufreq_cfs_update_cpu(cpu_of(rq),
> + rq->cfs.utilization_load_avg);
> +
> hrtick_update(rq);
> }
>
> @@ -4318,6 +4323,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> sub_nr_running(rq, 1);
> update_rq_runnable_avg(rq, 1);
> }
> +
> + if(sched_energy_freq())
> + cpufreq_cfs_update_cpu(cpu_of(rq),
> + rq->cfs.utilization_load_avg);
> +
> hrtick_update(rq);
> }
>
> @@ -7816,6 +7826,10 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> task_tick_numa(rq, curr);
>
> update_rq_runnable_avg(rq, 1);
> +
> + if(sched_energy_freq())
> + cpufreq_cfs_update_cpu(cpu_of(rq),
> + rq->cfs.utilization_load_avg);
> }
>
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 4925bc4..a8585e1 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1401,6 +1401,13 @@ static inline unsigned long capacity_of(int cpu)
> return cpu_rq(cpu)->cpu_capacity;
> }
>
> +#ifdef CONFIG_CPU_FREQ_GOV_CFS
> +unsigned long cpufreq_cfs_update_cpu(int cpu, unsigned long util);
> +#else
> +static inline unsigned long cpufreq_cfs_update_cpu(int cpu, unsigned long util)
> +{ }
> +#endif
> +
> static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
> {
> rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
> @@ -1409,6 +1416,7 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
> #else
> static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
> static inline void sched_avg_update(struct rq *rq) { }
> +static inline void gov_cfs_update_cpu(int cpu) {}
> #endif
>
> extern void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period);
> --
> 1.9.1
>

2015-05-22 23:17:48

by Rafael J. Wysocki

Subject: Re: [PATCH RFC v2 1/4] arm: Frequency invariant scheduler load-tracking support

On Monday, May 11, 2015 07:13:12 PM Michael Turquette wrote:
> From: Morten Rasmussen <[email protected]>
>
> Implements arch-specific function to provide the scheduler with a
> frequency scaling correction factor for more accurate load-tracking. The
> factor is:
>
> current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)
>
> This implementation only provides frequency invariance. No
> micro-architecture invariance yet.
>
> Cc: Russell King <[email protected]>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> Changes in v2:
> none
>
> arch/arm/include/asm/topology.h | 7 ++++++
> arch/arm/kernel/smp.c | 53 +++++++++++++++++++++++++++++++++++++++--
> arch/arm/kernel/topology.c | 17 +++++++++++++
> 3 files changed, 75 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
> index 2fe85ff..4b985dc 100644
> --- a/arch/arm/include/asm/topology.h
> +++ b/arch/arm/include/asm/topology.h
> @@ -24,6 +24,13 @@ void init_cpu_topology(void);
> void store_cpu_topology(unsigned int cpuid);
> const struct cpumask *cpu_coregroup_mask(int cpu);
>
> +#define arch_scale_freq_capacity arm_arch_scale_freq_capacity
> +struct sched_domain;
> +extern
> +unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
> +
> +DECLARE_PER_CPU(atomic_long_t, cpu_freq_capacity);
> +
> #else
>
> static inline void init_cpu_topology(void) { }
> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> index 86ef244..297ce1b 100644
> --- a/arch/arm/kernel/smp.c
> +++ b/arch/arm/kernel/smp.c
> @@ -672,12 +672,34 @@ static DEFINE_PER_CPU(unsigned long, l_p_j_ref);
> static DEFINE_PER_CPU(unsigned long, l_p_j_ref_freq);
> static unsigned long global_l_p_j_ref;
> static unsigned long global_l_p_j_ref_freq;
> +static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
> +DEFINE_PER_CPU(atomic_long_t, cpu_freq_capacity);
> +
> +/*
> + * Scheduler load-tracking scale-invariance
> + *
> + * Provides the scheduler with a scale-invariance correction factor that
> + * compensates for frequency scaling through arch_scale_freq_capacity()
> + * (implemented in topology.c).
> + */
> +static inline
> +void scale_freq_capacity(int cpu, unsigned long curr, unsigned long max)
> +{
> + unsigned long capacity;
> +
> + if (!max)
> + return;
> +
> + capacity = (curr << SCHED_CAPACITY_SHIFT) / max;
> + atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), capacity);
> +}
>
> static int cpufreq_callback(struct notifier_block *nb,
> unsigned long val, void *data)
> {
> struct cpufreq_freqs *freq = data;
> int cpu = freq->cpu;
> + unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
>
> if (freq->flags & CPUFREQ_CONST_LOOPS)
> return NOTIFY_OK;
> @@ -702,6 +724,9 @@ static int cpufreq_callback(struct notifier_block *nb,
> per_cpu(l_p_j_ref_freq, cpu),
> freq->new);
> }
> +
> + scale_freq_capacity(cpu, freq->new, max);
> +
> return NOTIFY_OK;
> }
>
> @@ -709,11 +734,35 @@ static struct notifier_block cpufreq_notifier = {
> .notifier_call = cpufreq_callback,
> };
>
> +static int cpufreq_policy_callback(struct notifier_block *nb,
> + unsigned long val, void *data)
> +{
> + struct cpufreq_policy *policy = data;
> + int i;
> +
> + for_each_cpu(i, policy->cpus) {
> + scale_freq_capacity(i, policy->cur, policy->max);
> + atomic_long_set(&per_cpu(cpu_max_freq, i), policy->max);
> + }
> +
> + return NOTIFY_OK;
> +}
> +
> +static struct notifier_block cpufreq_policy_notifier = {
> + .notifier_call = cpufreq_policy_callback,
> +};
> +
> static int __init register_cpufreq_notifier(void)
> {
> - return cpufreq_register_notifier(&cpufreq_notifier,
> + int ret;
> +
> + ret = cpufreq_register_notifier(&cpufreq_notifier,
> CPUFREQ_TRANSITION_NOTIFIER);
> + if (ret)
> + return ret;
> +
> + return cpufreq_register_notifier(&cpufreq_policy_notifier,
> + CPUFREQ_POLICY_NOTIFIER);
> }
> core_initcall(register_cpufreq_notifier);
> -
> #endif
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index 08b7847..9c09e6e 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -169,6 +169,23 @@ static void update_cpu_capacity(unsigned int cpu)
> cpu, arch_scale_cpu_capacity(NULL, cpu));
> }
>
> +/*
> + * Scheduler load-tracking scale-invariance
> + *
> + * Provides the scheduler with a scale-invariance correction factor that
> + * compensates for frequency scaling (arch_scale_freq_capacity()). The scaling
> + * factor is updated in smp.c
> + */
> +unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> +{
> + unsigned long curr = atomic_long_read(&per_cpu(cpu_freq_capacity, cpu));
> +
> + if (!curr)
> + return SCHED_CAPACITY_SCALE;
> +
> + return curr;
> +}
> +
> #else
> static inline void parse_dt_topology(void) {}
> static inline void update_cpu_capacity(unsigned int cpuid) {}

I'm not sure if I'm reading this correctly, but what exactly is ARM-specific
in this new code?

Could it be made generic, in particular?


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2015-05-22 23:45:11

by Rafael J. Wysocki

Subject: Re: [PATCH RFC v2 4/4] sched: cpufreq_cfs: pelt-based cpu frequency scaling

On Monday, May 11, 2015 07:13:15 PM Michael Turquette wrote:
> Scheduler-driven cpu frequency selection is desirable as part of the
> on-going effort to make the scheduler better aware of energy
> consumption. No piece of the Linux kernel has a better view of the
> factors that affect a cpu frequency selection policy than the
> scheduler[0], and this patch is an attempt to converge on an initial
> solution.
>
> This patch implements a cpufreq governor that directly accesses
> scheduler statistics, in particular per-runqueue capacity utilization
> data from cfs via cfs.utilization_load_avg.
>
> Put plainly, this governor selects the lowest cpu frequency that will
> prevent a runqueue from being over-utilized (until we hit the highest
> frequency of course). This is accomplished by requesting a frequency
> that matches the current capacity utilization, plus a margin.
>
> Unlike the previous posting from 2014[1] this governor implements a
> "follow the utilization" method, where utilization is defined as the
> frequency-invariant product of cfs.utilization_load_avg and
> cpu_capacity_orig.
>
> This governor is event-driven. There is no polling loop to check cpu
> idle time nor any other method which is unsynchronized with the
> scheduler. The entry points for this policy are in fair.c:
> enqueue_task_fair, dequeue_task_fair and task_tick_fair.
>
> This policy is implemented using the cpufreq governor interface for two
> main reasons:
>
> 1) re-using the cpufreq machine drivers without using the governor
> interface is hard.
>
> 2) using the cpufreq interface allows us to switch between the
> scheduler-driven policy and legacy cpufreq governors such as ondemand at
> run-time. This is very useful for comparative testing and tuning.
>
> Finally, it is worth mentioning that this approach neglects all
> scheduling classes except for cfs. It is possible to add support for
> deadline and other classes here, but I also wonder if a
> multi-governor approach would be a more maintainable solution, where the
> cpufreq core aggregates the constraints set by multiple governors.
> Supporting such an approach in the cpufreq core would also allow for
> peripheral devices to place constraints on cpu frequency without having
> to hack such behavior in at the governor level.
>
> Thanks to Juri Lelli <[email protected]> for contributing design ideas,
> code and test results.
>
> [0] http://article.gmane.org/gmane.linux.kernel/1499836
> [1] https://lkml.org/lkml/2014/10/22/22
>
> Signed-off-by: Juri Lelli <[email protected]>
> Signed-off-by: Michael Turquette <[email protected]>
> ---

[cut]

> +/**
> + * cpufreq_cfs_update_cpu - interface to scheduler for changing capacity values
> + * @cpu: cpu whose capacity utilization has recently changed
> + *
> + * cpufreq_cfs_update_cpu is an interface exposed to the scheduler so that the
> + * scheduler may inform the governor of updates to capacity utilization and
> + * make changes to cpu frequency. Currently this interface is designed around
> + * PELT values in CFS. It can be expanded to other scheduling classes in the
> + * future if needed.
> + *
> + * cpufreq_cfs_update_cpu raises an IPI. The irq_work handler for that IPI wakes up
> + * the thread that does the actual work, cpufreq_cfs_thread.
> + *
> + * This function bails out early if either condition is true:
> + * 1) this cpu is not the new maximum utilization for its frequency domain
> + * 2) no change in cpu frequency is necessary to meet the new capacity request
> + *
> + * Returns the newly chosen capacity. Note that this may not reflect reality if
> + * the hardware fails to transition to this new capacity state.
> + */
> +unsigned long cpufreq_cfs_update_cpu(int cpu, unsigned long util)
> +{
> + unsigned long util_new, util_old, util_max, capacity_new;
> + unsigned int freq_new, freq_tmp, cpu_tmp;
> + struct cpufreq_policy *policy;
> + struct gov_data *gd;
> + struct cpufreq_frequency_table *pos;
> +
> + /* handle rounding errors */
> + util_new = util > SCHED_LOAD_SCALE ? SCHED_LOAD_SCALE : util;
> +
> + /* update per-cpu utilization */
> + util_old = __this_cpu_read(pcpu_util);
> + __this_cpu_write(pcpu_util, util_new);
> +
> + /* avoid locking policy for now; accessing .cpus only */
> + policy = per_cpu(pcpu_policy, cpu);
> +
> + /* find max utilization of cpus in this policy */
> + util_max = 0;
> + for_each_cpu(cpu_tmp, policy->cpus)
> + util_max = max(util_max, per_cpu(pcpu_util, cpu_tmp));
> +
> + /*
> + * We only change frequency if this cpu's utilization represents a new
> + * max. If another cpu has increased its utilization beyond the
> + * previous max then we rely on that cpu to hit this code path and make
> + * the change. IOW, the cpu with the new max utilization is responsible
> + * for setting the new capacity/frequency.
> + *
> + * If this cpu is not the new maximum then bail, returning the current
> + * capacity.
> + */
> + if (util_max > util_new)
> + return capacity_of(cpu);
> +
> + /*
> + * We are going to request a new capacity, which might result in a new
> + * cpu frequency. From here on we need to serialize access to the
> + * policy and the governor private data.
> + */
> + policy = cpufreq_cpu_get(cpu);
> + if (IS_ERR_OR_NULL(policy)) {
> + return capacity_of(cpu);
> + }
> +
> + capacity_new = capacity_of(cpu);
> + if (!policy->governor_data) {
> + goto out;
> + }
> +
> + gd = policy->governor_data;
> +
> + /* bail early if we are throttled */
> + if (ktime_before(ktime_get(), gd->throttle)) {
> + goto out;
> + }
> +
> + /*
> + * Convert the new maximum capacity utilization into a cpu frequency
> + *
> + * It is possible to convert capacity utilization directly into a
> + * frequency, but that implies that we would be 100% utilized. Instead,
> + * first add a margin (default 25% capacity increase) to the new
> + * capacity request. This provides some head room if load increases.
> + */
> + capacity_new = util_new + (SCHED_CAPACITY_SCALE >> 2);
> + freq_new = capacity_new * policy->max >> SCHED_CAPACITY_SHIFT;
> +
> + /*
> + * If a frequency table is available then find the frequency
> + * corresponding to freq_new.
> + *
> + * For cpufreq drivers without a frequency table, use the frequency
> + * directly computed from capacity_new + 25% margin.
> + */
> + if (policy->freq_table) {
> + freq_tmp = policy->max;
> + cpufreq_for_each_entry(pos, policy->freq_table) {
> + if (pos->frequency >= freq_new &&
> + pos->frequency < freq_tmp)
> + freq_tmp = pos->frequency;
> + }
> + freq_new = freq_tmp;
> + capacity_new = (freq_new << SCHED_CAPACITY_SHIFT) / policy->max;
> + }
> +
> + /* No change in frequency? Bail and return current capacity. */
> + if (freq_new == policy->cur) {
> + capacity_new = capacity_of(cpu);
> + goto out;
> + }
> +

At this point, if the underlying driver can switch the frequency (P-state in
general) from interrupt context, we might just ask it to do that here instead of
kicking the thread and doing it from there (which may already be too late in some
cases).

So what about adding a new cpufreq driver callback that will be set if the
driver is capable of making P-state changes from interrupt context and using
that here if available?
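Roughly something like this (all names hypothetical, just to show the
shape of it):

	/* in struct cpufreq_driver; optional, set only by drivers that
	 * can switch P-states with interrupts disabled */
	void (*fast_switch)(struct cpufreq_policy *policy,
			    unsigned int target_freq);

and then in the governor, before falling back to the irq_work + kthread
path:

	if (cpufreq_driver_has_fast_switch()) {
		cpufreq_driver_fast_switch(policy, freq_new);
		goto out;
	}

where cpufreq_driver_fast_switch() would be a thin wrapper provided by
the cpufreq core.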

> + /* store the new frequency and kick the thread */
> + gd->freq = freq_new;
> +
> + /* XXX can we use something like try_to_wake_up_local here instead? */
> + irq_work_queue_on(&gd->irq_work, cpu);
> +
> +out:
> + cpufreq_cpu_put(policy);
> + return capacity_new;
> +}


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2015-06-10 06:23:48

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH RFC v2 4/4] sched: cpufreq_cfs: pelt-based cpu frequency scaling



On 05/12/2015 10:13 AM, Michael Turquette wrote:
> This governor is event-driven. There is no polling loop to check cpu
> idle time nor any other method which is unsynchronized with the
> scheduler. The entry points for this policy are in fair.c:
> enqueue_task_fair, dequeue_task_fair and task_tick_fair.
>
> This policy is implemented using the cpufreq governor interface for two
> main reasons:
>
> 1) re-using the cpufreq machine drivers without using the governor
> interface is hard.
>
> 2) using the cpufreq interface allows us to switch between the
> scheduler-driven policy and legacy cpufreq governors such as ondemand at
> run-time. This is very useful for comparative testing and tuning.

Hi, Mike,

Do you have any testing data for your patch?

2015-06-29 16:52:10

by Mike Turquette

[permalink] [raw]
Subject: Re: [PATCH RFC v2 4/4] sched: cpufreq_cfs: pelt-based cpu frequency scaling

Hi Juri,

Quoting Juri Lelli (2015-05-18 09:42:14)
> On 12/05/15 03:13, Michael Turquette wrote:
> > +#define MARGIN_PCT 125 /* taken from imbalance_pct = 125 */
>
> You don't use this anymore, right? But see also my comment below
> on this.

Right. And in the new version I'll move this out to fair.c. I want to
move that sort of policy into cfs and make this governor even more
"dumb".

> > + * Returns the newly chosen capacity. Note that this may not reflect reality if
> > + * the hardware fails to transition to this new capacity state.
> > + */
> > +unsigned long cpufreq_cfs_update_cpu(int cpu, unsigned long util)
>
> Is anybody consuming the return value? Did you have some possible
> usage of it in mind?

I did have in mind that the scheduling class could do something useful
with this value. But this somewhat duplicates the
arch_scale_freq_capacity stuff so I can remove it.

>
> > +{
> > + unsigned long util_new, util_old, util_max, capacity_new;
> > + unsigned int freq_new, freq_tmp, cpu_tmp;
> > + struct cpufreq_policy *policy;
> > + struct gov_data *gd;
> > + struct cpufreq_frequency_table *pos;
> > +
> > + /* handle rounding errors */
> > + util_new = util > SCHED_LOAD_SCALE ? SCHED_LOAD_SCALE : util;
> > +
> > + /* update per-cpu utilization */
> > + util_old = __this_cpu_read(pcpu_util);
> > + __this_cpu_write(pcpu_util, util_new);
> > +
> > + /* avoid locking policy for now; accessing .cpus only */
> > + policy = per_cpu(pcpu_policy, cpu);
> > +
> > + /* find max utilization of cpus in this policy */
> > + util_max = 0;
> > + for_each_cpu(cpu_tmp, policy->cpus)
> > + util_max = max(util_max, per_cpu(pcpu_util, cpu_tmp));
> > +
> > + /*
> > + * We only change frequency if this cpu's utilization represents a new
> > + * max. If another cpu has increased its utilization beyond the
> > + * previous max then we rely on that cpu to hit this code path and make
> > + * the change. IOW, the cpu with the new max utilization is responsible
> > + * for setting the new capacity/frequency.
> > + *
> > + * If this cpu is not the new maximum then bail, returning the current
> > + * capacity.
> > + */
> > + if (util_max > util_new)
> > + return capacity_of(cpu);
>
> Here and below you probably want to return arch_scale_freq_capacity(NULL, cpu),
> as capacity_of() returns the remaining capacity (w.r.t. capacity_orig) for CFS
> tasks after RT tasks contribution is removed.

In fact I wanted to return the capacity that reflects only cfs, but
since I'm going to remove the return value it is moot.

>
> > +
> > + /*
> > + * We are going to request a new capacity, which might result in a new
> > + * cpu frequency. From here on we need to serialize access to the
> > + * policy and the governor private data.
> > + */
> > + policy = cpufreq_cpu_get(cpu);
> > + if (IS_ERR_OR_NULL(policy)) {
> > + return capacity_of(cpu);
> > + }
>
> Shouldn't this be removed now that we have pcpu_policy?
> Also the cpufreq_cpu_put() below.

We must still 'get' the policy in order to call the cpufreq APIs below.
This involves holding locks that are managed for us in
cpufreq_cpu_{get,put}. In fact the pcpu_policy thing is a gross hack to
avoid holding the locks and it is probably unsafe. I've removed it in
the new version.
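In other words the canonical pattern stays:

	policy = cpufreq_cpu_get(cpu);		/* takes a reference */
	if (IS_ERR_OR_NULL(policy))
		return;
	/* policy and policy->governor_data are safe to use here */
	cpufreq_cpu_put(policy);		/* drops the reference */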

>
> > +
> > + capacity_new = capacity_of(cpu);
> > + if (!policy->governor_data) {
> > + goto out;
> > + }
> > +
> > + gd = policy->governor_data;
> > +
> > + /* bail early if we are throttled */
> > + if (ktime_before(ktime_get(), gd->throttle)) {
> > + goto out;
> > + }
> > +
> > + /*
> > + * Convert the new maximum capacity utilization into a cpu frequency
> > + *
> > + * It is possible to convert capacity utilization directly into a
> > + * frequency, but that implies that we would be 100% utilized. Instead,
> > + * first add a margin (default 25% capacity increase) to the new
> > + * capacity request. This provides some head room if load increases.
> > + */
> > + capacity_new = util_new + (SCHED_CAPACITY_SCALE >> 2);
>
> Here you introduce this 25% margin w.r.t. SCHED_CAPACITY_SCALE.
> Shouldn't the margin be related to util_new instead (using MARGIN_PCT
> maybe)?

Maybe?!?! I'm moving this type of policy into fair.c in the new version.
Figuring out this policy is going to be a long road, with lots of
testing and competing requirements. Getting infrastructure merged
upstream is a different task compared to finding the best cpu frequency
scaling policy. Some testing already shows that, for certain workloads,
using the PELT curve the way I use it here results in very slow
frequency transition times.
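To make the difference concrete (assuming SCHED_CAPACITY_SCALE = 1024):
for util_new = 600 the current code requests 600 + 256 = 856, a fixed
quarter-of-scale headroom, while a MARGIN_PCT-style margin would request
600 * 125 / 100 = 750, headroom proportional to the load. Which of the
two behaves better is exactly the kind of question I'd rather answer in
fair.c with real test data.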

Thus I would prefer that this simple governor not be gated on figuring
out all of that stuff first. There are some good reasons for this:

1) it makes it easier for you to use this work with your EAS series if
the governor does not implement any policy of its own

2) it can hopefully get merged by removing the controversial policy
stuff

TL;DR, I'm no longer trying to solve the policy problem in this series.

>
> > + freq_new = capacity_new * policy->max >> SCHED_CAPACITY_SHIFT;
> > +
> > + /*
> > + * If a frequency table is available then find the frequency
> > + * corresponding to freq_new.
> > + *
> > + * For cpufreq drivers without a frequency table, use the frequency
> > + * directly computed from capacity_new + 25% margin.
> > + */
> > + if (policy->freq_table) {
> > + freq_tmp = policy->max;
> > + cpufreq_for_each_entry(pos, policy->freq_table) {
> > + if (pos->frequency >= freq_new &&
> > + pos->frequency < freq_tmp)
> > + freq_tmp = pos->frequency;
> > + }
> > + freq_new = freq_tmp;
> > + capacity_new = (freq_new << SCHED_CAPACITY_SHIFT) / policy->max;
> > + }
>
> Do we really need to do this here? Doesn't __cpufreq_driver_target()
> do the same for us?

Yes it does, but I was trying to get an accurate capacity target to
return to cfs. Since I'm removing that in the next version, I can
remove this as well.
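Concretely, the whole table walk then collapses to just:

	__cpufreq_driver_target(policy, freq_new, CPUFREQ_RELATION_L);

since CPUFREQ_RELATION_L already selects the lowest supported frequency
at or above the target.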

Thanks as always for your review!

Best regards,
Mike