2016-03-22 01:57:21

by Rafael J. Wysocki

Subject: [PATCH v6 0/7] cpufreq: schedutil governor

Hi,

Yet another iteration of the schedutil governor patchset.

This is essentially a resend of the series, but since patches [6-7/7] have been
updated since v4 (http://marc.info/?l=linux-kernel&m=145814047719883&w=4),
the complete series is posted again here.

The patches are based on the current Linus tree and have been present
in my pm-cpufreq-experimental branch for a few days.

Also, Srinivas ran SpecPower on one of the previous iterations and the
results are very promising: with CPU loads below 80%, the system using the
new governor achieves the same performance while consuming much less energy
(up to around 30% less, which translates to around 100 W of power in this
particular test setup).

Again, the question here is whether or not anyone has anything against
queuing this series up for 4.7 early in the cycle (preferably right after
the closing of the 4.6 merge window) in order to provide a convenient base
for further development.

Of course, ACKs are welcome in case of no objections. :-)

Thanks,
Rafael


2016-03-22 01:56:41

by Rafael J. Wysocki

Subject: [PATCH v6 7/7][Resend] cpufreq: schedutil: New governor based on scheduler utilization data

From: Rafael J. Wysocki <[email protected]>

Add a new cpufreq scaling governor, called "schedutil", that uses
scheduler-provided CPU utilization information as input for making
its decisions.

Doing that is possible after commit 34e2c555f3e1 (cpufreq: Add
mechanism for registering utilization update callbacks) that
introduced cpufreq_update_util() called by the scheduler on
utilization changes (from CFS) and RT/DL task status updates.
In particular, CPU frequency scaling decisions may be based on
the utilization data passed to cpufreq_update_util() by CFS.

The new governor is relatively simple.

The frequency selection formula used by it depends on whether or not
the utilization is frequency-invariant. In the frequency-invariant
case the new CPU frequency is given by

next_freq = 1.25 * max_freq * util / max

where util and max are the last two arguments of cpufreq_update_util().
In turn, if util is not frequency-invariant, the maximum frequency in
the above formula is replaced with the current frequency of the CPU:

next_freq = 1.25 * curr_freq * util / max

The coefficient 1.25 corresponds to the frequency tipping point at
(util / max) = 0.8.
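
In integer arithmetic the 1.25 coefficient can be expressed without floating
point as freq + (freq >> 2), which is how get_next_freq() in this patch
computes it. A minimal user-space sketch of the formula (function name and
test values illustrative, not taken from the patch):

```c
#include <assert.h>

/* next_freq = 1.25 * freq * util / max, with 1.25 written as
 * freq + (freq >> 2) to stay in integer arithmetic.  freq is
 * max_freq in the frequency-invariant case and curr_freq
 * otherwise, matching the two formulas above. */
unsigned int example_next_freq(unsigned int freq,
                               unsigned long util, unsigned long max)
{
        return (freq + (freq >> 2)) * util / max;
}
```

At the tipping point util / max = 0.8 this yields exactly freq, so any
utilization at or above 80% maps onto the maximum (or a higher) frequency.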

All of the computations are carried out in the utilization update
handlers provided by the new governor. One of those handlers is
used for cpufreq policies shared between multiple CPUs and the other
one is for single-CPU policies (which therefore needs no extra
synchronization).

The governor supports fast frequency switching if that is supported
by the cpufreq driver in use and possible for the given policy.
In the fast switching case, all operations of the governor take
place in its utilization update handlers. If fast switching cannot
be used, the frequency switch operations are carried out with the
help of a work item which only calls __cpufreq_driver_target()
(under a mutex) to trigger a frequency update (to a value already
computed beforehand in one of the utilization update handlers).

Currently, the governor treats all of the RT and DL tasks as
"unknown utilization" and sets the frequency to the allowed
maximum when updated from the RT or DL sched classes. That
heavy-handed approach should be replaced with something more
subtle and specifically targeted at RT and DL tasks.

The governor shares some tunables management code with the
"ondemand" and "conservative" governors and uses some common
definitions from cpufreq_governor.h, but apart from that it
is stand-alone.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

Changes from v5:
- Fixed sugov_update_commit() to set sg_policy->next_freq properly
in the "work item" branch.
- Used smp_processor_id() in sugov_irq_work() and restored work_in_progress.

Changes from v4:
- Use TICK_NSEC in sugov_next_freq_shared().
- Use schedule_work_on() to schedule work items and replace
work_in_progress with work_cpu (which is used both for scheduling
work items and as a "work in progress" marker).
- Rearrange sugov_update_commit() to only check policy->min/max if
fast switching is enabled.
- Replace util > max checks with util == ULONG_MAX checks to make
it clear that they are about a special case (RT/DL).

Changes from v3:
- The "next frequency" formula based on
http://marc.info/?l=linux-acpi&m=145756618321500&w=4 and
http://marc.info/?l=linux-kernel&m=145760739700716&w=4
- The governor goes into kernel/sched/ (again).

Changes from v2:
- The governor goes into drivers/cpufreq/.
- The "next frequency" formula has an additional 1.1 factor to allow
more util/max values to map onto the top-most frequency in case the
distance between that and the previous one is disproportionately small.
- sugov_update_commit() traces CPU frequency even if the new one is
the same as the previous one (otherwise, if the system is 100% loaded
for long enough, powertop starts to report that all CPUs are 100% idle).

---
drivers/cpufreq/Kconfig | 26 +
kernel/sched/Makefile | 1
kernel/sched/cpufreq_schedutil.c | 528 +++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 8
4 files changed, 563 insertions(+)

Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -107,6 +107,16 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
Be aware that not all cpufreq drivers support the conservative
governor. If unsure have a look at the help section of the
driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+ bool "schedutil"
+ select CPU_FREQ_GOV_SCHEDUTIL
+ select CPU_FREQ_GOV_PERFORMANCE
+ help
+ Use the 'schedutil' CPUFreq governor by default. If unsure,
+ have a look at the help section of that governor. The fallback
+ governor will be 'performance'.
+
endchoice

config CPU_FREQ_GOV_PERFORMANCE
@@ -188,6 +198,22 @@ config CPU_FREQ_GOV_CONSERVATIVE

If in doubt, say N.

+config CPU_FREQ_GOV_SCHEDUTIL
+ tristate "'schedutil' cpufreq policy governor"
+ depends on CPU_FREQ
+ select CPU_FREQ_GOV_ATTR_SET
+ select IRQ_WORK
+ help
+ The frequency selection formula used by this governor is analogous
+ to the one used by 'ondemand', but instead of computing CPU load
+ as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
+ utilization data provided by the scheduler as input.
+
+ To compile this driver as a module, choose M here: the
+ module will be called cpufreq_schedutil.
+
+ If in doubt, say N.
+
comment "CPU frequency scaling drivers"

config CPUFREQ_DT
Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- /dev/null
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -0,0 +1,528 @@
+/*
+ * CPUFreq governor based on scheduler-provided CPU utilization data.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <trace/events/power.h>
+
+#include "sched.h"
+
+struct sugov_tunables {
+ struct gov_attr_set attr_set;
+ unsigned int rate_limit_us;
+};
+
+struct sugov_policy {
+ struct cpufreq_policy *policy;
+
+ struct sugov_tunables *tunables;
+ struct list_head tunables_hook;
+
+ raw_spinlock_t update_lock; /* For shared policies */
+ u64 last_freq_update_time;
+ s64 freq_update_delay_ns;
+ unsigned int next_freq;
+
+ /* The next fields are only needed if fast switch cannot be used. */
+ struct irq_work irq_work;
+ struct work_struct work;
+ struct mutex work_lock;
+ bool work_in_progress;
+
+ bool need_freq_update;
+};
+
+struct sugov_cpu {
+ struct update_util_data update_util;
+ struct sugov_policy *sg_policy;
+
+ /* The fields below are only needed when sharing a policy. */
+ unsigned long util;
+ unsigned long max;
+ u64 last_update;
+};
+
+static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
+
+/************************ Governor internals ***********************/
+
+static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
+{
+ u64 delta_ns;
+
+ if (sg_policy->work_in_progress)
+ return false;
+
+ if (unlikely(sg_policy->need_freq_update)) {
+ sg_policy->need_freq_update = false;
+ return true;
+ }
+
+ delta_ns = time - sg_policy->last_freq_update_time;
+ return (s64)delta_ns >= sg_policy->freq_update_delay_ns;
+}
+
+static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
+ unsigned int next_freq)
+{
+ struct cpufreq_policy *policy = sg_policy->policy;
+
+ sg_policy->last_freq_update_time = time;
+
+ if (policy->fast_switch_enabled) {
+ if (next_freq > policy->max)
+ next_freq = policy->max;
+ else if (next_freq < policy->min)
+ next_freq = policy->min;
+
+ if (sg_policy->next_freq == next_freq) {
+ trace_cpu_frequency(policy->cur, smp_processor_id());
+ return;
+ }
+ sg_policy->next_freq = next_freq;
+ next_freq = cpufreq_driver_fast_switch(policy, next_freq);
+ if (next_freq == CPUFREQ_ENTRY_INVALID)
+ return;
+
+ policy->cur = next_freq;
+ trace_cpu_frequency(next_freq, smp_processor_id());
+ } else if (sg_policy->next_freq != next_freq) {
+ sg_policy->next_freq = next_freq;
+ sg_policy->work_in_progress = true;
+ irq_work_queue(&sg_policy->irq_work);
+ }
+}
+
+/**
+ * get_next_freq - Compute a new frequency for a given cpufreq policy.
+ * @policy: cpufreq policy object to compute the new frequency for.
+ * @util: Current CPU utilization.
+ * @max: CPU capacity.
+ *
+ * If the utilization is frequency-invariant, choose the new frequency to be
+ * proportional to it, that is
+ *
+ * next_freq = C * max_freq * util / max
+ *
+ * Otherwise, approximate the would-be frequency-invariant utilization by
+ * util_raw * (curr_freq / max_freq) which leads to
+ *
+ * next_freq = C * curr_freq * util_raw / max
+ *
+ * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8.
+ */
+static unsigned int get_next_freq(struct cpufreq_policy *policy,
+ unsigned long util, unsigned long max)
+{
+ unsigned int freq = arch_scale_freq_invariant() ?
+ policy->cpuinfo.max_freq : policy->cur;
+
+ return (freq + (freq >> 2)) * util / max;
+}
+
+static void sugov_update_single(struct update_util_data *hook, u64 time,
+ unsigned long util, unsigned long max)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ struct cpufreq_policy *policy = sg_policy->policy;
+ unsigned int next_f;
+
+ if (!sugov_should_update_freq(sg_policy, time))
+ return;
+
+ next_f = util == ULONG_MAX ? policy->cpuinfo.max_freq :
+ get_next_freq(policy, util, max);
+ sugov_update_commit(sg_policy, time, next_f);
+}
+
+static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
+ unsigned long util, unsigned long max)
+{
+ struct cpufreq_policy *policy = sg_policy->policy;
+ unsigned int max_f = policy->cpuinfo.max_freq;
+ u64 last_freq_update_time = sg_policy->last_freq_update_time;
+ unsigned int j;
+
+ if (util == ULONG_MAX)
+ return max_f;
+
+ for_each_cpu(j, policy->cpus) {
+ struct sugov_cpu *j_sg_cpu;
+ unsigned long j_util, j_max;
+ u64 delta_ns;
+
+ if (j == smp_processor_id())
+ continue;
+
+ j_sg_cpu = &per_cpu(sugov_cpu, j);
+ /*
+ * If the CPU utilization was last updated before the previous
+ * frequency update and the time elapsed between the last update
+ * of the CPU utilization and the last frequency update is long
+ * enough, don't take the CPU into account as it probably is
+ * idle now.
+ */
+ delta_ns = last_freq_update_time - j_sg_cpu->last_update;
+ if ((s64)delta_ns > TICK_NSEC)
+ continue;
+
+ j_util = j_sg_cpu->util;
+ if (j_util == ULONG_MAX)
+ return max_f;
+
+ j_max = j_sg_cpu->max;
+ if (j_util * max > j_max * util) {
+ util = j_util;
+ max = j_max;
+ }
+ }
+
+ return get_next_freq(policy, util, max);
+}
+
+static void sugov_update_shared(struct update_util_data *hook, u64 time,
+ unsigned long util, unsigned long max)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ unsigned int next_f;
+
+ raw_spin_lock(&sg_policy->update_lock);
+
+ sg_cpu->util = util;
+ sg_cpu->max = max;
+ sg_cpu->last_update = time;
+
+ if (sugov_should_update_freq(sg_policy, time)) {
+ next_f = sugov_next_freq_shared(sg_policy, util, max);
+ sugov_update_commit(sg_policy, time, next_f);
+ }
+
+ raw_spin_unlock(&sg_policy->update_lock);
+}
+
+static void sugov_work(struct work_struct *work)
+{
+ struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
+
+ mutex_lock(&sg_policy->work_lock);
+ __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
+ CPUFREQ_RELATION_L);
+ mutex_unlock(&sg_policy->work_lock);
+
+ sg_policy->work_in_progress = false;
+}
+
+static void sugov_irq_work(struct irq_work *irq_work)
+{
+ struct sugov_policy *sg_policy;
+
+ sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
+ schedule_work_on(smp_processor_id(), &sg_policy->work);
+}
+
+/************************** sysfs interface ************************/
+
+static struct sugov_tunables *global_tunables;
+static DEFINE_MUTEX(global_tunables_lock);
+
+static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set)
+{
+ return container_of(attr_set, struct sugov_tunables, attr_set);
+}
+
+static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+
+ return sprintf(buf, "%u\n", tunables->rate_limit_us);
+}
+
+static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf,
+ size_t count)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+ struct sugov_policy *sg_policy;
+ unsigned int rate_limit_us;
+ int ret;
+
+ ret = sscanf(buf, "%u", &rate_limit_us);
+ if (ret != 1)
+ return -EINVAL;
+
+ tunables->rate_limit_us = rate_limit_us;
+
+ list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
+ sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
+
+ return count;
+}
+
+static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
+
+static struct attribute *sugov_attributes[] = {
+ &rate_limit_us.attr,
+ NULL
+};
+
+static struct kobj_type sugov_tunables_ktype = {
+ .default_attrs = sugov_attributes,
+ .sysfs_ops = &governor_sysfs_ops,
+};
+
+/********************** cpufreq governor interface *********************/
+
+static struct cpufreq_governor schedutil_gov;
+
+static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy;
+
+ sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
+ if (!sg_policy)
+ return NULL;
+
+ sg_policy->policy = policy;
+ init_irq_work(&sg_policy->irq_work, sugov_irq_work);
+ INIT_WORK(&sg_policy->work, sugov_work);
+ mutex_init(&sg_policy->work_lock);
+ raw_spin_lock_init(&sg_policy->update_lock);
+ return sg_policy;
+}
+
+static void sugov_policy_free(struct sugov_policy *sg_policy)
+{
+ mutex_destroy(&sg_policy->work_lock);
+ kfree(sg_policy);
+}
+
+static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy)
+{
+ struct sugov_tunables *tunables;
+
+ tunables = kzalloc(sizeof(*tunables), GFP_KERNEL);
+ if (tunables)
+ gov_attr_set_init(&tunables->attr_set, &sg_policy->tunables_hook);
+
+ return tunables;
+}
+
+static void sugov_tunables_free(struct sugov_tunables *tunables)
+{
+ if (!have_governor_per_policy())
+ global_tunables = NULL;
+
+ kfree(tunables);
+}
+
+static int sugov_init(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy;
+ struct sugov_tunables *tunables;
+ unsigned int lat;
+ int ret = 0;
+
+ /* State should be equivalent to EXIT */
+ if (policy->governor_data)
+ return -EBUSY;
+
+ sg_policy = sugov_policy_alloc(policy);
+ if (!sg_policy)
+ return -ENOMEM;
+
+ mutex_lock(&global_tunables_lock);
+
+ if (global_tunables) {
+ if (WARN_ON(have_governor_per_policy())) {
+ ret = -EINVAL;
+ goto free_sg_policy;
+ }
+ policy->governor_data = sg_policy;
+ sg_policy->tunables = global_tunables;
+
+ gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook);
+ goto out;
+ }
+
+ tunables = sugov_tunables_alloc(sg_policy);
+ if (!tunables) {
+ ret = -ENOMEM;
+ goto free_sg_policy;
+ }
+
+ tunables->rate_limit_us = LATENCY_MULTIPLIER;
+ lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
+ if (lat)
+ tunables->rate_limit_us *= lat;
+
+ if (!have_governor_per_policy())
+ global_tunables = tunables;
+
+ policy->governor_data = sg_policy;
+ sg_policy->tunables = tunables;
+
+ ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
+ get_governor_parent_kobj(policy), "%s",
+ schedutil_gov.name);
+ if (!ret)
+ goto out;
+
+ /* Failure, so roll back. */
+ policy->governor_data = NULL;
+ sugov_tunables_free(tunables);
+
+ free_sg_policy:
+ pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
+ sugov_policy_free(sg_policy);
+
+ out:
+ mutex_unlock(&global_tunables_lock);
+ return ret;
+}
+
+static int sugov_exit(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ struct sugov_tunables *tunables = sg_policy->tunables;
+ unsigned int count;
+
+ mutex_lock(&global_tunables_lock);
+
+ count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook);
+ policy->governor_data = NULL;
+ if (!count)
+ sugov_tunables_free(tunables);
+
+ mutex_unlock(&global_tunables_lock);
+
+ sugov_policy_free(sg_policy);
+ return 0;
+}
+
+static int sugov_start(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ unsigned int cpu;
+
+ cpufreq_enable_fast_switch(policy);
+
+ sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+ sg_policy->last_freq_update_time = 0;
+ sg_policy->next_freq = UINT_MAX;
+ sg_policy->work_in_progress = false;
+ sg_policy->need_freq_update = false;
+
+ for_each_cpu(cpu, policy->cpus) {
+ struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
+
+ sg_cpu->sg_policy = sg_policy;
+ if (policy_is_shared(policy)) {
+ sg_cpu->util = ULONG_MAX;
+ sg_cpu->max = 0;
+ sg_cpu->last_update = 0;
+ cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
+ sugov_update_shared);
+ } else {
+ cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
+ sugov_update_single);
+ }
+ }
+ return 0;
+}
+
+static int sugov_stop(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ unsigned int cpu;
+
+ for_each_cpu(cpu, policy->cpus)
+ cpufreq_remove_update_util_hook(cpu);
+
+ synchronize_sched();
+
+ irq_work_sync(&sg_policy->irq_work);
+ cancel_work_sync(&sg_policy->work);
+ return 0;
+}
+
+static int sugov_limits(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+
+ if (!policy->fast_switch_enabled) {
+ mutex_lock(&sg_policy->work_lock);
+
+ if (policy->max < policy->cur)
+ __cpufreq_driver_target(policy, policy->max,
+ CPUFREQ_RELATION_H);
+ else if (policy->min > policy->cur)
+ __cpufreq_driver_target(policy, policy->min,
+ CPUFREQ_RELATION_L);
+
+ mutex_unlock(&sg_policy->work_lock);
+ }
+
+ sg_policy->need_freq_update = true;
+ return 0;
+}
+
+int sugov_governor(struct cpufreq_policy *policy, unsigned int event)
+{
+ if (event == CPUFREQ_GOV_POLICY_INIT) {
+ return sugov_init(policy);
+ } else if (policy->governor_data) {
+ switch (event) {
+ case CPUFREQ_GOV_POLICY_EXIT:
+ return sugov_exit(policy);
+ case CPUFREQ_GOV_START:
+ return sugov_start(policy);
+ case CPUFREQ_GOV_STOP:
+ return sugov_stop(policy);
+ case CPUFREQ_GOV_LIMITS:
+ return sugov_limits(policy);
+ }
+ }
+ return -EINVAL;
+}
+
+static struct cpufreq_governor schedutil_gov = {
+ .name = "schedutil",
+ .governor = sugov_governor,
+ .owner = THIS_MODULE,
+};
+
+static int __init sugov_module_init(void)
+{
+ return cpufreq_register_governor(&schedutil_gov);
+}
+
+static void __exit sugov_module_exit(void)
+{
+ cpufreq_unregister_governor(&schedutil_gov);
+}
+
+MODULE_AUTHOR("Rafael J. Wysocki <[email protected]>");
+MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
+MODULE_LICENSE("GPL");
+
+#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+struct cpufreq_governor *cpufreq_default_governor(void)
+{
+ return &schedutil_gov;
+}
+
+fs_initcall(sugov_module_init);
+#else
+module_init(sugov_module_init);
+#endif
+module_exit(sugov_module_exit);
Index: linux-pm/kernel/sched/Makefile
===================================================================
--- linux-pm.orig/kernel/sched/Makefile
+++ linux-pm/kernel/sched/Makefile
@@ -20,3 +20,4 @@ obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
Index: linux-pm/kernel/sched/sched.h
===================================================================
--- linux-pm.orig/kernel/sched/sched.h
+++ linux-pm/kernel/sched/sched.h
@@ -1841,3 +1841,11 @@ static inline void cpufreq_trigger_updat
static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) {}
static inline void cpufreq_trigger_update(u64 time) {}
#endif /* CONFIG_CPU_FREQ */
+
+#ifdef arch_scale_freq_capacity
+#ifndef arch_scale_freq_invariant
+#define arch_scale_freq_invariant() (true)
+#endif
+#else /* arch_scale_freq_capacity */
+#define arch_scale_freq_invariant() (false)
+#endif

2016-03-22 01:56:40

by Rafael J. Wysocki

Subject: [PATCH v6 5/7][Resend] cpufreq: Move governor symbols to cpufreq.h

From: Rafael J. Wysocki <[email protected]>

Move the definitions of symbols related to transition latency and
sampling rate to include/linux/cpufreq.h so they can be used by
(future) governors located outside of drivers/cpufreq/.

No functional changes.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

This patch was new in v4, no changes since then.

---
drivers/cpufreq/cpufreq_governor.h | 14 --------------
include/linux/cpufreq.h | 14 ++++++++++++++
2 files changed, 14 insertions(+), 14 deletions(-)

Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -24,20 +24,6 @@
#include <linux/module.h>
#include <linux/mutex.h>

-/*
- * The polling frequency depends on the capability of the processor. Default
- * polling frequency is 1000 times the transition latency of the processor. The
- * governor will work on any processor with transition latency <= 10ms, using
- * appropriate sampling rate.
- *
- * For CPUs with transition latency > 10ms (mostly drivers with CPUFREQ_ETERNAL)
- * this governor will not work. All times here are in us (micro seconds).
- */
-#define MIN_SAMPLING_RATE_RATIO (2)
-#define LATENCY_MULTIPLIER (1000)
-#define MIN_LATENCY_MULTIPLIER (20)
-#define TRANSITION_LATENCY_LIMIT (10 * 1000 * 1000)
-
/* Ondemand Sampling types */
enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};

Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -426,6 +426,20 @@ static inline unsigned long cpufreq_scal
#define CPUFREQ_POLICY_POWERSAVE (1)
#define CPUFREQ_POLICY_PERFORMANCE (2)

+/*
+ * The polling frequency depends on the capability of the processor. Default
+ * polling frequency is 1000 times the transition latency of the processor. The
+ * ondemand governor will work on any processor with transition latency <= 10ms,
+ * using appropriate sampling rate.
+ *
+ * For CPUs with transition latency > 10ms (mostly drivers with CPUFREQ_ETERNAL)
+ * the ondemand governor will not work. All times here are in us (microseconds).
+ */
+#define MIN_SAMPLING_RATE_RATIO (2)
+#define LATENCY_MULTIPLIER (1000)
+#define MIN_LATENCY_MULTIPLIER (20)
+#define TRANSITION_LATENCY_LIMIT (10 * 1000 * 1000)
+
/* Governor Events */
#define CPUFREQ_GOV_START 1
#define CPUFREQ_GOV_STOP 2
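
With LATENCY_MULTIPLIER now visible outside drivers/cpufreq/, schedutil's
sugov_init() in patch [7/7] uses it to derive the default rate_limit_us.
A user-space sketch of that computation (constant values copied from the
hunk above; NSEC_PER_USEC is 1000):

```c
#include <assert.h>

#define LATENCY_MULTIPLIER (1000)
#define NSEC_PER_USEC 1000UL

/* Default rate limit as computed in sugov_init() in patch [7/7]:
 * LATENCY_MULTIPLIER times the transition latency in microseconds,
 * falling back to LATENCY_MULTIPLIER alone when the driver reports
 * a sub-microsecond (or zero) latency. */
unsigned int default_rate_limit_us(unsigned int transition_latency_ns)
{
        unsigned int rate_limit_us = LATENCY_MULTIPLIER;
        unsigned int lat = transition_latency_ns / NSEC_PER_USEC;

        if (lat)
                rate_limit_us *= lat;
        return rate_limit_us;
}
```

So a driver reporting a 20 us transition latency gets a 20 ms default rate
limit, in line with the "1000 times the transition latency" comment above.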

2016-03-22 01:56:58

by Rafael J. Wysocki

Subject: [PATCH v6 1/7][Resend] cpufreq: sched: Helpers to add and remove update_util hooks

From: Rafael J. Wysocki <[email protected]>

Replace the single helper for adding and removing cpufreq utilization
update hooks, cpufreq_set_update_util_data(), with a pair of helpers,
cpufreq_add_update_util_hook() and cpufreq_remove_update_util_hook(),
and modify the users of cpufreq_set_update_util_data() accordingly.

With the new helpers, the code using them doesn't need to know the
internals of struct update_util_data and, in particular, doesn't
have to populate its func field upfront.
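
The resulting calling convention can be modeled in user space as follows
(a simplified sketch: the kernel's per-CPU pointer becomes a plain array
and the RCU publication and WARN_ON details are omitted; names follow the
patch, but demo_callback/call_update are illustrative additions):

```c
#include <stddef.h>

struct update_util_data {
        void (*func)(struct update_util_data *data, unsigned long long time,
                     unsigned long util, unsigned long max);
};

#define NR_CPUS 4
struct update_util_data *cpufreq_update_util_data[NR_CPUS];

/* The helper, not the caller, fills in data->func; registration is
 * refused if a hook is already installed for the CPU (the kernel
 * version WARN()s in both error cases). */
void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
        void (*func)(struct update_util_data *, unsigned long long,
                     unsigned long, unsigned long))
{
        if (!data || !func || cpufreq_update_util_data[cpu])
                return;

        data->func = func;
        cpufreq_update_util_data[cpu] = data;
}

void cpufreq_remove_update_util_hook(int cpu)
{
        cpufreq_update_util_data[cpu] = NULL;
}

unsigned long last_util;        /* written by the demo callback below */

void demo_callback(struct update_util_data *data, unsigned long long time,
                   unsigned long util, unsigned long max)
{
        (void)data; (void)time; (void)max;
        last_util = util;
}

/* Dispatch as cpufreq_update_util() would: invoke the hook if set. */
void call_update(int cpu, unsigned long long time,
                 unsigned long util, unsigned long max)
{
        struct update_util_data *d = cpufreq_update_util_data[cpu];

        if (d)
                d->func(d, time, util, max);
}
```

This is the shape both intel_pstate and the dbs governors are converted to
in the diff below: register a callback per CPU, then clear the pointer and
synchronize before freeing the containing structure.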

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

No changes since v4 (this patch appeared then).

---
drivers/cpufreq/cpufreq_governor.c | 76 ++++++++++++++++++-------------------
drivers/cpufreq/intel_pstate.c | 8 +--
include/linux/sched.h | 5 +-
kernel/sched/cpufreq.c | 48 ++++++++++++++++++-----
4 files changed, 83 insertions(+), 54 deletions(-)

Index: linux-pm/include/linux/sched.h
===================================================================
--- linux-pm.orig/include/linux/sched.h
+++ linux-pm/include/linux/sched.h
@@ -3218,7 +3218,10 @@ struct update_util_data {
u64 time, unsigned long util, unsigned long max);
};

-void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);
+void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
+ void (*func)(struct update_util_data *data, u64 time,
+ unsigned long util, unsigned long max));
+void cpufreq_remove_update_util_hook(int cpu);
#endif /* CONFIG_CPU_FREQ */

#endif
Index: linux-pm/kernel/sched/cpufreq.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq.c
+++ linux-pm/kernel/sched/cpufreq.c
@@ -14,24 +14,50 @@
DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);

/**
- * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer.
+ * cpufreq_add_update_util_hook - Populate the CPU's update_util_data pointer.
* @cpu: The CPU to set the pointer for.
* @data: New pointer value.
+ * @func: Callback function to set for the CPU.
*
- * Set and publish the update_util_data pointer for the given CPU. That pointer
- * points to a struct update_util_data object containing a callback function
- * to call from cpufreq_update_util(). That function will be called from an RCU
- * read-side critical section, so it must not sleep.
+ * Set and publish the update_util_data pointer for the given CPU.
*
- * Callers must use RCU-sched callbacks to free any memory that might be
- * accessed via the old update_util_data pointer or invoke synchronize_sched()
- * right after this function to avoid use-after-free.
+ * The update_util_data pointer of @cpu is set to @data and the callback
+ * function pointer in the target struct update_util_data is set to @func.
+ * That function will be called by cpufreq_update_util() from RCU-sched
+ * read-side critical sections, so it must not sleep. @data will always be
+ * passed to it as the first argument which allows the function to get to the
+ * target update_util_data structure and its container.
+ *
+ * The update_util_data pointer of @cpu must be NULL when this function is
+ * called or it will WARN() and return with no effect.
*/
-void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
+void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
+ void (*func)(struct update_util_data *data, u64 time,
+ unsigned long util, unsigned long max))
{
- if (WARN_ON(data && !data->func))
+ if (WARN_ON(!data || !func))
return;

+ if (WARN_ON(per_cpu(cpufreq_update_util_data, cpu)))
+ return;
+
+ data->func = func;
rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
}
-EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
+EXPORT_SYMBOL_GPL(cpufreq_add_update_util_hook);
+
+/**
+ * cpufreq_remove_update_util_hook - Clear the CPU's update_util_data pointer.
+ * @cpu: The CPU to clear the pointer for.
+ *
+ * Clear the update_util_data pointer for the given CPU.
+ *
+ * Callers must use RCU-sched callbacks to free any memory that might be
+ * accessed via the old update_util_data pointer or invoke synchronize_sched()
+ * right after this function to avoid use-after-free.
+ */
+void cpufreq_remove_update_util_hook(int cpu)
+{
+ rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), NULL);
+}
+EXPORT_SYMBOL_GPL(cpufreq_remove_update_util_hook);
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -258,43 +258,6 @@ unsigned int dbs_update(struct cpufreq_p
}
EXPORT_SYMBOL_GPL(dbs_update);

-static void gov_set_update_util(struct policy_dbs_info *policy_dbs,
- unsigned int delay_us)
-{
- struct cpufreq_policy *policy = policy_dbs->policy;
- int cpu;
-
- gov_update_sample_delay(policy_dbs, delay_us);
- policy_dbs->last_sample_time = 0;
-
- for_each_cpu(cpu, policy->cpus) {
- struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu);
-
- cpufreq_set_update_util_data(cpu, &cdbs->update_util);
- }
-}
-
-static inline void gov_clear_update_util(struct cpufreq_policy *policy)
-{
- int i;
-
- for_each_cpu(i, policy->cpus)
- cpufreq_set_update_util_data(i, NULL);
-
- synchronize_sched();
-}
-
-static void gov_cancel_work(struct cpufreq_policy *policy)
-{
- struct policy_dbs_info *policy_dbs = policy->governor_data;
-
- gov_clear_update_util(policy_dbs->policy);
- irq_work_sync(&policy_dbs->irq_work);
- cancel_work_sync(&policy_dbs->work);
- atomic_set(&policy_dbs->work_count, 0);
- policy_dbs->work_in_progress = false;
-}
-
static void dbs_work_handler(struct work_struct *work)
{
struct policy_dbs_info *policy_dbs;
@@ -382,6 +345,44 @@ static void dbs_update_util_handler(stru
irq_work_queue(&policy_dbs->irq_work);
}

+static void gov_set_update_util(struct policy_dbs_info *policy_dbs,
+ unsigned int delay_us)
+{
+ struct cpufreq_policy *policy = policy_dbs->policy;
+ int cpu;
+
+ gov_update_sample_delay(policy_dbs, delay_us);
+ policy_dbs->last_sample_time = 0;
+
+ for_each_cpu(cpu, policy->cpus) {
+ struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu);
+
+ cpufreq_add_update_util_hook(cpu, &cdbs->update_util,
+ dbs_update_util_handler);
+ }
+}
+
+static inline void gov_clear_update_util(struct cpufreq_policy *policy)
+{
+ int i;
+
+ for_each_cpu(i, policy->cpus)
+ cpufreq_remove_update_util_hook(i);
+
+ synchronize_sched();
+}
+
+static void gov_cancel_work(struct cpufreq_policy *policy)
+{
+ struct policy_dbs_info *policy_dbs = policy->governor_data;
+
+ gov_clear_update_util(policy_dbs->policy);
+ irq_work_sync(&policy_dbs->irq_work);
+ cancel_work_sync(&policy_dbs->work);
+ atomic_set(&policy_dbs->work_count, 0);
+ policy_dbs->work_in_progress = false;
+}
+
static struct policy_dbs_info *alloc_policy_dbs_info(struct cpufreq_policy *policy,
struct dbs_governor *gov)
{
@@ -404,7 +405,6 @@ static struct policy_dbs_info *alloc_pol
struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);

j_cdbs->policy_dbs = policy_dbs;
- j_cdbs->update_util.func = dbs_update_util_handler;
}
return policy_dbs;
}
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -1102,8 +1102,8 @@ static int intel_pstate_init_cpu(unsigne
intel_pstate_busy_pid_reset(cpu);
intel_pstate_sample(cpu, 0);

- cpu->update_util.func = intel_pstate_update_util;
- cpufreq_set_update_util_data(cpunum, &cpu->update_util);
+ cpufreq_add_update_util_hook(cpunum, &cpu->update_util,
+ intel_pstate_update_util);

pr_debug("intel_pstate: controlling: cpu %d\n", cpunum);

@@ -1187,7 +1187,7 @@ static void intel_pstate_stop_cpu(struct

pr_debug("intel_pstate: CPU %d exiting\n", cpu_num);

- cpufreq_set_update_util_data(cpu_num, NULL);
+ cpufreq_remove_update_util_hook(cpu_num);
synchronize_sched();

if (hwp_active)
@@ -1455,7 +1455,7 @@ out:
get_online_cpus();
for_each_online_cpu(cpu) {
if (all_cpu_data[cpu]) {
- cpufreq_set_update_util_data(cpu, NULL);
+ cpufreq_remove_update_util_hook(cpu);
synchronize_sched();
kfree(all_cpu_data[cpu]);
}

2016-03-22 01:57:38

by Rafael J. Wysocki

Subject: [PATCH v6 2/7][Resend] cpufreq: governor: New data type for management part of dbs_data

From: Rafael J. Wysocki <[email protected]>

In addition to fields representing governor tunables, struct dbs_data
contains some fields needed for the management of objects of that
type. As it turns out, that part of struct dbs_data may be shared
with (future) governors that won't use the common code used by
"ondemand" and "conservative", so move it to a separate struct type
and modify the code using struct dbs_data to follow.

Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---

No changes from previous versions.

---
drivers/cpufreq/cpufreq_conservative.c | 25 +++++----
drivers/cpufreq/cpufreq_governor.c | 90 ++++++++++++++++++++-------------
drivers/cpufreq/cpufreq_governor.h | 35 +++++++-----
drivers/cpufreq/cpufreq_ondemand.c | 29 ++++++----
4 files changed, 107 insertions(+), 72 deletions(-)

Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -41,6 +41,13 @@
/* Ondemand Sampling types */
enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};

+struct gov_attr_set {
+ struct kobject kobj;
+ struct list_head policy_list;
+ struct mutex update_lock;
+ int usage_count;
+};
+
/*
* Abbreviations:
* dbs: used as a shortform for demand based switching It helps to keep variable
@@ -52,7 +59,7 @@ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};

/* Governor demand based switching data (per-policy or global). */
struct dbs_data {
- int usage_count;
+ struct gov_attr_set attr_set;
void *tuners;
unsigned int min_sampling_rate;
unsigned int ignore_nice_load;
@@ -60,37 +67,35 @@ struct dbs_data {
unsigned int sampling_down_factor;
unsigned int up_threshold;
unsigned int io_is_busy;
-
- struct kobject kobj;
- struct list_head policy_dbs_list;
- /*
- * Protect concurrent updates to governor tunables from sysfs,
- * policy_dbs_list and usage_count.
- */
- struct mutex mutex;
};

+static inline struct dbs_data *to_dbs_data(struct gov_attr_set *attr_set)
+{
+ return container_of(attr_set, struct dbs_data, attr_set);
+}
+
/* Governor's specific attributes */
-struct dbs_data;
struct governor_attr {
struct attribute attr;
- ssize_t (*show)(struct dbs_data *dbs_data, char *buf);
- ssize_t (*store)(struct dbs_data *dbs_data, const char *buf,
+ ssize_t (*show)(struct gov_attr_set *attr_set, char *buf);
+ ssize_t (*store)(struct gov_attr_set *attr_set, const char *buf,
size_t count);
};

#define gov_show_one(_gov, file_name) \
static ssize_t show_##file_name \
-(struct dbs_data *dbs_data, char *buf) \
+(struct gov_attr_set *attr_set, char *buf) \
{ \
+ struct dbs_data *dbs_data = to_dbs_data(attr_set); \
struct _gov##_dbs_tuners *tuners = dbs_data->tuners; \
return sprintf(buf, "%u\n", tuners->file_name); \
}

#define gov_show_one_common(file_name) \
static ssize_t show_##file_name \
-(struct dbs_data *dbs_data, char *buf) \
+(struct gov_attr_set *attr_set, char *buf) \
{ \
+ struct dbs_data *dbs_data = to_dbs_data(attr_set); \
return sprintf(buf, "%u\n", dbs_data->file_name); \
}

@@ -184,7 +189,7 @@ void od_register_powersave_bias_handler(
(struct cpufreq_policy *, unsigned int, unsigned int),
unsigned int powersave_bias);
void od_unregister_powersave_bias_handler(void);
-ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf,
+ssize_t store_sampling_rate(struct gov_attr_set *attr_set, const char *buf,
size_t count);
void gov_update_cpu_data(struct dbs_data *dbs_data);
#endif /* _CPUFREQ_GOVERNOR_H */
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -43,9 +43,10 @@ static DEFINE_MUTEX(gov_dbs_data_mutex);
* This must be called with dbs_data->mutex held, otherwise traversing
* policy_dbs_list isn't safe.
*/
-ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf,
+ssize_t store_sampling_rate(struct gov_attr_set *attr_set, const char *buf,
size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct policy_dbs_info *policy_dbs;
unsigned int rate;
int ret;
@@ -59,7 +60,7 @@ ssize_t store_sampling_rate(struct dbs_d
* We are operating under dbs_data->mutex and so the list and its
* entries can't be freed concurrently.
*/
- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) {
+ list_for_each_entry(policy_dbs, &attr_set->policy_list, list) {
mutex_lock(&policy_dbs->timer_mutex);
/*
* On 32-bit architectures this may race with the
@@ -96,7 +97,7 @@ void gov_update_cpu_data(struct dbs_data
{
struct policy_dbs_info *policy_dbs;

- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) {
+ list_for_each_entry(policy_dbs, &dbs_data->attr_set.policy_list, list) {
unsigned int j;

for_each_cpu(j, policy_dbs->policy->cpus) {
@@ -111,9 +112,9 @@ void gov_update_cpu_data(struct dbs_data
}
EXPORT_SYMBOL_GPL(gov_update_cpu_data);

-static inline struct dbs_data *to_dbs_data(struct kobject *kobj)
+static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj)
{
- return container_of(kobj, struct dbs_data, kobj);
+ return container_of(kobj, struct gov_attr_set, kobj);
}

static inline struct governor_attr *to_gov_attr(struct attribute *attr)
@@ -124,25 +125,24 @@ static inline struct governor_attr *to_g
static ssize_t governor_show(struct kobject *kobj, struct attribute *attr,
char *buf)
{
- struct dbs_data *dbs_data = to_dbs_data(kobj);
struct governor_attr *gattr = to_gov_attr(attr);

- return gattr->show(dbs_data, buf);
+ return gattr->show(to_gov_attr_set(kobj), buf);
}

static ssize_t governor_store(struct kobject *kobj, struct attribute *attr,
const char *buf, size_t count)
{
- struct dbs_data *dbs_data = to_dbs_data(kobj);
+ struct gov_attr_set *attr_set = to_gov_attr_set(kobj);
struct governor_attr *gattr = to_gov_attr(attr);
int ret = -EBUSY;

- mutex_lock(&dbs_data->mutex);
+ mutex_lock(&attr_set->update_lock);

- if (dbs_data->usage_count)
- ret = gattr->store(dbs_data, buf, count);
+ if (attr_set->usage_count)
+ ret = gattr->store(attr_set, buf, count);

- mutex_unlock(&dbs_data->mutex);
+ mutex_unlock(&attr_set->update_lock);

return ret;
}
@@ -425,6 +425,41 @@ static void free_policy_dbs_info(struct
gov->free(policy_dbs);
}

+static void gov_attr_set_init(struct gov_attr_set *attr_set,
+ struct list_head *list_node)
+{
+ INIT_LIST_HEAD(&attr_set->policy_list);
+ mutex_init(&attr_set->update_lock);
+ attr_set->usage_count = 1;
+ list_add(list_node, &attr_set->policy_list);
+}
+
+static void gov_attr_set_get(struct gov_attr_set *attr_set,
+ struct list_head *list_node)
+{
+ mutex_lock(&attr_set->update_lock);
+ attr_set->usage_count++;
+ list_add(list_node, &attr_set->policy_list);
+ mutex_unlock(&attr_set->update_lock);
+}
+
+static unsigned int gov_attr_set_put(struct gov_attr_set *attr_set,
+ struct list_head *list_node)
+{
+ unsigned int count;
+
+ mutex_lock(&attr_set->update_lock);
+ list_del(list_node);
+ count = --attr_set->usage_count;
+ mutex_unlock(&attr_set->update_lock);
+ if (count)
+ return count;
+
+ kobject_put(&attr_set->kobj);
+ mutex_destroy(&attr_set->update_lock);
+ return 0;
+}
+
static int cpufreq_governor_init(struct cpufreq_policy *policy)
{
struct dbs_governor *gov = dbs_governor_of(policy);
@@ -453,10 +488,7 @@ static int cpufreq_governor_init(struct
policy_dbs->dbs_data = dbs_data;
policy->governor_data = policy_dbs;

- mutex_lock(&dbs_data->mutex);
- dbs_data->usage_count++;
- list_add(&policy_dbs->list, &dbs_data->policy_dbs_list);
- mutex_unlock(&dbs_data->mutex);
+ gov_attr_set_get(&dbs_data->attr_set, &policy_dbs->list);
goto out;
}

@@ -466,8 +498,7 @@ static int cpufreq_governor_init(struct
goto free_policy_dbs_info;
}

- INIT_LIST_HEAD(&dbs_data->policy_dbs_list);
- mutex_init(&dbs_data->mutex);
+ gov_attr_set_init(&dbs_data->attr_set, &policy_dbs->list);

ret = gov->init(dbs_data, !policy->governor->initialized);
if (ret)
@@ -487,14 +518,11 @@ static int cpufreq_governor_init(struct
if (!have_governor_per_policy())
gov->gdbs_data = dbs_data;

- policy->governor_data = policy_dbs;
-
policy_dbs->dbs_data = dbs_data;
- dbs_data->usage_count = 1;
- list_add(&policy_dbs->list, &dbs_data->policy_dbs_list);
+ policy->governor_data = policy_dbs;

gov->kobj_type.sysfs_ops = &governor_sysfs_ops;
- ret = kobject_init_and_add(&dbs_data->kobj, &gov->kobj_type,
+ ret = kobject_init_and_add(&dbs_data->attr_set.kobj, &gov->kobj_type,
get_governor_parent_kobj(policy),
"%s", gov->gov.name);
if (!ret)
@@ -523,29 +551,21 @@ static int cpufreq_governor_exit(struct
struct dbs_governor *gov = dbs_governor_of(policy);
struct policy_dbs_info *policy_dbs = policy->governor_data;
struct dbs_data *dbs_data = policy_dbs->dbs_data;
- int count;
+ unsigned int count;

/* Protect gov->gdbs_data against concurrent updates. */
mutex_lock(&gov_dbs_data_mutex);

- mutex_lock(&dbs_data->mutex);
- list_del(&policy_dbs->list);
- count = --dbs_data->usage_count;
- mutex_unlock(&dbs_data->mutex);
+ count = gov_attr_set_put(&dbs_data->attr_set, &policy_dbs->list);

- if (!count) {
- kobject_put(&dbs_data->kobj);
-
- policy->governor_data = NULL;
+ policy->governor_data = NULL;

+ if (!count) {
if (!have_governor_per_policy())
gov->gdbs_data = NULL;

gov->exit(dbs_data, policy->governor->initialized == 1);
- mutex_destroy(&dbs_data->mutex);
kfree(dbs_data);
- } else {
- policy->governor_data = NULL;
}

free_policy_dbs_info(policy_dbs, gov);
Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
+++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
@@ -207,9 +207,10 @@ static unsigned int od_dbs_timer(struct
/************************** sysfs interface ************************/
static struct dbs_governor od_dbs_gov;

-static ssize_t store_io_is_busy(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_io_is_busy(struct gov_attr_set *attr_set, const char *buf,
+ size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;

@@ -224,9 +225,10 @@ static ssize_t store_io_is_busy(struct d
return count;
}

-static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_up_threshold(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
@@ -240,9 +242,10 @@ static ssize_t store_up_threshold(struct
return count;
}

-static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data,
- const char *buf, size_t count)
+static ssize_t store_sampling_down_factor(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct policy_dbs_info *policy_dbs;
unsigned int input;
int ret;
@@ -254,7 +257,7 @@ static ssize_t store_sampling_down_facto
dbs_data->sampling_down_factor = input;

/* Reset down sampling multiplier in case it was active */
- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) {
+ list_for_each_entry(policy_dbs, &attr_set->policy_list, list) {
/*
* Doing this without locking might lead to using different
* rate_mult values in od_update() and od_dbs_timer().
@@ -267,9 +270,10 @@ static ssize_t store_sampling_down_facto
return count;
}

-static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data,
- const char *buf, size_t count)
+static ssize_t store_ignore_nice_load(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;

@@ -291,9 +295,10 @@ static ssize_t store_ignore_nice_load(st
return count;
}

-static ssize_t store_powersave_bias(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_powersave_bias(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct od_dbs_tuners *od_tuners = dbs_data->tuners;
struct policy_dbs_info *policy_dbs;
unsigned int input;
@@ -308,7 +313,7 @@ static ssize_t store_powersave_bias(stru

od_tuners->powersave_bias = input;

- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list)
+ list_for_each_entry(policy_dbs, &attr_set->policy_list, list)
ondemand_powersave_bias_init(policy_dbs->policy);

return count;
Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c
+++ linux-pm/drivers/cpufreq/cpufreq_conservative.c
@@ -129,9 +129,10 @@ static struct notifier_block cs_cpufreq_
/************************** sysfs interface ************************/
static struct dbs_governor cs_dbs_gov;

-static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data,
- const char *buf, size_t count)
+static ssize_t store_sampling_down_factor(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
@@ -143,9 +144,10 @@ static ssize_t store_sampling_down_facto
return count;
}

-static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_up_threshold(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input;
int ret;
@@ -158,9 +160,10 @@ static ssize_t store_up_threshold(struct
return count;
}

-static ssize_t store_down_threshold(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_down_threshold(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input;
int ret;
@@ -175,9 +178,10 @@ static ssize_t store_down_threshold(stru
return count;
}

-static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data,
- const char *buf, size_t count)
+static ssize_t store_ignore_nice_load(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;

@@ -199,9 +203,10 @@ static ssize_t store_ignore_nice_load(st
return count;
}

-static ssize_t store_freq_step(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_freq_step(struct gov_attr_set *attr_set, const char *buf,
+ size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input;
int ret;

2016-03-22 01:57:57

by Rafael J. Wysocki

Subject: [PATCH v6 4/7][Resend] cpufreq: Move governor attribute set headers to cpufreq.h

From: Rafael J. Wysocki <[email protected]>

Move definitions and function headers related to struct gov_attr_set
to include/linux/cpufreq.h so they can be used by (future) governors
located outside of drivers/cpufreq/.

No functional changes.

Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---

This one was first present in v2, no changes since then.

---
drivers/cpufreq/cpufreq_governor.h | 21 ---------------------
include/linux/cpufreq.h | 23 +++++++++++++++++++++++
2 files changed, 23 insertions(+), 21 deletions(-)

Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -41,19 +41,6 @@
/* Ondemand Sampling types */
enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};

-struct gov_attr_set {
- struct kobject kobj;
- struct list_head policy_list;
- struct mutex update_lock;
- int usage_count;
-};
-
-extern const struct sysfs_ops governor_sysfs_ops;
-
-void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node);
-void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node);
-unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node);
-
/*
* Abbreviations:
* dbs: used as a shortform for demand based switching It helps to keep variable
@@ -80,14 +67,6 @@ static inline struct dbs_data *to_dbs_da
return container_of(attr_set, struct dbs_data, attr_set);
}

-/* Governor's specific attributes */
-struct governor_attr {
- struct attribute attr;
- ssize_t (*show)(struct gov_attr_set *attr_set, char *buf);
- ssize_t (*store)(struct gov_attr_set *attr_set, const char *buf,
- size_t count);
-};
-
#define gov_show_one(_gov, file_name) \
static ssize_t show_##file_name \
(struct gov_attr_set *attr_set, char *buf) \
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -462,6 +462,29 @@ void cpufreq_unregister_governor(struct
struct cpufreq_governor *cpufreq_default_governor(void);
struct cpufreq_governor *cpufreq_fallback_governor(void);

+/* Governor attribute set */
+struct gov_attr_set {
+ struct kobject kobj;
+ struct list_head policy_list;
+ struct mutex update_lock;
+ int usage_count;
+};
+
+/* sysfs ops for cpufreq governors */
+extern const struct sysfs_ops governor_sysfs_ops;
+
+void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node);
+void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node);
+unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node);
+
+/* Governor sysfs attribute */
+struct governor_attr {
+ struct attribute attr;
+ ssize_t (*show)(struct gov_attr_set *attr_set, char *buf);
+ ssize_t (*store)(struct gov_attr_set *attr_set, const char *buf,
+ size_t count);
+};
+
/*********************************************************************
* FREQUENCY TABLE HELPERS *
*********************************************************************/

2016-03-22 01:57:56

by Rafael J. Wysocki

Subject: [PATCH v6 6/7][Resend] cpufreq: Support for fast frequency switching

From: Rafael J. Wysocki <[email protected]>

Modify the ACPI cpufreq driver to provide a method for switching
CPU frequencies from interrupt context and update the cpufreq core
to support that method if available.

Introduce a new cpufreq driver callback, ->fast_switch, to be
invoked for frequency switching from interrupt context by (future)
governors supporting that feature via (new) helper function
cpufreq_driver_fast_switch().

Add two new policy flags: fast_switch_possible, to be set by the
cpufreq driver if fast frequency switching can be used for the
given policy, and fast_switch_enabled, to be set by the governor
if it is going to use fast frequency switching for the given
policy. Also add a helper for setting the latter.

Since fast frequency switching is inherently incompatible with
cpufreq transition notifiers, make it possible to set the
fast_switch_enabled only if there are no transition notifiers
already registered and make the registration of new transition
notifiers fail if fast_switch_enabled is set for at least one
policy.

Implement the ->fast_switch callback in the ACPI cpufreq driver
and make it set fast_switch_possible during policy initialization
as appropriate.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

Changes from v5:
- cpufreq_enable_fast_switch() fixed to avoid printing a confusing message
if fast_switch_possible is not set for the policy.
- Fixed a typo in that message.
- Removed the WARN_ON() from the (cpufreq_fast_switch_count > 0) check in
cpufreq_register_notifier(), because it triggered false-positive warnings
from the cpufreq_stats module (cpufreq_stats don't work with the fast
switching, because it is based on notifiers).

Changes from v4:
- If cpufreq_enable_fast_switch() is about to fail, it will print the list
of currently registered transition notifiers.
- Added lockdep_assert_held(&policy->rwsem) to cpufreq_enable_fast_switch().
- Added WARN_ON() to the (cpufreq_fast_switch_count > 0) check in
cpufreq_register_notifier().
- Modified the kerneldoc comment of cpufreq_driver_fast_switch() to
mention the RELATION_L expectation regarding the ->fast_switch callback.

Changes from v3:
- New fast_switch_enabled field in struct cpufreq_policy to help
avoid affecting existing setups by setting the fast_switch_possible
flag in the driver.
- __cpufreq_get() skips the policy->cur check if fast_switch_enabled is set.

Changes from v2:
- The driver ->fast_switch callback and cpufreq_driver_fast_switch()
don't need the relation argument as they will always do RELATION_L now.
- New mechanism to make fast switch and cpufreq notifiers mutually
exclusive.
- cpufreq_driver_fast_switch() doesn't do anything in addition to
invoking the driver callback and returns its return value.

---
drivers/cpufreq/acpi-cpufreq.c | 41 ++++++++++++
drivers/cpufreq/cpufreq.c | 130 ++++++++++++++++++++++++++++++++++++++---
include/linux/cpufreq.h | 9 ++
3 files changed, 171 insertions(+), 9 deletions(-)

Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
+++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
@@ -458,6 +458,43 @@ static int acpi_cpufreq_target(struct cp
return result;
}

+unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq)
+{
+ struct acpi_cpufreq_data *data = policy->driver_data;
+ struct acpi_processor_performance *perf;
+ struct cpufreq_frequency_table *entry;
+ unsigned int next_perf_state, next_freq, freq;
+
+ /*
+ * Find the closest frequency above target_freq.
+ *
+ * The table is sorted in the reverse order with respect to the
+ * frequency and all of the entries are valid (see the initialization).
+ */
+ entry = data->freq_table;
+ do {
+ entry++;
+ freq = entry->frequency;
+ } while (freq >= target_freq && freq != CPUFREQ_TABLE_END);
+ entry--;
+ next_freq = entry->frequency;
+ next_perf_state = entry->driver_data;
+
+ perf = to_perf_data(data);
+ if (perf->state == next_perf_state) {
+ if (unlikely(data->resume))
+ data->resume = 0;
+ else
+ return next_freq;
+ }
+
+ data->cpu_freq_write(&perf->control_register,
+ perf->states[next_perf_state].control);
+ perf->state = next_perf_state;
+ return next_freq;
+}
+
static unsigned long
acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu)
{
@@ -740,6 +777,9 @@ static int acpi_cpufreq_cpu_init(struct
goto err_unreg;
}

+ policy->fast_switch_possible = !acpi_pstate_strict &&
+ !(policy_is_shared(policy) && policy->shared_type != CPUFREQ_SHARED_TYPE_ANY);
+
data->freq_table = kzalloc(sizeof(*data->freq_table) *
(perf->state_count+1), GFP_KERNEL);
if (!data->freq_table) {
@@ -874,6 +914,7 @@ static struct freq_attr *acpi_cpufreq_at
static struct cpufreq_driver acpi_cpufreq_driver = {
.verify = cpufreq_generic_frequency_table_verify,
.target_index = acpi_cpufreq_target,
+ .fast_switch = acpi_cpufreq_fast_switch,
.bios_limit = acpi_processor_get_bios_limit,
.init = acpi_cpufreq_cpu_init,
.exit = acpi_cpufreq_cpu_exit,
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -102,6 +102,10 @@ struct cpufreq_policy {
*/
struct rw_semaphore rwsem;

+ /* Fast switch flags */
+ bool fast_switch_possible; /* Set by the driver. */
+ bool fast_switch_enabled;
+
/* Synchronization for frequency transitions */
bool transition_ongoing; /* Tracks transition status */
spinlock_t transition_lock;
@@ -156,6 +160,7 @@ int cpufreq_get_policy(struct cpufreq_po
int cpufreq_update_policy(unsigned int cpu);
bool have_governor_per_policy(void);
struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
+void cpufreq_enable_fast_switch(struct cpufreq_policy *policy);
#else
static inline unsigned int cpufreq_get(unsigned int cpu)
{
@@ -236,6 +241,8 @@ struct cpufreq_driver {
unsigned int relation); /* Deprecated */
int (*target_index)(struct cpufreq_policy *policy,
unsigned int index);
+ unsigned int (*fast_switch)(struct cpufreq_policy *policy,
+ unsigned int target_freq);
/*
* Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION
* unset.
@@ -464,6 +471,8 @@ struct cpufreq_governor {
};

/* Pass a target to the cpufreq driver */
+unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq);
int cpufreq_driver_target(struct cpufreq_policy *policy,
unsigned int target_freq,
unsigned int relation);
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -428,6 +428,57 @@ void cpufreq_freq_transition_end(struct
}
EXPORT_SYMBOL_GPL(cpufreq_freq_transition_end);

+/*
+ * Fast frequency switching status count. Positive means "enabled", negative
+ * means "disabled" and 0 means "not decided yet".
+ */
+static int cpufreq_fast_switch_count;
+static DEFINE_MUTEX(cpufreq_fast_switch_lock);
+
+static void cpufreq_list_transition_notifiers(void)
+{
+ struct notifier_block *nb;
+
+ pr_info("cpufreq: Registered transition notifiers:\n");
+
+ mutex_lock(&cpufreq_transition_notifier_list.mutex);
+
+ for (nb = cpufreq_transition_notifier_list.head; nb; nb = nb->next)
+ pr_info("cpufreq: %pF\n", nb->notifier_call);
+
+ mutex_unlock(&cpufreq_transition_notifier_list.mutex);
+}
+
+/**
+ * cpufreq_enable_fast_switch - Enable fast frequency switching for policy.
+ * @policy: cpufreq policy to enable fast frequency switching for.
+ *
+ * Try to enable fast frequency switching for @policy.
+ *
+ * The attempt will fail if there is at least one transition notifier registered
+ * at this point, as fast frequency switching is quite fundamentally at odds
+ * with transition notifiers. Thus if successful, it will make registration of
+ * transition notifiers fail going forward.
+ */
+void cpufreq_enable_fast_switch(struct cpufreq_policy *policy)
+{
+ lockdep_assert_held(&policy->rwsem);
+
+ if (!policy->fast_switch_possible)
+ return;
+
+ mutex_lock(&cpufreq_fast_switch_lock);
+ if (cpufreq_fast_switch_count >= 0) {
+ cpufreq_fast_switch_count++;
+ policy->fast_switch_enabled = true;
+ } else {
+ pr_warn("cpufreq: CPU%u: Fast frequency switching not enabled\n",
+ policy->cpu);
+ cpufreq_list_transition_notifiers();
+ }
+ mutex_unlock(&cpufreq_fast_switch_lock);
+}
+EXPORT_SYMBOL_GPL(cpufreq_enable_fast_switch);

/*********************************************************************
* SYSFS INTERFACE *
@@ -1083,6 +1134,24 @@ static void cpufreq_policy_free(struct c
kfree(policy);
}

+static void cpufreq_driver_exit_policy(struct cpufreq_policy *policy)
+{
+ if (policy->fast_switch_enabled) {
+ mutex_lock(&cpufreq_fast_switch_lock);
+
+ policy->fast_switch_enabled = false;
+ if (!WARN_ON(cpufreq_fast_switch_count <= 0))
+ cpufreq_fast_switch_count--;
+
+ mutex_unlock(&cpufreq_fast_switch_lock);
+ }
+
+ if (cpufreq_driver->exit) {
+ cpufreq_driver->exit(policy);
+ policy->freq_table = NULL;
+ }
+}
+
static int cpufreq_online(unsigned int cpu)
{
struct cpufreq_policy *policy;
@@ -1236,8 +1305,7 @@ static int cpufreq_online(unsigned int c
out_exit_policy:
up_write(&policy->rwsem);

- if (cpufreq_driver->exit)
- cpufreq_driver->exit(policy);
+ cpufreq_driver_exit_policy(policy);
out_free_policy:
cpufreq_policy_free(policy, !new_policy);
return ret;
@@ -1334,10 +1402,7 @@ static void cpufreq_offline(unsigned int
* since this is a core component, and is essential for the
* subsequent light-weight ->init() to succeed.
*/
- if (cpufreq_driver->exit) {
- cpufreq_driver->exit(policy);
- policy->freq_table = NULL;
- }
+ cpufreq_driver_exit_policy(policy);

unlock:
up_write(&policy->rwsem);
@@ -1452,8 +1517,12 @@ static unsigned int __cpufreq_get(struct

ret_freq = cpufreq_driver->get(policy->cpu);

- /* Updating inactive policies is invalid, so avoid doing that. */
- if (unlikely(policy_is_inactive(policy)))
+ /*
+ * Updating inactive policies is invalid, so avoid doing that. Also
+ * if fast frequency switching is used with the given policy, the check
+ * against policy->cur is pointless, so skip it in that case too.
+ */
+ if (unlikely(policy_is_inactive(policy)) || policy->fast_switch_enabled)
return ret_freq;

if (ret_freq && policy->cur &&
@@ -1465,7 +1534,6 @@ static unsigned int __cpufreq_get(struct
schedule_work(&policy->update);
}
}
-
return ret_freq;
}

@@ -1672,8 +1740,18 @@ int cpufreq_register_notifier(struct not

switch (list) {
case CPUFREQ_TRANSITION_NOTIFIER:
+ mutex_lock(&cpufreq_fast_switch_lock);
+
+ if (cpufreq_fast_switch_count > 0) {
+ mutex_unlock(&cpufreq_fast_switch_lock);
+ return -EBUSY;
+ }
ret = srcu_notifier_chain_register(
&cpufreq_transition_notifier_list, nb);
+ if (!ret)
+ cpufreq_fast_switch_count--;
+
+ mutex_unlock(&cpufreq_fast_switch_lock);
break;
case CPUFREQ_POLICY_NOTIFIER:
ret = blocking_notifier_chain_register(
@@ -1706,8 +1784,14 @@ int cpufreq_unregister_notifier(struct n

switch (list) {
case CPUFREQ_TRANSITION_NOTIFIER:
+ mutex_lock(&cpufreq_fast_switch_lock);
+
ret = srcu_notifier_chain_unregister(
&cpufreq_transition_notifier_list, nb);
+ if (!ret && !WARN_ON(cpufreq_fast_switch_count >= 0))
+ cpufreq_fast_switch_count++;
+
+ mutex_unlock(&cpufreq_fast_switch_lock);
break;
case CPUFREQ_POLICY_NOTIFIER:
ret = blocking_notifier_chain_unregister(
@@ -1726,6 +1810,34 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie
* GOVERNORS *
*********************************************************************/

+/**
+ * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch.
+ * @policy: cpufreq policy to switch the frequency for.
+ * @target_freq: New frequency to set (may be approximate).
+ *
+ * Carry out a fast frequency switch from interrupt context.
+ *
+ * The driver's ->fast_switch() callback invoked by this function is expected to
+ * select the minimum available frequency greater than or equal to @target_freq
+ * (CPUFREQ_RELATION_L).
+ *
+ * This function must not be called if policy->fast_switch_enabled is unset.
+ *
+ * Governors calling this function must guarantee that it will never be invoked
+ * twice in parallel for the same policy and that it will never be called in
+ * parallel with either ->target() or ->target_index() for the same policy.
+ *
+ * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch()
+ * callback to indicate an error condition, the hardware configuration must be
+ * preserved.
+ */
+unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq)
+{
+ return cpufreq_driver->fast_switch(policy, target_freq);
+}
+EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch);
+
/* Must set freqs->new to intermediate frequency */
static int __target_intermediate(struct cpufreq_policy *policy,
struct cpufreq_freqs *freqs, int index)

2016-03-22 01:57:54

by Rafael J. Wysocki

Subject: [PATCH v6 3/7][Resend] cpufreq: governor: Move abstract gov_attr_set code to separate file

From: Rafael J. Wysocki <[email protected]>

Move abstract code related to struct gov_attr_set to a separate (new)
file so it can be shared with (future) governors that won't share
more code with "ondemand" and "conservative".

No intentional functional changes.

Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---

No changes from previous versions.

---
drivers/cpufreq/Kconfig | 4 +
drivers/cpufreq/Makefile | 1
drivers/cpufreq/cpufreq_governor.c | 82 ---------------------------
drivers/cpufreq/cpufreq_governor.h | 6 ++
drivers/cpufreq/cpufreq_governor_attr_set.c | 84 ++++++++++++++++++++++++++++
5 files changed, 95 insertions(+), 82 deletions(-)

Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -18,7 +18,11 @@ config CPU_FREQ

if CPU_FREQ

+config CPU_FREQ_GOV_ATTR_SET
+ bool
+
config CPU_FREQ_GOV_COMMON
+ select CPU_FREQ_GOV_ATTR_SET
select IRQ_WORK
bool

Index: linux-pm/drivers/cpufreq/Makefile
===================================================================
--- linux-pm.orig/drivers/cpufreq/Makefile
+++ linux-pm/drivers/cpufreq/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE) +=
obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND) += cpufreq_ondemand.o
obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE) += cpufreq_conservative.o
obj-$(CONFIG_CPU_FREQ_GOV_COMMON) += cpufreq_governor.o
+obj-$(CONFIG_CPU_FREQ_GOV_ATTR_SET) += cpufreq_governor_attr_set.o

obj-$(CONFIG_CPUFREQ_DT) += cpufreq-dt.o

Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -112,53 +112,6 @@ void gov_update_cpu_data(struct dbs_data
}
EXPORT_SYMBOL_GPL(gov_update_cpu_data);

-static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj)
-{
- return container_of(kobj, struct gov_attr_set, kobj);
-}
-
-static inline struct governor_attr *to_gov_attr(struct attribute *attr)
-{
- return container_of(attr, struct governor_attr, attr);
-}
-
-static ssize_t governor_show(struct kobject *kobj, struct attribute *attr,
- char *buf)
-{
- struct governor_attr *gattr = to_gov_attr(attr);
-
- return gattr->show(to_gov_attr_set(kobj), buf);
-}
-
-static ssize_t governor_store(struct kobject *kobj, struct attribute *attr,
- const char *buf, size_t count)
-{
- struct gov_attr_set *attr_set = to_gov_attr_set(kobj);
- struct governor_attr *gattr = to_gov_attr(attr);
- int ret = -EBUSY;
-
- mutex_lock(&attr_set->update_lock);
-
- if (attr_set->usage_count)
- ret = gattr->store(attr_set, buf, count);
-
- mutex_unlock(&attr_set->update_lock);
-
- return ret;
-}
-
-/*
- * Sysfs Ops for accessing governor attributes.
- *
- * All show/store invocations for governor specific sysfs attributes, will first
- * call the below show/store callbacks and the attribute specific callback will
- * be called from within it.
- */
-static const struct sysfs_ops governor_sysfs_ops = {
- .show = governor_show,
- .store = governor_store,
-};
-
unsigned int dbs_update(struct cpufreq_policy *policy)
{
struct policy_dbs_info *policy_dbs = policy->governor_data;
@@ -425,41 +378,6 @@ static void free_policy_dbs_info(struct
gov->free(policy_dbs);
}

-static void gov_attr_set_init(struct gov_attr_set *attr_set,
- struct list_head *list_node)
-{
- INIT_LIST_HEAD(&attr_set->policy_list);
- mutex_init(&attr_set->update_lock);
- attr_set->usage_count = 1;
- list_add(list_node, &attr_set->policy_list);
-}
-
-static void gov_attr_set_get(struct gov_attr_set *attr_set,
- struct list_head *list_node)
-{
- mutex_lock(&attr_set->update_lock);
- attr_set->usage_count++;
- list_add(list_node, &attr_set->policy_list);
- mutex_unlock(&attr_set->update_lock);
-}
-
-static unsigned int gov_attr_set_put(struct gov_attr_set *attr_set,
- struct list_head *list_node)
-{
- unsigned int count;
-
- mutex_lock(&attr_set->update_lock);
- list_del(list_node);
- count = --attr_set->usage_count;
- mutex_unlock(&attr_set->update_lock);
- if (count)
- return count;
-
- kobject_put(&attr_set->kobj);
- mutex_destroy(&attr_set->update_lock);
- return 0;
-}
-
static int cpufreq_governor_init(struct cpufreq_policy *policy)
{
struct dbs_governor *gov = dbs_governor_of(policy);
Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -48,6 +48,12 @@ struct gov_attr_set {
int usage_count;
};

+extern const struct sysfs_ops governor_sysfs_ops;
+
+void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node);
+void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node);
+unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node);
+
/*
* Abbreviations:
* dbs: used as a shortform for demand based switching It helps to keep variable
Index: linux-pm/drivers/cpufreq/cpufreq_governor_attr_set.c
===================================================================
--- /dev/null
+++ linux-pm/drivers/cpufreq/cpufreq_governor_attr_set.c
@@ -0,0 +1,84 @@
+/*
+ * Abstract code for CPUFreq governor tunable sysfs attributes.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "cpufreq_governor.h"
+
+static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj)
+{
+ return container_of(kobj, struct gov_attr_set, kobj);
+}
+
+static inline struct governor_attr *to_gov_attr(struct attribute *attr)
+{
+ return container_of(attr, struct governor_attr, attr);
+}
+
+static ssize_t governor_show(struct kobject *kobj, struct attribute *attr,
+ char *buf)
+{
+ struct governor_attr *gattr = to_gov_attr(attr);
+
+ return gattr->show(to_gov_attr_set(kobj), buf);
+}
+
+static ssize_t governor_store(struct kobject *kobj, struct attribute *attr,
+ const char *buf, size_t count)
+{
+ struct gov_attr_set *attr_set = to_gov_attr_set(kobj);
+ struct governor_attr *gattr = to_gov_attr(attr);
+ int ret;
+
+ mutex_lock(&attr_set->update_lock);
+ ret = attr_set->usage_count ? gattr->store(attr_set, buf, count) : -EBUSY;
+ mutex_unlock(&attr_set->update_lock);
+ return ret;
+}
+
+const struct sysfs_ops governor_sysfs_ops = {
+ .show = governor_show,
+ .store = governor_store,
+};
+EXPORT_SYMBOL_GPL(governor_sysfs_ops);
+
+void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node)
+{
+ INIT_LIST_HEAD(&attr_set->policy_list);
+ mutex_init(&attr_set->update_lock);
+ attr_set->usage_count = 1;
+ list_add(list_node, &attr_set->policy_list);
+}
+EXPORT_SYMBOL_GPL(gov_attr_set_init);
+
+void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node)
+{
+ mutex_lock(&attr_set->update_lock);
+ attr_set->usage_count++;
+ list_add(list_node, &attr_set->policy_list);
+ mutex_unlock(&attr_set->update_lock);
+}
+EXPORT_SYMBOL_GPL(gov_attr_set_get);
+
+unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node)
+{
+ unsigned int count;
+
+ mutex_lock(&attr_set->update_lock);
+ list_del(list_node);
+ count = --attr_set->usage_count;
+ mutex_unlock(&attr_set->update_lock);
+ if (count)
+ return count;
+
+ kobject_put(&attr_set->kobj);
+ mutex_destroy(&attr_set->update_lock);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(gov_attr_set_put);

2016-03-26 01:12:17

by Steve Muckle

Subject: Re: [PATCH v6 6/7][Resend] cpufreq: Support for fast frequency switching

Hi Rafael,

On 03/21/2016 06:53 PM, Rafael J. Wysocki wrote:
> Add two new policy flags, fast_switch_possible, to be set by the
> cpufreq driver if fast frequency switching can be used for the
> given policy and fast_switch_enabled, to be set by the governor
> if it is going to use fast frequency switching for the given
> policy. Also add a helper for setting the latter.
...
> @@ -740,6 +777,9 @@ static int acpi_cpufreq_cpu_init(struct
> goto err_unreg;
> }
>
> + policy->fast_switch_possible = !acpi_pstate_strict &&
> + !(policy_is_shared(policy) && policy->shared_type != CPUFREQ_SHARED_TYPE_ANY);

Could the policy->fast_switch_possible flag be avoided by just checking
whether a driver has registered the .fast_switch callback?

...
> @@ -1726,6 +1810,34 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie
> * GOVERNORS *
> *********************************************************************/
>
> +/**
> + * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch.
> + * @policy: cpufreq policy to switch the frequency for.
> + * @target_freq: New frequency to set (may be approximate).
> + *
> + * Carry out a fast frequency switch from interrupt context.

I think that should say atomic rather than interrupt as this might not
be called from interrupt context.

thanks,
Steve

2016-03-26 01:12:23

by Steve Muckle

Subject: Re: [PATCH v6 7/7][Resend] cpufreq: schedutil: New governor based on scheduler utilization data

Hi Rafael,

On 03/21/2016 06:54 PM, Rafael J. Wysocki wrote:
...
> +config CPU_FREQ_GOV_SCHEDUTIL
> + tristate "'schedutil' cpufreq policy governor"
> + depends on CPU_FREQ
> + select CPU_FREQ_GOV_ATTR_SET
> + select IRQ_WORK
> + help
> + The frequency selection formula used by this governor is analogous
> + to the one used by 'ondemand', but instead of computing CPU load
> + as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
> + utilization data provided by the scheduler as input.

The formula's changed a bit from ondemand - can the formula description
in the commit text be repackaged a bit and used here?

...
> +
> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
> + unsigned int next_freq)
> +{
> + struct cpufreq_policy *policy = sg_policy->policy;
> +
> + sg_policy->last_freq_update_time = time;
> +
> + if (policy->fast_switch_enabled) {
> + if (next_freq > policy->max)
> + next_freq = policy->max;
> + else if (next_freq < policy->min)
> + next_freq = policy->min;

The __cpufreq_driver_target() interface has this capping in it. For
uniformity should this be pushed into cpufreq_driver_fast_switch()?

> +
> + if (sg_policy->next_freq == next_freq) {
> + trace_cpu_frequency(policy->cur, smp_processor_id());
> + return;
> + }

I fear this may bloat traces unnecessarily as there may be long
stretches when a frequency domain is at the same frequency (especially
fmin or fmax).

...
> +static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
> + unsigned long util, unsigned long max)
> +{
> + struct cpufreq_policy *policy = sg_policy->policy;
> + unsigned int max_f = policy->cpuinfo.max_freq;
> + u64 last_freq_update_time = sg_policy->last_freq_update_time;
> + unsigned int j;
> +
> + if (util == ULONG_MAX)
> + return max_f;
> +
> + for_each_cpu(j, policy->cpus) {
> + struct sugov_cpu *j_sg_cpu;
> + unsigned long j_util, j_max;
> + u64 delta_ns;
> +
> + if (j == smp_processor_id())
> + continue;
> +
> + j_sg_cpu = &per_cpu(sugov_cpu, j);
> + /*
> + * If the CPU utilization was last updated before the previous
> + * frequency update and the time elapsed between the last update
> + * of the CPU utilization and the last frequency update is long
> + * enough, don't take the CPU into account as it probably is
> + * idle now.
> + */
> + delta_ns = last_freq_update_time - j_sg_cpu->last_update;
> + if ((s64)delta_ns > TICK_NSEC)

> Why not declare delta_ns as an s64 (also in sugov_should_update_freq)
and avoid the cast?

...
> +static int sugov_limits(struct cpufreq_policy *policy)
> +{
> + struct sugov_policy *sg_policy = policy->governor_data;
> +
> + if (!policy->fast_switch_enabled) {
> + mutex_lock(&sg_policy->work_lock);
> +
> + if (policy->max < policy->cur)
> + __cpufreq_driver_target(policy, policy->max,
> + CPUFREQ_RELATION_H);
> + else if (policy->min > policy->cur)
> + __cpufreq_driver_target(policy, policy->min,
> + CPUFREQ_RELATION_L);
> +
> + mutex_unlock(&sg_policy->work_lock);
> + }

Is the expectation that in the fast_switch_enabled case we should
re-evaluate soon enough that an explicit fixup is not required here? I'm
worried as to whether that will always be true given the possible
criticality of applying frequency limits (thermal for example).

thanks,
Steve

2016-03-26 01:46:09

by Rafael J. Wysocki

Subject: Re: [PATCH v6 6/7][Resend] cpufreq: Support for fast frequency switching

On Sat, Mar 26, 2016 at 2:12 AM, Steve Muckle <[email protected]> wrote:
> Hi Rafael,
>
> On 03/21/2016 06:53 PM, Rafael J. Wysocki wrote:
>> Add two new policy flags, fast_switch_possible, to be set by the
>> cpufreq driver if fast frequency switching can be used for the
>> given policy and fast_switch_enabled, to be set by the governor
>> if it is going to use fast frequency switching for the given
>> policy. Also add a helper for setting the latter.
> ...
>> @@ -740,6 +777,9 @@ static int acpi_cpufreq_cpu_init(struct
>> goto err_unreg;
>> }
>>
>> + policy->fast_switch_possible = !acpi_pstate_strict &&
>> + !(policy_is_shared(policy) && policy->shared_type != CPUFREQ_SHARED_TYPE_ANY);
>
> Could the policy->fast_switch_possible flag be avoided by just checking
> whether a driver has registered the .fast_switch callback?

No, it couldn't.

As in this case, the driver has the ->fast_switch callback, but it
can't be used for policies that don't satisfy the above condition. At
the same time it may be possible to use it for other policies on the
same system in principle.

> ...
>> @@ -1726,6 +1810,34 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie
>> * GOVERNORS *
>> *********************************************************************/
>>
>> +/**
>> + * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch.
>> + * @policy: cpufreq policy to switch the frequency for.
>> + * @target_freq: New frequency to set (may be approximate).
>> + *
>> + * Carry out a fast frequency switch from interrupt context.
>
> I think that should say atomic rather than interrupt as this might not
> be called from interrupt context.

"Interrupt context" here means something like "context that cannot
sleep" and it's sort of a traditional way of calling that. I
considered saying "atomic context" here, but then decided that it
might suggest too much.

Maybe something like "Carry out a fast frequency switch without
sleeping" would be better?

Thanks,
Rafael

2016-03-26 02:05:23

by Rafael J. Wysocki

Subject: Re: [PATCH v6 7/7][Resend] cpufreq: schedutil: New governor based on scheduler utilization data

On Sat, Mar 26, 2016 at 2:12 AM, Steve Muckle <[email protected]> wrote:
> Hi Rafael,
>
> On 03/21/2016 06:54 PM, Rafael J. Wysocki wrote:
> ...
>> +config CPU_FREQ_GOV_SCHEDUTIL
>> + tristate "'schedutil' cpufreq policy governor"
>> + depends on CPU_FREQ
>> + select CPU_FREQ_GOV_ATTR_SET
>> + select IRQ_WORK
>> + help
>> + The frequency selection formula used by this governor is analogous
>> + to the one used by 'ondemand', but instead of computing CPU load
>> + as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
>> + utilization data provided by the scheduler as input.
>
> The formula's changed a bit from ondemand - can the formula description
> in the commit text be repackaged a bit and used here?

Right, I forgot to update this help text.

I'll figure out what to do here.

> ...
>> +
>> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>> + unsigned int next_freq)
>> +{
>> + struct cpufreq_policy *policy = sg_policy->policy;
>> +
>> + sg_policy->last_freq_update_time = time;
>> +
>> + if (policy->fast_switch_enabled) {
>> + if (next_freq > policy->max)
>> + next_freq = policy->max;
>> + else if (next_freq < policy->min)
>> + next_freq = policy->min;
>
> The __cpufreq_driver_target() interface has this capping in it. For
> uniformity should this be pushed into cpufreq_driver_fast_switch()?

It could, but see below.

>> +
>> + if (sg_policy->next_freq == next_freq) {
>> + trace_cpu_frequency(policy->cur, smp_processor_id());
>> + return;
>> + }
>
> I fear this may bloat traces unnecessarily as there may be long
> stretches when a frequency domain is at the same frequency (especially
> fmin or fmax).

I put it here, because without it powertop reports that the CPU is
idle in situations like these.

> ...
>> +static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
>> + unsigned long util, unsigned long max)
>> +{
>> + struct cpufreq_policy *policy = sg_policy->policy;
>> + unsigned int max_f = policy->cpuinfo.max_freq;
>> + u64 last_freq_update_time = sg_policy->last_freq_update_time;
>> + unsigned int j;
>> +
>> + if (util == ULONG_MAX)
>> + return max_f;
>> +
>> + for_each_cpu(j, policy->cpus) {
>> + struct sugov_cpu *j_sg_cpu;
>> + unsigned long j_util, j_max;
>> + u64 delta_ns;
>> +
>> + if (j == smp_processor_id())
>> + continue;
>> +
>> + j_sg_cpu = &per_cpu(sugov_cpu, j);
>> + /*
>> + * If the CPU utilization was last updated before the previous
>> + * frequency update and the time elapsed between the last update
>> + * of the CPU utilization and the last frequency update is long
>> + * enough, don't take the CPU into account as it probably is
>> + * idle now.
>> + */
>> + delta_ns = last_freq_update_time - j_sg_cpu->last_update;
>> + if ((s64)delta_ns > TICK_NSEC)
>
>> Why not declare delta_ns as an s64 (also in sugov_should_update_freq)
> and avoid the cast?

I took this from __update_load_avg(), but it shouldn't matter here.

> ...
>> +static int sugov_limits(struct cpufreq_policy *policy)
>> +{
>> + struct sugov_policy *sg_policy = policy->governor_data;
>> +
>> + if (!policy->fast_switch_enabled) {
>> + mutex_lock(&sg_policy->work_lock);
>> +
>> + if (policy->max < policy->cur)
>> + __cpufreq_driver_target(policy, policy->max,
>> + CPUFREQ_RELATION_H);
>> + else if (policy->min > policy->cur)
>> + __cpufreq_driver_target(policy, policy->min,
>> + CPUFREQ_RELATION_L);
>> +
>> + mutex_unlock(&sg_policy->work_lock);
>> + }
>
> Is the expectation that in the fast_switch_enabled case we should
> re-evaluate soon enough that an explicit fixup is not required here?

Yes, it is.

> I'm worried as to whether that will always be true given the possible
> criticality of applying frequency limits (thermal for example).

The part of the patch below that you cut actually takes care of that:

sg_policy->need_freq_update = true;

which causes the rate limit to be ignored essentially, so the
frequency will be changed on the first update from the scheduler.
Which also is why the min/max check is before the sg_policy->next_freq
== next_freq check in sugov_update_commit().

I wanted to avoid locking in the fast switch/one CPU per policy case
which otherwise would be necessary just for the handling of this
thing. I'd like to keep it the way it is unless it can be clearly
demonstrated that it really would lead to problems in practice in a
real system.

Thanks,
Rafael

2016-03-27 01:27:16

by Rafael J. Wysocki

Subject: Re: [PATCH v6 6/7][Resend] cpufreq: Support for fast frequency switching

On Sat, Mar 26, 2016 at 2:46 AM, Rafael J. Wysocki <[email protected]> wrote:
> On Sat, Mar 26, 2016 at 2:12 AM, Steve Muckle <[email protected]> wrote:
>> Hi Rafael,
>>
>> On 03/21/2016 06:53 PM, Rafael J. Wysocki wrote:
>>> Add two new policy flags, fast_switch_possible, to be set by the
>>> cpufreq driver if fast frequency switching can be used for the
>>> given policy and fast_switch_enabled, to be set by the governor
>>> if it is going to use fast frequency switching for the given
>>> policy. Also add a helper for setting the latter.
>> ...
>>> @@ -740,6 +777,9 @@ static int acpi_cpufreq_cpu_init(struct
>>> goto err_unreg;
>>> }
>>>
>>> + policy->fast_switch_possible = !acpi_pstate_strict &&
>>> + !(policy_is_shared(policy) && policy->shared_type != CPUFREQ_SHARED_TYPE_ANY);
>>
>> Could the policy->fast_switch_possible flag be avoided by just checking
>> whether a driver has registered the .fast_switch callback?
>
> No, it couldn't.
>
> As in this case, the driver has the ->fast_switch callback, but it
> can't be used for policies that don't satisfy the above condition. At
> the same time it may be possible to use it for other policies on the
> same system in principle.

In fact, for fast switching to be useful, the driver has to guarantee
that frequency can be updated on any of the policy CPUs (and it
doesn't matter which of them updates the frequency) and that's what
the fast_switch_possible flag is really for. I guess I should add a
comment to that effect to its definition.

Thanks,
Rafael

2016-03-27 01:37:04

by Rafael J. Wysocki

Subject: Re: [PATCH v6 7/7][Resend] cpufreq: schedutil: New governor based on scheduler utilization data

On Sat, Mar 26, 2016 at 3:05 AM, Rafael J. Wysocki <[email protected]> wrote:
> On Sat, Mar 26, 2016 at 2:12 AM, Steve Muckle <[email protected]> wrote:
>> Hi Rafael,
>>
>> On 03/21/2016 06:54 PM, Rafael J. Wysocki wrote:
>> ...
>>> +config CPU_FREQ_GOV_SCHEDUTIL
>>> + tristate "'schedutil' cpufreq policy governor"
>>> + depends on CPU_FREQ
>>> + select CPU_FREQ_GOV_ATTR_SET
>>> + select IRQ_WORK
>>> + help
>>> + The frequency selection formula used by this governor is analogous
>>> + to the one used by 'ondemand', but instead of computing CPU load
>>> + as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
>>> + utilization data provided by the scheduler as input.
>>
>> The formula's changed a bit from ondemand - can the formula description
>> in the commit text be repackaged a bit and used here?
>
> Right, I forgot to update this help text.
>
> I'll figure out what to do here.
>
>> ...
>>> +
>>> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>>> + unsigned int next_freq)
>>> +{
>>> + struct cpufreq_policy *policy = sg_policy->policy;
>>> +
>>> + sg_policy->last_freq_update_time = time;
>>> +
>>> + if (policy->fast_switch_enabled) {
>>> + if (next_freq > policy->max)
>>> + next_freq = policy->max;
>>> + else if (next_freq < policy->min)
>>> + next_freq = policy->min;
>>
>> The __cpufreq_driver_target() interface has this capping in it. For
>> uniformity should this be pushed into cpufreq_driver_fast_switch()?
>
> It could, but see below.

It should be doable regardless unless I'm overlooking something. Will try.

[cut]

>> ...
>>> +static int sugov_limits(struct cpufreq_policy *policy)
>>> +{
>>> + struct sugov_policy *sg_policy = policy->governor_data;
>>> +
>>> + if (!policy->fast_switch_enabled) {
>>> + mutex_lock(&sg_policy->work_lock);
>>> +
>>> + if (policy->max < policy->cur)
>>> + __cpufreq_driver_target(policy, policy->max,
>>> + CPUFREQ_RELATION_H);
>>> + else if (policy->min > policy->cur)
>>> + __cpufreq_driver_target(policy, policy->min,
>>> + CPUFREQ_RELATION_L);
>>> +
>>> + mutex_unlock(&sg_policy->work_lock);
>>> + }
>>
>> Is the expectation that in the fast_switch_enabled case we should
>> re-evaluate soon enough that an explicit fixup is not required here?
>
> Yes, it is.
>
>> I'm worried as to whether that will always be true given the possible
>> criticality of applying frequency limits (thermal for example).
>
> The part of the patch below that you cut actually takes care of that:
>
> sg_policy->need_freq_update = true;
>
> which causes the rate limit to be ignored essentially, so the
> frequency will be changed on the first update from the scheduler.
> Which also is why the min/max check is before the sg_policy->next_freq
> == next_freq check in sugov_update_commit().
>
> I wanted to avoid locking in the fast switch/one CPU per policy case
> which otherwise would be necessary just for the handling of this
> thing. I'd like to keep it the way it is unless it can be clearly
> demonstrated that it really would lead to problems in practice in a
> real system.

Besides, even if frequency is updated directly from here in the "fast
switch" case, that still doesn't guarantee that it will be updated
immediately, because the task running this code may be preempted and
only scheduled again in the next cycle. Not to mention the fact that
it may not run on the CPU to be updated, so it would need to use
something like smp_call_function_single() for the update and that
would complicate things even more.

Overall, I don't really think that doing the update directly from here
in the "fast switch" case would improve things much latency-wise and
it would increase complexity and introduce overhead into the fast
path. So this really is a tradeoff and the current choice is the
right one IMO.

Thanks,
Rafael

2016-03-28 05:31:39

by Viresh Kumar

Subject: Re: [PATCH v6 1/7][Resend] cpufreq: sched: Helpers to add and remove update_util hooks

On 22-03-16, 02:46, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> Replace the single helper for adding and removing cpufreq utilization
> update hooks, cpufreq_set_update_util_data(), with a pair of helpers,
> cpufreq_add_update_util_hook() and cpufreq_remove_update_util_hook(),
> and modify the users of cpufreq_set_update_util_data() accordingly.
>
> With the new helpers, the code using them doesn't need to worry
> about the internals of struct update_util_data and in particular
> it doesn't need to worry about populating the func field in it
> properly upfront.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---

Acked-by: Viresh Kumar <[email protected]>

--
viresh

2016-03-28 05:35:11

by Viresh Kumar

Subject: Re: [PATCH v6 5/7][Resend] cpufreq: Move governor symbols to cpufreq.h

On 22-03-16, 02:51, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> Move definitions of symbols related to transition latency and
> sampling rate to include/linux/cpufreq.h so they can be used by
> (future) goverernors located outside of drivers/cpufreq/.

s/goverernors/governors

>
> No functional changes.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
>
> This patch was new in v4, no changes since then.
>
> ---
> drivers/cpufreq/cpufreq_governor.h | 14 --------------
> include/linux/cpufreq.h | 14 ++++++++++++++
> 2 files changed, 14 insertions(+), 14 deletions(-)

Acked-by: Viresh Kumar <[email protected]>

--
viresh

2016-03-28 06:28:08

by Viresh Kumar

Subject: Re: [PATCH v6 6/7][Resend] cpufreq: Support for fast frequency switching

Sorry for jumping in late, was busy with other stuff and travel :(

On 22-03-16, 02:53, Rafael J. Wysocki wrote:
> Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
> +++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
> @@ -458,6 +458,43 @@ static int acpi_cpufreq_target(struct cp
> return result;
> }
>
> +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
> + unsigned int target_freq)
> +{
> + struct acpi_cpufreq_data *data = policy->driver_data;
> + struct acpi_processor_performance *perf;
> + struct cpufreq_frequency_table *entry;
> + unsigned int next_perf_state, next_freq, freq;
> +
> + /*
> + * Find the closest frequency above target_freq.
> + *
> + * The table is sorted in the reverse order with respect to the
> + * frequency and all of the entries are valid (see the initialization).
> + */
> + entry = data->freq_table;
> + do {
> + entry++;
> + freq = entry->frequency;
> + } while (freq >= target_freq && freq != CPUFREQ_TABLE_END);
> + entry--;
> + next_freq = entry->frequency;
> + next_perf_state = entry->driver_data;
> +
> + perf = to_perf_data(data);
> + if (perf->state == next_perf_state) {
> + if (unlikely(data->resume))
> + data->resume = 0;
> + else
> + return next_freq;
> + }
> +
> + data->cpu_freq_write(&perf->control_register,
> + perf->states[next_perf_state].control);
> + perf->state = next_perf_state;
> + return next_freq;
> +}
> +
> static unsigned long
> acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu)
> {
> @@ -740,6 +777,9 @@ static int acpi_cpufreq_cpu_init(struct
> goto err_unreg;
> }
>
> + policy->fast_switch_possible = !acpi_pstate_strict &&
> + !(policy_is_shared(policy) && policy->shared_type != CPUFREQ_SHARED_TYPE_ANY);
> +
> data->freq_table = kzalloc(sizeof(*data->freq_table) *
> (perf->state_count+1), GFP_KERNEL);
> if (!data->freq_table) {
> @@ -874,6 +914,7 @@ static struct freq_attr *acpi_cpufreq_at
> static struct cpufreq_driver acpi_cpufreq_driver = {
> .verify = cpufreq_generic_frequency_table_verify,
> .target_index = acpi_cpufreq_target,
> + .fast_switch = acpi_cpufreq_fast_switch,
> .bios_limit = acpi_processor_get_bios_limit,
> .init = acpi_cpufreq_cpu_init,
> .exit = acpi_cpufreq_cpu_exit,
> Index: linux-pm/include/linux/cpufreq.h
> ===================================================================
> --- linux-pm.orig/include/linux/cpufreq.h
> +++ linux-pm/include/linux/cpufreq.h
> @@ -102,6 +102,10 @@ struct cpufreq_policy {
> */
> struct rw_semaphore rwsem;
>
> + /* Fast switch flags */
> + bool fast_switch_possible; /* Set by the driver. */
> + bool fast_switch_enabled;
> +
> /* Synchronization for frequency transitions */
> bool transition_ongoing; /* Tracks transition status */
> spinlock_t transition_lock;
> @@ -156,6 +160,7 @@ int cpufreq_get_policy(struct cpufreq_po
> int cpufreq_update_policy(unsigned int cpu);
> bool have_governor_per_policy(void);
> struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
> +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy);
> #else
> static inline unsigned int cpufreq_get(unsigned int cpu)
> {
> @@ -236,6 +241,8 @@ struct cpufreq_driver {
> unsigned int relation); /* Deprecated */
> int (*target_index)(struct cpufreq_policy *policy,
> unsigned int index);
> + unsigned int (*fast_switch)(struct cpufreq_policy *policy,
> + unsigned int target_freq);
> /*
> * Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION
> * unset.
> @@ -464,6 +471,8 @@ struct cpufreq_governor {
> };
>
> /* Pass a target to the cpufreq driver */
> +unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
> + unsigned int target_freq);
> int cpufreq_driver_target(struct cpufreq_policy *policy,
> unsigned int target_freq,
> unsigned int relation);
> Index: linux-pm/drivers/cpufreq/cpufreq.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/cpufreq.c
> +++ linux-pm/drivers/cpufreq/cpufreq.c
> @@ -428,6 +428,57 @@ void cpufreq_freq_transition_end(struct
> }
> EXPORT_SYMBOL_GPL(cpufreq_freq_transition_end);
>
> +/*
> + * Fast frequency switching status count. Positive means "enabled", negative
> + * means "disabled" and 0 means "not decided yet".
> + */
> +static int cpufreq_fast_switch_count;
> +static DEFINE_MUTEX(cpufreq_fast_switch_lock);
> +
> +static void cpufreq_list_transition_notifiers(void)
> +{
> + struct notifier_block *nb;
> +
> + pr_info("cpufreq: Registered transition notifiers:\n");
> +
> + mutex_lock(&cpufreq_transition_notifier_list.mutex);
> +
> + for (nb = cpufreq_transition_notifier_list.head; nb; nb = nb->next)
> + pr_info("cpufreq: %pF\n", nb->notifier_call);
> +
> + mutex_unlock(&cpufreq_transition_notifier_list.mutex);

This will get printed as:

cpufreq: cpufreq: Registered transition notifiers:
cpufreq: cpufreq: <func>+0x0/0x<address>
cpufreq: cpufreq: <func>+0x0/0x<address>
cpufreq: cpufreq: <func>+0x0/0x<address>

Maybe we want something like:
cpufreq: Registered transition notifiers:
cpufreq: <func>+0x0/0x<address>
cpufreq: <func>+0x0/0x<address>
cpufreq: <func>+0x0/0x<address>

?

> +}
> +
> +/**
> + * cpufreq_enable_fast_switch - Enable fast frequency switching for policy.
> + * @policy: cpufreq policy to enable fast frequency switching for.
> + *
> + * Try to enable fast frequency switching for @policy.
> + *
> + * The attempt will fail if there is at least one transition notifier registered
> + * at this point, as fast frequency switching is quite fundamentally at odds
> + * with transition notifiers. Thus if successful, it will make registration of
> + * transition notifiers fail going forward.
> + */
> +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy)
> +{
> + lockdep_assert_held(&policy->rwsem);
> +
> + if (!policy->fast_switch_possible)
> + return;
> +
> + mutex_lock(&cpufreq_fast_switch_lock);
> + if (cpufreq_fast_switch_count >= 0) {
> + cpufreq_fast_switch_count++;
> + policy->fast_switch_enabled = true;
> + } else {
> + pr_warn("cpufreq: CPU%u: Fast frequency switching not enabled\n",
> + policy->cpu);
> + cpufreq_list_transition_notifiers();
> + }
> + mutex_unlock(&cpufreq_fast_switch_lock);
> +}
> +EXPORT_SYMBOL_GPL(cpufreq_enable_fast_switch);

And why don't we have support for disabling fast switching? What if we
switch to the schedutil governor (from userspace) and then back to ondemand? We
don't call policy->exit for that.

> /*********************************************************************
> * SYSFS INTERFACE *
> @@ -1083,6 +1134,24 @@ static void cpufreq_policy_free(struct c
> kfree(policy);
> }
>
> +static void cpufreq_driver_exit_policy(struct cpufreq_policy *policy)
> +{
> + if (policy->fast_switch_enabled) {

Shouldn't this be accessed from within the lock as well?

> + mutex_lock(&cpufreq_fast_switch_lock);
> +
> + policy->fast_switch_enabled = false;
> + if (!WARN_ON(cpufreq_fast_switch_count <= 0))
> + cpufreq_fast_switch_count--;

Shouldn't we make it more efficient and write it as:

WARN_ON(cpufreq_fast_switch_count <= 0);
policy->fast_switch_enabled = false;
cpufreq_fast_switch_count--;

The WARN check will hold true only for a major bug somewhere in the core and we
shall *never* hit it.

> + mutex_unlock(&cpufreq_fast_switch_lock);
> + }
> +
> + if (cpufreq_driver->exit) {
> + cpufreq_driver->exit(policy);
> + policy->freq_table = NULL;
> + }
> +}
> +
> static int cpufreq_online(unsigned int cpu)
> {
> struct cpufreq_policy *policy;
> @@ -1236,8 +1305,7 @@ static int cpufreq_online(unsigned int c
> out_exit_policy:
> up_write(&policy->rwsem);
>
> - if (cpufreq_driver->exit)
> - cpufreq_driver->exit(policy);
> + cpufreq_driver_exit_policy(policy);
> out_free_policy:
> cpufreq_policy_free(policy, !new_policy);
> return ret;
> @@ -1334,10 +1402,7 @@ static void cpufreq_offline(unsigned int
> * since this is a core component, and is essential for the
> * subsequent light-weight ->init() to succeed.
> */
> - if (cpufreq_driver->exit) {
> - cpufreq_driver->exit(policy);
> - policy->freq_table = NULL;
> - }
> + cpufreq_driver_exit_policy(policy);
>
> unlock:
> up_write(&policy->rwsem);
> @@ -1452,8 +1517,12 @@ static unsigned int __cpufreq_get(struct
>
> ret_freq = cpufreq_driver->get(policy->cpu);
>
> - /* Updating inactive policies is invalid, so avoid doing that. */
> - if (unlikely(policy_is_inactive(policy)))
> + /*
> + * Updating inactive policies is invalid, so avoid doing that. Also
> + * if fast frequency switching is used with the given policy, the check
> + * against policy->cur is pointless, so skip it in that case too.
> + */
> + if (unlikely(policy_is_inactive(policy)) || policy->fast_switch_enabled)
> return ret_freq;
>
> if (ret_freq && policy->cur &&
> @@ -1465,7 +1534,6 @@ static unsigned int __cpufreq_get(struct
> schedule_work(&policy->update);
> }
> }
> -

Unrelated change ? And to me it looks better with the blank line ..

> return ret_freq;
> }
>
> @@ -1672,8 +1740,18 @@ int cpufreq_register_notifier(struct not
>
> switch (list) {
> case CPUFREQ_TRANSITION_NOTIFIER:
> + mutex_lock(&cpufreq_fast_switch_lock);
> +
> + if (cpufreq_fast_switch_count > 0) {
> + mutex_unlock(&cpufreq_fast_switch_lock);
> + return -EBUSY;
> + }
> ret = srcu_notifier_chain_register(
> &cpufreq_transition_notifier_list, nb);
> + if (!ret)
> + cpufreq_fast_switch_count--;
> +
> + mutex_unlock(&cpufreq_fast_switch_lock);
> break;
> case CPUFREQ_POLICY_NOTIFIER:
> ret = blocking_notifier_chain_register(
> @@ -1706,8 +1784,14 @@ int cpufreq_unregister_notifier(struct n
>
> switch (list) {
> case CPUFREQ_TRANSITION_NOTIFIER:
> + mutex_lock(&cpufreq_fast_switch_lock);
> +
> ret = srcu_notifier_chain_unregister(
> &cpufreq_transition_notifier_list, nb);
> + if (!ret && !WARN_ON(cpufreq_fast_switch_count >= 0))
> + cpufreq_fast_switch_count++;

Again here, why shouldn't we write it as:

WARN_ON(cpufreq_fast_switch_count >= 0);

if (!ret)
cpufreq_fast_switch_count++;

> +
> + mutex_unlock(&cpufreq_fast_switch_lock);
> break;
> case CPUFREQ_POLICY_NOTIFIER:
> ret = blocking_notifier_chain_unregister(
> @@ -1726,6 +1810,34 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie
> * GOVERNORS *
> *********************************************************************/
>
> +/**
> + * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch.
> + * @policy: cpufreq policy to switch the frequency for.
> + * @target_freq: New frequency to set (may be approximate).
> + *
> + * Carry out a fast frequency switch from interrupt context.
> + *
> + * The driver's ->fast_switch() callback invoked by this function is expected to
> + * select the minimum available frequency greater than or equal to @target_freq
> + * (CPUFREQ_RELATION_L).
> + *
> + * This function must not be called if policy->fast_switch_enabled is unset.
> + *
> + * Governors calling this function must guarantee that it will never be invoked
> + * twice in parallel for the same policy and that it will never be called in
> + * parallel with either ->target() or ->target_index() for the same policy.
> + *
> + * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch()
> + * callback to indicate an error condition, the hardware configuration must be
> + * preserved.
> + */
> +unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
> + unsigned int target_freq)
> +{
> + return cpufreq_driver->fast_switch(policy, target_freq);
> +}
> +EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch);
> +
> /* Must set freqs->new to intermediate frequency */
> static int __target_intermediate(struct cpufreq_policy *policy,
> struct cpufreq_freqs *freqs, int index)

--
viresh

2016-03-28 07:03:54

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH v6 6/7][Resend] cpufreq: Support for fast frequency switching

forgot to review acpi update earlier ..

On 22-03-16, 02:53, Rafael J. Wysocki wrote:
> Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
> +++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
> @@ -458,6 +458,43 @@ static int acpi_cpufreq_target(struct cp
> return result;
> }
>
> +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
> + unsigned int target_freq)
> +{
> + struct acpi_cpufreq_data *data = policy->driver_data;
> + struct acpi_processor_performance *perf;
> + struct cpufreq_frequency_table *entry;
> + unsigned int next_perf_state, next_freq, freq;
> +
> + /*
> + * Find the closest frequency above target_freq.
> + *
> + * The table is sorted in the reverse order with respect to the
> + * frequency and all of the entries are valid (see the initialization).
> + */
> + entry = data->freq_table;
> + do {
> + entry++;
> + freq = entry->frequency;
> + } while (freq >= target_freq && freq != CPUFREQ_TABLE_END);

Consider this table:

11000
10000
9000

And a target-freq of 10000.

Wouldn't you end up selecting 11000 ? Or did I misread it ?

> + entry--;
> + next_freq = entry->frequency;
> + next_perf_state = entry->driver_data;
> +
> + perf = to_perf_data(data);
> + if (perf->state == next_perf_state) {
> + if (unlikely(data->resume))
> + data->resume = 0;
> + else
> + return next_freq;
> + }
> +
> + data->cpu_freq_write(&perf->control_register,
> + perf->states[next_perf_state].control);
> + perf->state = next_perf_state;
> + return next_freq;
> +}
> +
> static unsigned long
> acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu)
> {
> @@ -740,6 +777,9 @@ static int acpi_cpufreq_cpu_init(struct
> goto err_unreg;
> }
>
> + policy->fast_switch_possible = !acpi_pstate_strict &&
> + !(policy_is_shared(policy) && policy->shared_type != CPUFREQ_SHARED_TYPE_ANY);
> +
> data->freq_table = kzalloc(sizeof(*data->freq_table) *
> (perf->state_count+1), GFP_KERNEL);
> if (!data->freq_table) {
> @@ -874,6 +914,7 @@ static struct freq_attr *acpi_cpufreq_at
> static struct cpufreq_driver acpi_cpufreq_driver = {
> .verify = cpufreq_generic_frequency_table_verify,
> .target_index = acpi_cpufreq_target,
> + .fast_switch = acpi_cpufreq_fast_switch,
> .bios_limit = acpi_processor_get_bios_limit,
> .init = acpi_cpufreq_cpu_init,
> .exit = acpi_cpufreq_cpu_exit,

--
viresh

2016-03-28 09:03:43

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH v6 7/7][Resend] cpufreq: schedutil: New governor based on scheduler utilization data

On 22-03-16, 02:54, Rafael J. Wysocki wrote:
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===================================================================
> --- /dev/null
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -0,0 +1,528 @@
> +/*
> + * CPUFreq governor based on scheduler-provided CPU utilization data.
> + *
> + * Copyright (C) 2016, Intel Corporation
> + * Author: Rafael J. Wysocki <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/cpufreq.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <trace/events/power.h>
> +
> +#include "sched.h"
> +
> +struct sugov_tunables {
> + struct gov_attr_set attr_set;
> + unsigned int rate_limit_us;
> +};
> +
> +struct sugov_policy {
> + struct cpufreq_policy *policy;
> +
> + struct sugov_tunables *tunables;
> + struct list_head tunables_hook;
> +
> + raw_spinlock_t update_lock; /* For shared policies */
> + u64 last_freq_update_time;
> + s64 freq_update_delay_ns;

And why isn't it part of sugov_tunables? Its gonna be same for all policies
sharing tunables ..

> + unsigned int next_freq;
> +
> + /* The next fields are only needed if fast switch cannot be used. */
> + struct irq_work irq_work;
> + struct work_struct work;
> + struct mutex work_lock;
> + bool work_in_progress;
> +
> + bool need_freq_update;
> +};
> +
> +struct sugov_cpu {
> + struct update_util_data update_util;
> + struct sugov_policy *sg_policy;
> +
> + /* The fields below are only needed when sharing a policy. */
> + unsigned long util;
> + unsigned long max;
> + u64 last_update;
> +};
> +
> +static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> +
> +/************************ Governor internals ***********************/
> +
> +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)

To make its purpose clear, maybe name it sugov_should_reevaluate_freq(),
because we aren't updating the freq just yet, but deciding whether we need to
reevaluate it.

As it's going to be called from the hot path, maybe mark it as inline and let the
compiler decide?

> +{
> + u64 delta_ns;
> +
> + if (sg_policy->work_in_progress)
> + return false;
> +
> + if (unlikely(sg_policy->need_freq_update)) {
> + sg_policy->need_freq_update = false;
> + return true;
> + }
> +
> + delta_ns = time - sg_policy->last_freq_update_time;
> + return (s64)delta_ns >= sg_policy->freq_update_delay_ns;
> +}
> +
> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,

Maybe sugov_update_freq() ?

> + unsigned int next_freq)
> +{
> + struct cpufreq_policy *policy = sg_policy->policy;
> +
> + sg_policy->last_freq_update_time = time;
> +
> + if (policy->fast_switch_enabled) {
> + if (next_freq > policy->max)
> + next_freq = policy->max;
> + else if (next_freq < policy->min)
> + next_freq = policy->min;
> +
> + if (sg_policy->next_freq == next_freq) {
> + trace_cpu_frequency(policy->cur, smp_processor_id());
> + return;
> + }
> + sg_policy->next_freq = next_freq;

Why not do all of the above as part of the else block as well, and move it
before the if {} block?

> + next_freq = cpufreq_driver_fast_switch(policy, next_freq);
> + if (next_freq == CPUFREQ_ENTRY_INVALID)
> + return;
> +
> + policy->cur = next_freq;
> + trace_cpu_frequency(next_freq, smp_processor_id());
> + } else if (sg_policy->next_freq != next_freq) {
> + sg_policy->next_freq = next_freq;
> + sg_policy->work_in_progress = true;
> + irq_work_queue(&sg_policy->irq_work);
> + }
> +}
> +
> +/**
> + * get_next_freq - Compute a new frequency for a given cpufreq policy.
> + * @policy: cpufreq policy object to compute the new frequency for.
> + * @util: Current CPU utilization.
> + * @max: CPU capacity.
> + *
> + * If the utilization is frequency-invariant, choose the new frequency to be
> + * proportional to it, that is
> + *
> + * next_freq = C * max_freq * util / max
> + *
> + * Otherwise, approximate the would-be frequency-invariant utilization by
> + * util_raw * (curr_freq / max_freq) which leads to
> + *
> + * next_freq = C * curr_freq * util_raw / max
> + *
> + * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8.
> + */
> +static unsigned int get_next_freq(struct cpufreq_policy *policy,
> + unsigned long util, unsigned long max)
> +{
> + unsigned int freq = arch_scale_freq_invariant() ?
> + policy->cpuinfo.max_freq : policy->cur;
> +
> + return (freq + (freq >> 2)) * util / max;
> +}
> +
> +static void sugov_update_single(struct update_util_data *hook, u64 time,
> + unsigned long util, unsigned long max)
> +{
> + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
> + struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> + struct cpufreq_policy *policy = sg_policy->policy;
> + unsigned int next_f;
> +
> + if (!sugov_should_update_freq(sg_policy, time))
> + return;
> +
> + next_f = util == ULONG_MAX ? policy->cpuinfo.max_freq :
> + get_next_freq(policy, util, max);
> + sugov_update_commit(sg_policy, time, next_f);
> +}
> +
> +static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
> + unsigned long util, unsigned long max)
> +{
> + struct cpufreq_policy *policy = sg_policy->policy;
> + unsigned int max_f = policy->cpuinfo.max_freq;
> + u64 last_freq_update_time = sg_policy->last_freq_update_time;
> + unsigned int j;
> +
> + if (util == ULONG_MAX)
> + return max_f;
> +
> + for_each_cpu(j, policy->cpus) {
> + struct sugov_cpu *j_sg_cpu;
> + unsigned long j_util, j_max;
> + u64 delta_ns;
> +
> + if (j == smp_processor_id())
> + continue;

Why skip the local CPU completely? And if we really want to do that, what about
something like for_each_cpu_and_not to kill the unnecessary if {} statement?

> +
> + j_sg_cpu = &per_cpu(sugov_cpu, j);
> + /*
> + * If the CPU utilization was last updated before the previous
> + * frequency update and the time elapsed between the last update
> + * of the CPU utilization and the last frequency update is long
> + * enough, don't take the CPU into account as it probably is
> + * idle now.
> + */
> + delta_ns = last_freq_update_time - j_sg_cpu->last_update;
> + if ((s64)delta_ns > TICK_NSEC)
> + continue;
> +
> + j_util = j_sg_cpu->util;
> + if (j_util == ULONG_MAX)
> + return max_f;
> +
> + j_max = j_sg_cpu->max;
> + if (j_util * max > j_max * util) {
> + util = j_util;
> + max = j_max;
> + }
> + }
> +
> + return get_next_freq(policy, util, max);
> +}
> +
> +static void sugov_update_shared(struct update_util_data *hook, u64 time,
> + unsigned long util, unsigned long max)
> +{
> + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
> + struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> + unsigned int next_f;
> +
> + raw_spin_lock(&sg_policy->update_lock);
> +
> + sg_cpu->util = util;
> + sg_cpu->max = max;
> + sg_cpu->last_update = time;
> +
> + if (sugov_should_update_freq(sg_policy, time)) {
> + next_f = sugov_next_freq_shared(sg_policy, util, max);
> + sugov_update_commit(sg_policy, time, next_f);
> + }
> +
> + raw_spin_unlock(&sg_policy->update_lock);
> +}
> +
> +static void sugov_work(struct work_struct *work)
> +{
> + struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
> +
> + mutex_lock(&sg_policy->work_lock);
> + __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
> + CPUFREQ_RELATION_L);
> + mutex_unlock(&sg_policy->work_lock);
> +
> + sg_policy->work_in_progress = false;
> +}
> +
> +static void sugov_irq_work(struct irq_work *irq_work)
> +{
> + struct sugov_policy *sg_policy;
> +
> + sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
> + schedule_work_on(smp_processor_id(), &sg_policy->work);
> +}
> +
> +/************************** sysfs interface ************************/
> +
> +static struct sugov_tunables *global_tunables;
> +static DEFINE_MUTEX(global_tunables_lock);
> +
> +static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set)
> +{
> + return container_of(attr_set, struct sugov_tunables, attr_set);
> +}
> +
> +static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
> +{
> + struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
> +
> + return sprintf(buf, "%u\n", tunables->rate_limit_us);
> +}
> +
> +static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf,
> + size_t count)
> +{
> + struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
> + struct sugov_policy *sg_policy;
> + unsigned int rate_limit_us;
> + int ret;
> +
> + ret = sscanf(buf, "%u", &rate_limit_us);

checkpatch warns for this; we should be using kstrtou32() here.

> + if (ret != 1)
> + return -EINVAL;
> +
> + tunables->rate_limit_us = rate_limit_us;
> +
> + list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
> + sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
> +
> + return count;
> +}
> +
> +static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);

Why not reuse gov_attr_rw() ?

> +
> +static struct attribute *sugov_attributes[] = {
> + &rate_limit_us.attr,
> + NULL
> +};
> +
> +static struct kobj_type sugov_tunables_ktype = {
> + .default_attrs = sugov_attributes,
> + .sysfs_ops = &governor_sysfs_ops,
> +};
> +
> +/********************** cpufreq governor interface *********************/
> +
> +static struct cpufreq_governor schedutil_gov;
> +
> +static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
> +{
> + struct sugov_policy *sg_policy;
> +
> + sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
> + if (!sg_policy)
> + return NULL;
> +
> + sg_policy->policy = policy;
> + init_irq_work(&sg_policy->irq_work, sugov_irq_work);
> + INIT_WORK(&sg_policy->work, sugov_work);
> + mutex_init(&sg_policy->work_lock);
> + raw_spin_lock_init(&sg_policy->update_lock);
> + return sg_policy;
> +}
> +
> +static void sugov_policy_free(struct sugov_policy *sg_policy)
> +{
> + mutex_destroy(&sg_policy->work_lock);
> + kfree(sg_policy);
> +}
> +
> +static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy)
> +{
> + struct sugov_tunables *tunables;
> +
> + tunables = kzalloc(sizeof(*tunables), GFP_KERNEL);
> + if (tunables)
> + gov_attr_set_init(&tunables->attr_set, &sg_policy->tunables_hook);
> +
> + return tunables;
> +}
> +
> +static void sugov_tunables_free(struct sugov_tunables *tunables)
> +{
> + if (!have_governor_per_policy())
> + global_tunables = NULL;
> +
> + kfree(tunables);
> +}
> +
> +static int sugov_init(struct cpufreq_policy *policy)
> +{
> + struct sugov_policy *sg_policy;
> + struct sugov_tunables *tunables;
> + unsigned int lat;
> + int ret = 0;
> +
> + /* State should be equivalent to EXIT */
> + if (policy->governor_data)
> + return -EBUSY;
> +
> + sg_policy = sugov_policy_alloc(policy);
> + if (!sg_policy)
> + return -ENOMEM;
> +
> + mutex_lock(&global_tunables_lock);
> +
> + if (global_tunables) {
> + if (WARN_ON(have_governor_per_policy())) {
> + ret = -EINVAL;
> + goto free_sg_policy;
> + }
> + policy->governor_data = sg_policy;
> + sg_policy->tunables = global_tunables;
> +
> + gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook);
> + goto out;
> + }
> +
> + tunables = sugov_tunables_alloc(sg_policy);
> + if (!tunables) {
> + ret = -ENOMEM;
> + goto free_sg_policy;
> + }
> +
> + tunables->rate_limit_us = LATENCY_MULTIPLIER;
> + lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
> + if (lat)
> + tunables->rate_limit_us *= lat;
> +
> + if (!have_governor_per_policy())
> + global_tunables = tunables;

To make sugov_tunables_alloc/free() symmetric to each other, should we move the
above into sugov_tunables_alloc()?

> +
> + policy->governor_data = sg_policy;
> + sg_policy->tunables = tunables;
> +
> + ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
> + get_governor_parent_kobj(policy), "%s",
> + schedutil_gov.name);
> + if (!ret)
> + goto out;
> +
> + /* Failure, so roll back. */
> + policy->governor_data = NULL;
> + sugov_tunables_free(tunables);
> +
> + free_sg_policy:
> + pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
> + sugov_policy_free(sg_policy);

I didn't like the way we have mixed the success and failure paths here, just to
save a single line of code (unlock).

Moreover, it does things that aren't symmetric anymore. For example, we have
called sugov_policy_alloc() without locks and are freeing it from within locks.

> +
> + out:
> + mutex_unlock(&global_tunables_lock);
> + return ret;
> +}
> +
> +static int sugov_exit(struct cpufreq_policy *policy)
> +{
> + struct sugov_policy *sg_policy = policy->governor_data;
> + struct sugov_tunables *tunables = sg_policy->tunables;
> + unsigned int count;
> +
> + mutex_lock(&global_tunables_lock);
> +
> + count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook);
> + policy->governor_data = NULL;
> + if (!count)
> + sugov_tunables_free(tunables);
> +
> + mutex_unlock(&global_tunables_lock);
> +
> + sugov_policy_free(sg_policy);
> + return 0;
> +}
> +
> +static int sugov_start(struct cpufreq_policy *policy)
> +{
> + struct sugov_policy *sg_policy = policy->governor_data;
> + unsigned int cpu;
> +
> + cpufreq_enable_fast_switch(policy);

Why should we be doing this from START, which gets called a lot compared to
INIT/EXIT? This is something which should be moved to INIT IMHO.

> + sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
> + sg_policy->last_freq_update_time = 0;
> + sg_policy->next_freq = UINT_MAX;
> + sg_policy->work_in_progress = false;
> + sg_policy->need_freq_update = false;
> +
> + for_each_cpu(cpu, policy->cpus) {
> + struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
> +
> + sg_cpu->sg_policy = sg_policy;
> + if (policy_is_shared(policy)) {
> + sg_cpu->util = ULONG_MAX;
> + sg_cpu->max = 0;
> + sg_cpu->last_update = 0;
> + cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
> + sugov_update_shared);
> + } else {
> + cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
> + sugov_update_single);
> + }
> + }
> + return 0;
> +}
> +
> +static int sugov_stop(struct cpufreq_policy *policy)
> +{
> + struct sugov_policy *sg_policy = policy->governor_data;
> + unsigned int cpu;
> +
> + for_each_cpu(cpu, policy->cpus)
> + cpufreq_remove_update_util_hook(cpu);
> +
> + synchronize_sched();
> +
> + irq_work_sync(&sg_policy->irq_work);
> + cancel_work_sync(&sg_policy->work);

And again, we should have a disable-fast-switch as well..

> + return 0;
> +}
> +
> +static int sugov_limits(struct cpufreq_policy *policy)
> +{
> + struct sugov_policy *sg_policy = policy->governor_data;
> +
> + if (!policy->fast_switch_enabled) {
> + mutex_lock(&sg_policy->work_lock);
> +
> + if (policy->max < policy->cur)
> + __cpufreq_driver_target(policy, policy->max,
> + CPUFREQ_RELATION_H);
> + else if (policy->min > policy->cur)
> + __cpufreq_driver_target(policy, policy->min,
> + CPUFREQ_RELATION_L);
> +
> + mutex_unlock(&sg_policy->work_lock);

Maybe we can try to take the lock only if we are going to switch the freq, i.e. only
if sugov_limits is called for a policy->min/max update?

i.e.

void __sugov_limits(policy, freq, relation)
{
mutex_lock(&sg_policy->work_lock);
__cpufreq_driver_target(policy, freq, relation);
mutex_unlock(&sg_policy->work_lock);
}

static int sugov_limits(struct cpufreq_policy *policy)
{
struct sugov_policy *sg_policy = policy->governor_data;

if (!policy->fast_switch_enabled) {
if (policy->max < policy->cur)
__sugov_limits(policy, policy->max, CPUFREQ_RELATION_H);
else if (policy->min > policy->cur)
__sugov_limits(policy, policy->min, CPUFREQ_RELATION_L);
}

sg_policy->need_freq_update = true;
return 0;
}

??

And maybe the same for the current governors? (Of course in a separate patch; I can
do that if you want.)


Also, why not just always do 'sg_policy->need_freq_update = true' from this
routine and remove everything else? It will be taken care of on the next evaluation.

> +
> +int sugov_governor(struct cpufreq_policy *policy, unsigned int event)
> +{
> + if (event == CPUFREQ_GOV_POLICY_INIT) {
> + return sugov_init(policy);
> + } else if (policy->governor_data) {
> + switch (event) {
> + case CPUFREQ_GOV_POLICY_EXIT:
> + return sugov_exit(policy);
> + case CPUFREQ_GOV_START:
> + return sugov_start(policy);
> + case CPUFREQ_GOV_STOP:
> + return sugov_stop(policy);
> + case CPUFREQ_GOV_LIMITS:
> + return sugov_limits(policy);
> + }
> + }
> + return -EINVAL;
> +}
> +
> +static struct cpufreq_governor schedutil_gov = {
> + .name = "schedutil",
> + .governor = sugov_governor,
> + .owner = THIS_MODULE,
> +};
> +
> +static int __init sugov_module_init(void)
> +{
> + return cpufreq_register_governor(&schedutil_gov);
> +}
> +
> +static void __exit sugov_module_exit(void)
> +{
> + cpufreq_unregister_governor(&schedutil_gov);
> +}
> +
> +MODULE_AUTHOR("Rafael J. Wysocki <[email protected]>");
> +MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
> +MODULE_LICENSE("GPL");

Maybe a MODULE_ALIAS as well ?

--
viresh

2016-03-28 16:47:58

by Steve Muckle

[permalink] [raw]
Subject: Re: [PATCH v6 6/7][Resend] cpufreq: Support for fast frequency switching

On 03/25/2016 06:46 PM, Rafael J. Wysocki wrote:
>>> @@ -1726,6 +1810,34 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie
>>> >> * GOVERNORS *
>>> >> *********************************************************************/
>>> >>
>>> >> +/**
>>> >> + * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch.
>>> >> + * @policy: cpufreq policy to switch the frequency for.
>>> >> + * @target_freq: New frequency to set (may be approximate).
>>> >> + *
>>> >> + * Carry out a fast frequency switch from interrupt context.
>> >
>> > I think that should say atomic rather than interrupt as this might not
>> > be called from interrupt context.
>
> "Interrupt context" here means something like "context that cannot
> sleep" and it's sort of a traditional way of calling that. I
> considered saying "atomic context" here, but then decided that it
> might suggest too much.
>
> Maybe something like "Carry out a fast frequency switch without
> sleeping" would be better?

Yes I do think that's preferable. I also wonder if it makes sense to
state expectations of how long the operation should take - i.e. not only
will it not sleep, but it is expected to complete "quickly." However I
accept that it is not well defined what that means. Maybe a mention that
this may be called in scheduler hot paths would help.

2016-03-28 18:17:49

by Steve Muckle

[permalink] [raw]
Subject: Re: [PATCH v6 7/7][Resend] cpufreq: schedutil: New governor based on scheduler utilization data

On 03/26/2016 06:36 PM, Rafael J. Wysocki wrote:
>>>> +static int sugov_limits(struct cpufreq_policy *policy)
>>>> >>> +{
>>>> >>> + struct sugov_policy *sg_policy = policy->governor_data;
>>>> >>> +
>>>> >>> + if (!policy->fast_switch_enabled) {
>>>> >>> + mutex_lock(&sg_policy->work_lock);
>>>> >>> +
>>>> >>> + if (policy->max < policy->cur)
>>>> >>> + __cpufreq_driver_target(policy, policy->max,
>>>> >>> + CPUFREQ_RELATION_H);
>>>> >>> + else if (policy->min > policy->cur)
>>>> >>> + __cpufreq_driver_target(policy, policy->min,
>>>> >>> + CPUFREQ_RELATION_L);
>>>> >>> +
>>>> >>> + mutex_unlock(&sg_policy->work_lock);
>>>> >>> + }
>>> >>
>>> >> Is the expectation that in the fast_switch_enabled case we should
>>> >> re-evaluate soon enough that an explicit fixup is not required here?
>> >
>> > Yes, it is.
>> >
>>> >> I'm worried as to whether that will always be true given the possible
>>> >> criticality of applying frequency limits (thermal for example).
>> >
>> > The part of the patch below that you cut actually takes care of that:
>> >
>> > sg_policy->need_freq_update = true;
>> >
>> > which causes the rate limit to be ignored essentially, so the
>> > frequency will be changed on the first update from the scheduler.

The scenario I'm contemplating is that while a CPU-intensive task is
running a thermal interrupt goes off. The driver for this thermal
interrupt responds by capping fmax. If this happens just after the tick,
it seems possible that we could wait a full tick before changing the
frequency. Given a 10ms tick it could be rather annoying for thermal
management algorithms on some platforms (I'm familiar with a few).

>> > Which also is why the min/max check is before the sg_policy->next_freq
>> > == next_freq check in sugov_update_commit().
>> >
>> > I wanted to avoid locking in the fast switch/one CPU per policy case
>> > which otherwise would be necessary just for the handling of this
>> > thing. I'd like to keep it the way it is unless it can be clearly
>> > demonstrated that it really would lead to problems in practice in a
>> > real system.
>
> Besides, even if frequency is updated directly from here in the "fast
> switch" case, that still doesn't guarantee that it will be updated
> immediately, because the task running this code may be preempted and
> only scheduled again in the next cycle.
>
> Not to mention the fact that it may not run on the CPU to be updated,
> so it would need to use something like smp_call_function_single() for
> the update and that would complicate things even more.
>
> Overall, I don't really think that doing the update directly from here
> in the "fast switch" case would improve things much latency-wise and
> it would increase complexity and introduce overhead into the fast
> path. So this really is a tradeoff and the current choice is the
> right one IMO.

On the desire to avoid locking in the fast switch/one CPU per policy
case, I wondered about whether disabling interrupts in sugov_limits()
would suffice. That's a rarely called function and I was hoping that the
update hook would already have interrupts disabled due to its being
called in scheduler paths that may do raw_spin_lock_irqsave. But I'm not
sure offhand that will always be true. If it isn't, though, then I'm not
sure what's necessarily stopping, say, the sched tick from calling the hook
while the hook is already in progress from some other path.

Agreed there would need to be some additional complexity somewhere to
get things running on the correct CPU.

Anyway I have nothing against deferring this for now.

thanks,
Steve

2016-03-29 12:08:00

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v6 6/7][Resend] cpufreq: Support for fast frequency switching

On Monday, March 28, 2016 12:33:41 PM Viresh Kumar wrote:
> forgot to review acpi update earlier ..
>
> On 22-03-16, 02:53, Rafael J. Wysocki wrote:
> > Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
> > ===================================================================
> > --- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
> > +++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
> > @@ -458,6 +458,43 @@ static int acpi_cpufreq_target(struct cp
> > return result;
> > }
> >
> > +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
> > + unsigned int target_freq)
> > +{
> > + struct acpi_cpufreq_data *data = policy->driver_data;
> > + struct acpi_processor_performance *perf;
> > + struct cpufreq_frequency_table *entry;
> > + unsigned int next_perf_state, next_freq, freq;
> > +
> > + /*
> > + * Find the closest frequency above target_freq.
> > + *
> > + * The table is sorted in the reverse order with respect to the
> > + * frequency and all of the entries are valid (see the initialization).
> > + */
> > + entry = data->freq_table;
> > + do {
> > + entry++;
> > + freq = entry->frequency;
> > + } while (freq >= target_freq && freq != CPUFREQ_TABLE_END);
>
> Consider this table:
>
> 11000
> 10000
> 9000
>
> And a target-freq of 10000.
>
> Wouldn't you end up selecting 11000 ? Or did I misread it ?

In that case the loop will break for freq = 9000 (as per the above
freq >= target_freq check), so it looks like you've misread it.

Thanks,
Rafael

2016-03-29 12:08:28

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v6 6/7][Resend] cpufreq: Support for fast frequency switching

On Monday, March 28, 2016 09:47:53 AM Steve Muckle wrote:
> On 03/25/2016 06:46 PM, Rafael J. Wysocki wrote:
> >>> @@ -1726,6 +1810,34 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie
> >>> >> * GOVERNORS *
> >>> >> *********************************************************************/
> >>> >>
> >>> >> +/**
> >>> >> + * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch.
> >>> >> + * @policy: cpufreq policy to switch the frequency for.
> >>> >> + * @target_freq: New frequency to set (may be approximate).
> >>> >> + *
> >>> >> + * Carry out a fast frequency switch from interrupt context.
> >> >
> >> > I think that should say atomic rather than interrupt as this might not
> >> > be called from interrupt context.
> >
> > "Interrupt context" here means something like "context that cannot
> > sleep" and it's sort of a traditional way of calling that. I
> > considered saying "atomic context" here, but then decided that it
> > might suggest too much.
> >
> > Maybe something like "Carry out a fast frequency switch without
> > sleeping" would be better?
>
> Yes I do think that's preferable. I also wonder if it makes sense to
> state expectations of how long the operation should take - i.e. not only
> will it not sleep, but it is expected to complete "quickly." However I
> accept that it is not well defined what that means. Maybe a mention that
> this may be called in scheduler hot paths.

OK

2016-03-29 12:21:04

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v6 7/7][Resend] cpufreq: schedutil: New governor based on scheduler utilization data

On Monday, March 28, 2016 11:17:44 AM Steve Muckle wrote:
> On 03/26/2016 06:36 PM, Rafael J. Wysocki wrote:
> >>>> +static int sugov_limits(struct cpufreq_policy *policy)
> >>>> >>> +{
> >>>> >>> + struct sugov_policy *sg_policy = policy->governor_data;
> >>>> >>> +
> >>>> >>> + if (!policy->fast_switch_enabled) {
> >>>> >>> + mutex_lock(&sg_policy->work_lock);
> >>>> >>> +
> >>>> >>> + if (policy->max < policy->cur)
> >>>> >>> + __cpufreq_driver_target(policy, policy->max,
> >>>> >>> + CPUFREQ_RELATION_H);
> >>>> >>> + else if (policy->min > policy->cur)
> >>>> >>> + __cpufreq_driver_target(policy, policy->min,
> >>>> >>> + CPUFREQ_RELATION_L);
> >>>> >>> +
> >>>> >>> + mutex_unlock(&sg_policy->work_lock);
> >>>> >>> + }
> >>> >>
> >>> >> Is the expectation that in the fast_switch_enabled case we should
> >>> >> re-evaluate soon enough that an explicit fixup is not required here?
> >> >
> >> > Yes, it is.
> >> >
> >>> >> I'm worried as to whether that will always be true given the possible
> >>> >> criticality of applying frequency limits (thermal for example).
> >> >
> >> > The part of the patch below that you cut actually takes care of that:
> >> >
> >> > sg_policy->need_freq_update = true;
> >> >
> >> > which causes the rate limit to be ignored essentially, so the
> >> > frequency will be changed on the first update from the scheduler.
>
> The scenario I'm contemplating is that while a CPU-intensive task is
> running a thermal interrupt goes off. The driver for this thermal
> interrupt responds by capping fmax. If this happens just after the tick,
> it seems possible that we could wait a full tick before changing the
> frequency. Given a 10ms tick it could be rather annoying for thermal
> management algorithms on some platforms (I'm familiar with a few).

The thermal driver has to do something like cpufreq_update_policy() then
which can only happen in process context. I'm not sure how it is possible
to guarantee any latency better than that full tick here anyway.

> >> > Which also is why the min/max check is before the sg_policy->next_freq
> >> > == next_freq check in sugov_update_commit().
> >> >
> >> > I wanted to avoid locking in the fast switch/one CPU per policy case
> >> > which otherwise would be necessary just for the handling of this
> >> > thing. I'd like to keep it the way it is unless it can be clearly
> >> > demonstrated that it really would lead to problems in practice in a
> >> > real system.
> >
> > Besides, even if frequency is updated directly from here in the "fast
> > switch" case, that still doesn't guarantee that it will be updated
> > immediately, because the task running this code may be preempted and
> > only scheduled again in the next cycle.
> >
> > Not to mention the fact that it may not run on the CPU to be updated,
> > so it would need to use something like smp_call_function_single() for
> > the update and that would complicate things even more.
> >
> > Overall, I don't really think that doing the update directly from here
> > in the "fast switch" case would improve things much latency-wise and
> > it would increase complexity and introduce overhead into the fast
> > path. So this really is a tradeoff and the current choice is the
> > right one IMO.
>
> On the desire to avoid locking in the fast switch/one CPU per policy
> case, I wondered about whether disabling interrupts in sugov_limits()
> would suffice. That's a rarely called function and I was hoping that the
> update hook would already have interrupts disabled due to its being
> called in scheduler paths that may do raw_spin_lock_irqsave. But I'm not
> sure offhand that will always be true.

It will.

That's why we can use RCU-sched in cpufreq_update_util() etc.

> If it isn't, though, then I'm not
> sure what necessarily stops, say, the sched tick from calling the hook
> while the hook is already in progress from some other path.
>
> Agreed there would need to be some additional complexity somewhere to
> get things running on the correct CPU.
>
> Anyway I have nothing against deferring this for now.

OK

Thanks,
Rafael

2016-03-29 12:29:17

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v6 6/7][Resend] cpufreq: Support for fast frequency switching

On Monday, March 28, 2016 11:57:51 AM Viresh Kumar wrote:
> Sorry for jumping in late, was busy with other stuff and travel :(
>

[cut]

> > +static void cpufreq_list_transition_notifiers(void)
> > +{
> > + struct notifier_block *nb;
> > +
> > + pr_info("cpufreq: Registered transition notifiers:\n");
> > +
> > + mutex_lock(&cpufreq_transition_notifier_list.mutex);
> > +
> > + for (nb = cpufreq_transition_notifier_list.head; nb; nb = nb->next)
> > + pr_info("cpufreq: %pF\n", nb->notifier_call);
> > +
> > + mutex_unlock(&cpufreq_transition_notifier_list.mutex);
>
> This will get printed as:
>
> cpufreq: cpufreq: Registered transition notifiers:
> cpufreq: cpufreq: <func>+0x0/0x<address>
> cpufreq: cpufreq: <func>+0x0/0x<address>
> cpufreq: cpufreq: <func>+0x0/0x<address>
>
> Maybe we want something like:
> cpufreq: Registered transition notifiers:
> cpufreq: <func>+0x0/0x<address>
> cpufreq: <func>+0x0/0x<address>
> cpufreq: <func>+0x0/0x<address>
>
> ?

You seem to be saying that pr_fmt() already has "cpufreq: " in it. Fair enough.

> > +}
> > +
> > +/**
> > + * cpufreq_enable_fast_switch - Enable fast frequency switching for policy.
> > + * @policy: cpufreq policy to enable fast frequency switching for.
> > + *
> > + * Try to enable fast frequency switching for @policy.
> > + *
> > + * The attempt will fail if there is at least one transition notifier registered
> > + * at this point, as fast frequency switching is quite fundamentally at odds
> > + * with transition notifiers. Thus if successful, it will make registration of
> > + * transition notifiers fail going forward.
> > + */
> > +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy)
> > +{
> > + lockdep_assert_held(&policy->rwsem);
> > +
> > + if (!policy->fast_switch_possible)
> > + return;
> > +
> > + mutex_lock(&cpufreq_fast_switch_lock);
> > + if (cpufreq_fast_switch_count >= 0) {
> > + cpufreq_fast_switch_count++;
> > + policy->fast_switch_enabled = true;
> > + } else {
> > + pr_warn("cpufreq: CPU%u: Fast frequency switching not enabled\n",
> > + policy->cpu);
> > + cpufreq_list_transition_notifiers();
> > + }
> > + mutex_unlock(&cpufreq_fast_switch_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(cpufreq_enable_fast_switch);
>
> And, why don't we have support for disabling fast-switch support? What if we
> switch to schedutil governor (from userspace) and then back to ondemand? We
> don't call policy->exit for that.

Disabling fast switch can be automatic depending on whether or not
fast_switch_enabled is set, but I clearly forgot about the manual governor
switch case.

It should be fine to do it before calling cpufreq_governor(_EXIT) then.


> > /*********************************************************************
> > * SYSFS INTERFACE *
> > @@ -1083,6 +1134,24 @@ static void cpufreq_policy_free(struct c
> > kfree(policy);
> > }
> >
> > +static void cpufreq_driver_exit_policy(struct cpufreq_policy *policy)
> > +{
> > + if (policy->fast_switch_enabled) {
>
> Shouldn't this be accessed from within lock as well ?
>
> > + mutex_lock(&cpufreq_fast_switch_lock);
> > +
> > + policy->fast_switch_enabled = false;
> > + if (!WARN_ON(cpufreq_fast_switch_count <= 0))
> > + cpufreq_fast_switch_count--;
>
> Shouldn't we make it more efficient and write it as:
>
> WARN_ON(cpufreq_fast_switch_count <= 0);
> policy->fast_switch_enabled = false;
> cpufreq_fast_switch_count--;
>
> The WARN check will hold true only for a major bug somewhere in the core and we
> shall *never* hit it.

The point here is to avoid the decrement as well when the WARN_ON() triggers.

> > + mutex_unlock(&cpufreq_fast_switch_lock);
> > + }
> > +
> > + if (cpufreq_driver->exit) {
> > + cpufreq_driver->exit(policy);
> > + policy->freq_table = NULL;
> > + }
> > +}
> > +
> > static int cpufreq_online(unsigned int cpu)
> > {
> > struct cpufreq_policy *policy;
> > @@ -1236,8 +1305,7 @@ static int cpufreq_online(unsigned int c
> > out_exit_policy:
> > up_write(&policy->rwsem);
> >
> > - if (cpufreq_driver->exit)
> > - cpufreq_driver->exit(policy);
> > + cpufreq_driver_exit_policy(policy);
> > out_free_policy:
> > cpufreq_policy_free(policy, !new_policy);
> > return ret;
> > @@ -1334,10 +1402,7 @@ static void cpufreq_offline(unsigned int
> > * since this is a core component, and is essential for the
> > * subsequent light-weight ->init() to succeed.
> > */
> > - if (cpufreq_driver->exit) {
> > - cpufreq_driver->exit(policy);
> > - policy->freq_table = NULL;
> > - }
> > + cpufreq_driver_exit_policy(policy);
> >
> > unlock:
> > up_write(&policy->rwsem);
> > @@ -1452,8 +1517,12 @@ static unsigned int __cpufreq_get(struct
> >
> > ret_freq = cpufreq_driver->get(policy->cpu);
> >
> > - /* Updating inactive policies is invalid, so avoid doing that. */
> > - if (unlikely(policy_is_inactive(policy)))
> > + /*
> > + * Updating inactive policies is invalid, so avoid doing that. Also
> > + * if fast frequency switching is used with the given policy, the check
> > + * against policy->cur is pointless, so skip it in that case too.
> > + */
> > + if (unlikely(policy_is_inactive(policy)) || policy->fast_switch_enabled)
> > return ret_freq;
> >
> > if (ret_freq && policy->cur &&
> > @@ -1465,7 +1534,6 @@ static unsigned int __cpufreq_get(struct
> > schedule_work(&policy->update);
> > }
> > }
> > -
>
> Unrelated change ? And to me it looks better with the blank line ..

Yes, it is unrelated.

> > return ret_freq;
> > }
> >
> > @@ -1672,8 +1740,18 @@ int cpufreq_register_notifier(struct not
> >
> > switch (list) {
> > case CPUFREQ_TRANSITION_NOTIFIER:
> > + mutex_lock(&cpufreq_fast_switch_lock);
> > +
> > + if (cpufreq_fast_switch_count > 0) {
> > + mutex_unlock(&cpufreq_fast_switch_lock);
> > + return -EBUSY;
> > + }
> > ret = srcu_notifier_chain_register(
> > &cpufreq_transition_notifier_list, nb);
> > + if (!ret)
> > + cpufreq_fast_switch_count--;
> > +
> > + mutex_unlock(&cpufreq_fast_switch_lock);
> > break;
> > case CPUFREQ_POLICY_NOTIFIER:
> > ret = blocking_notifier_chain_register(
> > @@ -1706,8 +1784,14 @@ int cpufreq_unregister_notifier(struct n
> >
> > switch (list) {
> > case CPUFREQ_TRANSITION_NOTIFIER:
> > + mutex_lock(&cpufreq_fast_switch_lock);
> > +
> > ret = srcu_notifier_chain_unregister(
> > &cpufreq_transition_notifier_list, nb);
> > + if (!ret && !WARN_ON(cpufreq_fast_switch_count >= 0))
> > + cpufreq_fast_switch_count++;
>
> Again here, why shouldn't we write it as:

And the same here again: I don't want the increment to happen if the WARN_ON()
triggers.

Thanks,
Rafael

2016-03-29 12:56:17

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v6 7/7][Resend] cpufreq: schedutil: New governor based on scheduler utilization data

On Monday, March 28, 2016 02:33:33 PM Viresh Kumar wrote:
> On 22-03-16, 02:54, Rafael J. Wysocki wrote:
> > Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> > ===================================================================
> > --- /dev/null
> > +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> > @@ -0,0 +1,528 @@
> > +/*
> > + * CPUFreq governor based on scheduler-provided CPU utilization data.
> > + *
> > + * Copyright (C) 2016, Intel Corporation
> > + * Author: Rafael J. Wysocki <[email protected]>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#include <linux/cpufreq.h>
> > +#include <linux/module.h>
> > +#include <linux/slab.h>
> > +#include <trace/events/power.h>
> > +
> > +#include "sched.h"
> > +
> > +struct sugov_tunables {
> > + struct gov_attr_set attr_set;
> > + unsigned int rate_limit_us;
> > +};
> > +
> > +struct sugov_policy {
> > + struct cpufreq_policy *policy;
> > +
> > + struct sugov_tunables *tunables;
> > + struct list_head tunables_hook;
> > +
> > + raw_spinlock_t update_lock; /* For shared policies */
> > + u64 last_freq_update_time;
> > + s64 freq_update_delay_ns;
>
> And why isn't it part of sugov_tunables?

Because it is not a tunable.

> Its gonna be same for all policies sharing tunables ..

The value will be the same, but the cacheline won't.

>
> > + unsigned int next_freq;
> > +
> > + /* The next fields are only needed if fast switch cannot be used. */
> > + struct irq_work irq_work;
> > + struct work_struct work;
> > + struct mutex work_lock;
> > + bool work_in_progress;
> > +
> > + bool need_freq_update;
> > +};
> > +
> > +struct sugov_cpu {
> > + struct update_util_data update_util;
> > + struct sugov_policy *sg_policy;
> > +
> > + /* The fields below are only needed when sharing a policy. */
> > + unsigned long util;
> > + unsigned long max;
> > + u64 last_update;
> > +};
> > +
> > +static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> > +
> > +/************************ Governor internals ***********************/
> > +
> > +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
>
> To make its purpose clear, maybe name it as: sugov_should_reevaluate_freq(),
> because we aren't updating the freq just yet, but deciding if we need to
> reevaluate again or not.

Splitting hairs anyone?

> As its going to be called from hotpath, maybe mark it as inline and let compiler
> decide ?

The compiler will make it inline if it decides it's worth it anyway.

> > +{
> > + u64 delta_ns;
> > +
> > + if (sg_policy->work_in_progress)
> > + return false;
> > +
> > + if (unlikely(sg_policy->need_freq_update)) {
> > + sg_policy->need_freq_update = false;
> > + return true;
> > + }
> > +
> > + delta_ns = time - sg_policy->last_freq_update_time;
> > + return (s64)delta_ns >= sg_policy->freq_update_delay_ns;
> > +}
> > +
> > +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>
> Maybe sugov_update_freq() ?

Can you please give up suggesting the names?

What's wrong with the original one? Is it confusing in some way or something?

> > + unsigned int next_freq)
> > +{
> > + struct cpufreq_policy *policy = sg_policy->policy;
> > +
> > + sg_policy->last_freq_update_time = time;
> > +
> > + if (policy->fast_switch_enabled) {
> > + if (next_freq > policy->max)
> > + next_freq = policy->max;
> > + else if (next_freq < policy->min)
> > + next_freq = policy->min;
> > +
> > + if (sg_policy->next_freq == next_freq) {
> > + trace_cpu_frequency(policy->cur, smp_processor_id());
> > + return;
> > + }
> > + sg_policy->next_freq = next_freq;
>
> Why not do all of above stuff as part of else block as well and move it before
> the if {} block ?

Because the trace_cpu_frequency() is needed only in the fast switch case.

> > + next_freq = cpufreq_driver_fast_switch(policy, next_freq);
> > + if (next_freq == CPUFREQ_ENTRY_INVALID)
> > + return;
> > +
> > + policy->cur = next_freq;
> > + trace_cpu_frequency(next_freq, smp_processor_id());
> > + } else if (sg_policy->next_freq != next_freq) {
> > + sg_policy->next_freq = next_freq;
> > + sg_policy->work_in_progress = true;
> > + irq_work_queue(&sg_policy->irq_work);
> > + }
> > +}
> > +
> > +/**
> > + * get_next_freq - Compute a new frequency for a given cpufreq policy.
> > + * @policy: cpufreq policy object to compute the new frequency for.
> > + * @util: Current CPU utilization.
> > + * @max: CPU capacity.
> > + *
> > + * If the utilization is frequency-invariant, choose the new frequency to be
> > + * proportional to it, that is
> > + *
> > + * next_freq = C * max_freq * util / max
> > + *
> > + * Otherwise, approximate the would-be frequency-invariant utilization by
> > + * util_raw * (curr_freq / max_freq) which leads to
> > + *
> > + * next_freq = C * curr_freq * util_raw / max
> > + *
> > + * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8.
> > + */
> > +static unsigned int get_next_freq(struct cpufreq_policy *policy,
> > + unsigned long util, unsigned long max)
> > +{
> > + unsigned int freq = arch_scale_freq_invariant() ?
> > + policy->cpuinfo.max_freq : policy->cur;
> > +
> > + return (freq + (freq >> 2)) * util / max;
> > +}
> > +
> > +static void sugov_update_single(struct update_util_data *hook, u64 time,
> > + unsigned long util, unsigned long max)
> > +{
> > + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
> > + struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> > + struct cpufreq_policy *policy = sg_policy->policy;
> > + unsigned int next_f;
> > +
> > + if (!sugov_should_update_freq(sg_policy, time))
> > + return;
> > +
> > + next_f = util == ULONG_MAX ? policy->cpuinfo.max_freq :
> > + get_next_freq(policy, util, max);
> > + sugov_update_commit(sg_policy, time, next_f);
> > +}
> > +
> > +static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
> > + unsigned long util, unsigned long max)
> > +{
> > + struct cpufreq_policy *policy = sg_policy->policy;
> > + unsigned int max_f = policy->cpuinfo.max_freq;
> > + u64 last_freq_update_time = sg_policy->last_freq_update_time;
> > + unsigned int j;
> > +
> > + if (util == ULONG_MAX)
> > + return max_f;
> > +
> > + for_each_cpu(j, policy->cpus) {
> > + struct sugov_cpu *j_sg_cpu;
> > + unsigned long j_util, j_max;
> > + u64 delta_ns;
> > +
> > + if (j == smp_processor_id())
> > + continue;
>
> Why skip local CPU completely ?

Because the original util and max come from it.

> And if we really want to do that, what about something like for_each_cpu_and_not
> to kill the unnecessary if {} statement ?

That will work.

> > +
> > + j_sg_cpu = &per_cpu(sugov_cpu, j);
> > + /*
> > + * If the CPU utilization was last updated before the previous
> > + * frequency update and the time elapsed between the last update
> > + * of the CPU utilization and the last frequency update is long
> > + * enough, don't take the CPU into account as it probably is
> > + * idle now.
> > + */
> > + delta_ns = last_freq_update_time - j_sg_cpu->last_update;
> > + if ((s64)delta_ns > TICK_NSEC)
> > + continue;
> > +
> > + j_util = j_sg_cpu->util;
> > + if (j_util == ULONG_MAX)
> > + return max_f;
> > +
> > + j_max = j_sg_cpu->max;
> > + if (j_util * max > j_max * util) {
> > + util = j_util;
> > + max = j_max;
> > + }
> > + }
> > +
> > + return get_next_freq(policy, util, max);
> > +}
> > +
> > +static void sugov_update_shared(struct update_util_data *hook, u64 time,
> > + unsigned long util, unsigned long max)
> > +{
> > + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
> > + struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> > + unsigned int next_f;
> > +
> > + raw_spin_lock(&sg_policy->update_lock);
> > +
> > + sg_cpu->util = util;
> > + sg_cpu->max = max;
> > + sg_cpu->last_update = time;
> > +
> > + if (sugov_should_update_freq(sg_policy, time)) {
> > + next_f = sugov_next_freq_shared(sg_policy, util, max);
> > + sugov_update_commit(sg_policy, time, next_f);
> > + }
> > +
> > + raw_spin_unlock(&sg_policy->update_lock);
> > +}
> > +
> > +static void sugov_work(struct work_struct *work)
> > +{
> > + struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
> > +
> > + mutex_lock(&sg_policy->work_lock);
> > + __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
> > + CPUFREQ_RELATION_L);
> > + mutex_unlock(&sg_policy->work_lock);
> > +
> > + sg_policy->work_in_progress = false;
> > +}
> > +
> > +static void sugov_irq_work(struct irq_work *irq_work)
> > +{
> > + struct sugov_policy *sg_policy;
> > +
> > + sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
> > + schedule_work_on(smp_processor_id(), &sg_policy->work);
> > +}
> > +
> > +/************************** sysfs interface ************************/
> > +
> > +static struct sugov_tunables *global_tunables;
> > +static DEFINE_MUTEX(global_tunables_lock);
> > +
> > +static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set)
> > +{
> > + return container_of(attr_set, struct sugov_tunables, attr_set);
> > +}
> > +
> > +static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
> > +{
> > + struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
> > +
> > + return sprintf(buf, "%u\n", tunables->rate_limit_us);
> > +}
> > +
> > +static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf,
> > + size_t count)
> > +{
> > + struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
> > + struct sugov_policy *sg_policy;
> > + unsigned int rate_limit_us;
> > + int ret;
> > +
> > + ret = sscanf(buf, "%u", &rate_limit_us);
>
> checkpatch warns for this, we should be using kstrtou32 here ..

Hmm. checkpatch. Oh well.

> > + if (ret != 1)
> > + return -EINVAL;
> > +
> > + tunables->rate_limit_us = rate_limit_us;
> > +
> > + list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
> > + sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
> > +
> > + return count;
> > +}
> > +
> > +static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
>
> Why not reuse gov_attr_rw() ?

Would it work?

> > +
> > +static struct attribute *sugov_attributes[] = {
> > + &rate_limit_us.attr,
> > + NULL
> > +};
> > +
> > +static struct kobj_type sugov_tunables_ktype = {
> > + .default_attrs = sugov_attributes,
> > + .sysfs_ops = &governor_sysfs_ops,
> > +};
> > +
> > +/********************** cpufreq governor interface *********************/
> > +
> > +static struct cpufreq_governor schedutil_gov;
> > +
> > +static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
> > +{
> > + struct sugov_policy *sg_policy;
> > +
> > + sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
> > + if (!sg_policy)
> > + return NULL;
> > +
> > + sg_policy->policy = policy;
> > + init_irq_work(&sg_policy->irq_work, sugov_irq_work);
> > + INIT_WORK(&sg_policy->work, sugov_work);
> > + mutex_init(&sg_policy->work_lock);
> > + raw_spin_lock_init(&sg_policy->update_lock);
> > + return sg_policy;
> > +}
> > +
> > +static void sugov_policy_free(struct sugov_policy *sg_policy)
> > +{
> > + mutex_destroy(&sg_policy->work_lock);
> > + kfree(sg_policy);
> > +}
> > +
> > +static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy)
> > +{
> > + struct sugov_tunables *tunables;
> > +
> > + tunables = kzalloc(sizeof(*tunables), GFP_KERNEL);
> > + if (tunables)
> > + gov_attr_set_init(&tunables->attr_set, &sg_policy->tunables_hook);
> > +
> > + return tunables;
> > +}
> > +
> > +static void sugov_tunables_free(struct sugov_tunables *tunables)
> > +{
> > + if (!have_governor_per_policy())
> > + global_tunables = NULL;
> > +
> > + kfree(tunables);
> > +}
> > +
> > +static int sugov_init(struct cpufreq_policy *policy)
> > +{
> > + struct sugov_policy *sg_policy;
> > + struct sugov_tunables *tunables;
> > + unsigned int lat;
> > + int ret = 0;
> > +
> > + /* State should be equivalent to EXIT */
> > + if (policy->governor_data)
> > + return -EBUSY;
> > +
> > + sg_policy = sugov_policy_alloc(policy);
> > + if (!sg_policy)
> > + return -ENOMEM;
> > +
> > + mutex_lock(&global_tunables_lock);
> > +
> > + if (global_tunables) {
> > + if (WARN_ON(have_governor_per_policy())) {
> > + ret = -EINVAL;
> > + goto free_sg_policy;
> > + }
> > + policy->governor_data = sg_policy;
> > + sg_policy->tunables = global_tunables;
> > +
> > + gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook);
> > + goto out;
> > + }
> > +
> > + tunables = sugov_tunables_alloc(sg_policy);
> > + if (!tunables) {
> > + ret = -ENOMEM;
> > + goto free_sg_policy;
> > + }
> > +
> > + tunables->rate_limit_us = LATENCY_MULTIPLIER;
> > + lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
> > + if (lat)
> > + tunables->rate_limit_us *= lat;
> > +
> > + if (!have_governor_per_policy())
> > + global_tunables = tunables;
>
> To make sugov_tunables_alloc/free() symmetric to each other, should we move
> above into sugov_tunables_alloc() ?

It doesn't matter too much, does it?

> > +
> > + policy->governor_data = sg_policy;
> > + sg_policy->tunables = tunables;
> > +
> > + ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
> > + get_governor_parent_kobj(policy), "%s",
> > + schedutil_gov.name);
> > + if (!ret)
> > + goto out;
> > +
> > + /* Failure, so roll back. */
> > + policy->governor_data = NULL;
> > + sugov_tunables_free(tunables);
> > +
> > + free_sg_policy:
> > + pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
> > + sugov_policy_free(sg_policy);
>
> I didn't like the way we have mixed success and failure path here, just to save
> a single line of code (unlock).

I don't follow, sorry. Yes, I can do unlock/return instead of the "goto out",
but then the goto label is still needed.

> Over that it does things, that aren't symmetric anymore. For example, we have
> called sugov_policy_alloc() without locks

Are you sure?

> and are freeing it from within locks.

Both are under global_tunables_lock.

> > +
> > + out:
> > + mutex_unlock(&global_tunables_lock);
> > + return ret;
> > +}
> > +
> > +static int sugov_exit(struct cpufreq_policy *policy)
> > +{
> > + struct sugov_policy *sg_policy = policy->governor_data;
> > + struct sugov_tunables *tunables = sg_policy->tunables;
> > + unsigned int count;
> > +
> > + mutex_lock(&global_tunables_lock);
> > +
> > + count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook);
> > + policy->governor_data = NULL;
> > + if (!count)
> > + sugov_tunables_free(tunables);
> > +
> > + mutex_unlock(&global_tunables_lock);
> > +
> > + sugov_policy_free(sg_policy);
> > + return 0;
> > +}
> > +
> > +static int sugov_start(struct cpufreq_policy *policy)
> > +{
> > + struct sugov_policy *sg_policy = policy->governor_data;
> > + unsigned int cpu;
> > +
> > + cpufreq_enable_fast_switch(policy);
>
> Why should we be doing this from START, which gets called a lot compared to
> INIT/EXIT? This is something which should be moved to INIT IMHO.

Yes, INIT would be a better call site.

> > + sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
> > + sg_policy->last_freq_update_time = 0;
> > + sg_policy->next_freq = UINT_MAX;
> > + sg_policy->work_in_progress = false;
> > + sg_policy->need_freq_update = false;
> > +
> > + for_each_cpu(cpu, policy->cpus) {
> > + struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
> > +
> > + sg_cpu->sg_policy = sg_policy;
> > + if (policy_is_shared(policy)) {
> > + sg_cpu->util = ULONG_MAX;
> > + sg_cpu->max = 0;
> > + sg_cpu->last_update = 0;
> > + cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
> > + sugov_update_shared);
> > + } else {
> > + cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
> > + sugov_update_single);
> > + }
> > + }
> > + return 0;
> > +}
> > +
> > +static int sugov_stop(struct cpufreq_policy *policy)
> > +{
> > + struct sugov_policy *sg_policy = policy->governor_data;
> > + unsigned int cpu;
> > +
> > + for_each_cpu(cpu, policy->cpus)
> > + cpufreq_remove_update_util_hook(cpu);
> > +
> > + synchronize_sched();
> > +
> > + irq_work_sync(&sg_policy->irq_work);
> > + cancel_work_sync(&sg_policy->work);
>
> And again, we should have a disable-fast-switch as well..

That's not necessary.

> > + return 0;
> > +}
> > +
> > +static int sugov_limits(struct cpufreq_policy *policy)
> > +{
> > + struct sugov_policy *sg_policy = policy->governor_data;
> > +
> > + if (!policy->fast_switch_enabled) {
> > + mutex_lock(&sg_policy->work_lock);
> > +
> > + if (policy->max < policy->cur)
> > + __cpufreq_driver_target(policy, policy->max,
> > + CPUFREQ_RELATION_H);
> > + else if (policy->min > policy->cur)
> > + __cpufreq_driver_target(policy, policy->min,
> > + CPUFREQ_RELATION_L);
> > +
> > + mutex_unlock(&sg_policy->work_lock);
>
> Maybe we can try to take lock only if we are going to switch the freq, i.e. only
> if sugov_limits is called for policy->min/max update?

The __cpufreq_driver_target() in sugov_work() potentially updates policy->cur
that's checked here, so I don't really think this is a good idea.

> i.e.
>
> void __sugov_limits(policy, freq, relation)
> {
> mutex_lock(&sg_policy->work_lock);
> __cpufreq_driver_target(policy, freq, relation);
> mutex_unlock(&sg_policy->work_lock);
> }
>
> static int sugov_limits(struct cpufreq_policy *policy)
> {
> struct sugov_policy *sg_policy = policy->governor_data;
>
> if (!policy->fast_switch_enabled) {
> if (policy->max < policy->cur)
> __sugov_limits(policy, policy->max, CPUFREQ_RELATION_H);
> else if (policy->min > policy->cur)
> __sugov_limits(policy, policy->min, CPUFREQ_RELATION_L);
> }
>
> sg_policy->need_freq_update = true;
> return 0;
> }
>
> ??
>
> And maybe the same for current governors? (ofcourse in a separate patch, I can
> do that if you want).
>
>
> Also, why not just always do 'sg_policy->need_freq_update = true' from this
> routine and remove everything else? It will be taken care of on next evaluation.

It would only be taken care of quickly enough in the fast switch case.

> > +
> > +int sugov_governor(struct cpufreq_policy *policy, unsigned int event)
> > +{
> > + if (event == CPUFREQ_GOV_POLICY_INIT) {
> > + return sugov_init(policy);
> > + } else if (policy->governor_data) {
> > + switch (event) {
> > + case CPUFREQ_GOV_POLICY_EXIT:
> > + return sugov_exit(policy);
> > + case CPUFREQ_GOV_START:
> > + return sugov_start(policy);
> > + case CPUFREQ_GOV_STOP:
> > + return sugov_stop(policy);
> > + case CPUFREQ_GOV_LIMITS:
> > + return sugov_limits(policy);
> > + }
> > + }
> > + return -EINVAL;
> > +}
> > +
> > +static struct cpufreq_governor schedutil_gov = {
> > + .name = "schedutil",
> > + .governor = sugov_governor,
> > + .owner = THIS_MODULE,
> > +};
> > +
> > +static int __init sugov_module_init(void)
> > +{
> > + return cpufreq_register_governor(&schedutil_gov);
> > +}
> > +
> > +static void __exit sugov_module_exit(void)
> > +{
> > + cpufreq_unregister_governor(&schedutil_gov);
> > +}
> > +
> > +MODULE_AUTHOR("Rafael J. Wysocki <[email protected]>");
> > +MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
> > +MODULE_LICENSE("GPL");
>
> Maybe a MODULE_ALIAS as well ?

Sorry, I don't follow.

Thanks,
Rafael

2016-03-29 14:21:07

by Viresh Kumar

Subject: Re: [PATCH v6 6/7][Resend] cpufreq: Support for fast frequency switching

On 29-03-16, 14:10, Rafael J. Wysocki wrote:
> In that case the loop will break for freq = 9000 (as per the above
> freq >= freq_target check), so it looks like you've misread it.

My bad ..

--
viresh

2016-03-30 01:12:45

by Rafael J. Wysocki

Subject: Re: [PATCH v6 7/7][Resend] cpufreq: schedutil: New governor based on scheduler utilization data

On Tue, Mar 29, 2016 at 2:58 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Monday, March 28, 2016 02:33:33 PM Viresh Kumar wrote:
>> On 22-03-16, 02:54, Rafael J. Wysocki wrote:

[cut]

>> > +static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
>> > + unsigned long util, unsigned long max)
>> > +{
>> > + struct cpufreq_policy *policy = sg_policy->policy;
>> > + unsigned int max_f = policy->cpuinfo.max_freq;
>> > + u64 last_freq_update_time = sg_policy->last_freq_update_time;
>> > + unsigned int j;
>> > +
>> > + if (util == ULONG_MAX)
>> > + return max_f;
>> > +
>> > + for_each_cpu(j, policy->cpus) {
>> > + struct sugov_cpu *j_sg_cpu;
>> > + unsigned long j_util, j_max;
>> > + u64 delta_ns;
>> > +
>> > + if (j == smp_processor_id())
>> > + continue;
>>
>> Why skip local CPU completely ?
>
> Because the original util and max come from it.
>
>> And if we really want to do that, what about something like for_each_cpu_and_not
>> to kill the unnecessary if {} statement ?
>
> That will work.

Except that for_each_cpu_and_not is not defined as of today.

I guess I can play with cpumasks, but then I'm not sure that would
actually end up being more efficient.

2016-03-30 01:45:32

by Rafael J. Wysocki

Subject: [Update][PATCH v7 6/7] cpufreq: Support for fast frequency switching

From: Rafael J. Wysocki <[email protected]>

Modify the ACPI cpufreq driver to provide a method for switching
CPU frequencies from interrupt context and update the cpufreq core
to support that method if available.

Introduce a new cpufreq driver callback, ->fast_switch, to be
invoked for frequency switching from interrupt context by (future)
governors supporting that feature via (new) helper function
cpufreq_driver_fast_switch().

Add two new policy flags, fast_switch_possible, to be set by the
cpufreq driver if fast frequency switching can be used for the
given policy and fast_switch_enabled, to be set by the governor
if it is going to use fast frequency switching for the given
policy. Also add a helper for setting the latter.

Since fast frequency switching is inherently incompatible with
cpufreq transition notifiers, make it possible to set the
fast_switch_enabled only if there are no transition notifiers
already registered and make the registration of new transition
notifiers fail if fast_switch_enabled is set for at least one
policy.

Implement the ->fast_switch callback in the ACPI cpufreq driver
and make it set fast_switch_possible during policy initialization
as appropriate.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

Changes from v6:
- Added cpufreq_exit_governor() that disables fast frequency switching
before calling cpufreq_governor(_EXIT) and updated the callers of the
latter to use the new function instead.
- Modified cpufreq_driver_fast_switch() to apply the limits to target_freq.
- Modified the kerneldoc comment of cpufreq_driver_fast_switch() to mention
the RCU-sched read-side critical sections requirement for ->fast_switch.
- Added a comment describing fast_switch_possible and fast_switch_enabled.
- Modified acpi-cpufreq to clear policy->fast_switch_possible when
policy->driver_data is cleared.

Changes from v5:
- cpufreq_enable_fast_switch() fixed to avoid printing a confusing message
if fast_switch_possible is not set for the policy.
- Fixed a typo in that message.
- Removed the WARN_ON() from the (cpufreq_fast_switch_count > 0) check in
cpufreq_register_notifier(), because it triggered false-positive warnings
from the cpufreq_stats module (cpufreq_stats don't work with the fast
switching, because it is based on notifiers).

Changes from v4:
- If cpufreq_enable_fast_switch() is about to fail, it will print the list
of currently registered transition notifiers.
- Added lockdep_assert_held(&policy->rwsem) to cpufreq_enable_fast_switch().
- Added WARN_ON() to the (cpufreq_fast_switch_count > 0) check in
cpufreq_register_notifier().
- Modified the kerneldoc comment of cpufreq_driver_fast_switch() to
mention the RELATION_L expectation regarding the ->fast_switch callback.

Changes from v3:
- New fast_switch_enabled field in struct cpufreq_policy to help
avoid affecting existing setups by setting the fast_switch_possible
flag in the driver.
- __cpufreq_get() skips the policy->cur check if fast_switch_enabled is set.

Changes from v2:
- The driver ->fast_switch callback and cpufreq_driver_fast_switch()
don't need the relation argument as they will always do RELATION_L now.
- New mechanism to make fast switch and cpufreq notifiers mutually
exclusive.
- cpufreq_driver_fast_switch() doesn't do anything in addition to
invoking the driver callback and returns its return value.

---
drivers/cpufreq/acpi-cpufreq.c | 42 +++++++++++++
drivers/cpufreq/cpufreq.c | 130 +++++++++++++++++++++++++++++++++++++++--
include/linux/cpufreq.h | 16 +++++
3 files changed, 183 insertions(+), 5 deletions(-)

Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
+++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
@@ -458,6 +458,43 @@ static int acpi_cpufreq_target(struct cp
return result;
}

+unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq)
+{
+ struct acpi_cpufreq_data *data = policy->driver_data;
+ struct acpi_processor_performance *perf;
+ struct cpufreq_frequency_table *entry;
+ unsigned int next_perf_state, next_freq, freq;
+
+ /*
+ * Find the closest frequency above target_freq.
+ *
+ * The table is sorted in the reverse order with respect to the
+ * frequency and all of the entries are valid (see the initialization).
+ */
+ entry = data->freq_table;
+ do {
+ entry++;
+ freq = entry->frequency;
+ } while (freq >= target_freq && freq != CPUFREQ_TABLE_END);
+ entry--;
+ next_freq = entry->frequency;
+ next_perf_state = entry->driver_data;
+
+ perf = to_perf_data(data);
+ if (perf->state == next_perf_state) {
+ if (unlikely(data->resume))
+ data->resume = 0;
+ else
+ return next_freq;
+ }
+
+ data->cpu_freq_write(&perf->control_register,
+ perf->states[next_perf_state].control);
+ perf->state = next_perf_state;
+ return next_freq;
+}
+
static unsigned long
acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu)
{
@@ -821,6 +858,9 @@ static int acpi_cpufreq_cpu_init(struct
*/
data->resume = 1;

+ policy->fast_switch_possible = !acpi_pstate_strict &&
+ !(policy_is_shared(policy) && policy->shared_type != CPUFREQ_SHARED_TYPE_ANY);
+
return result;

err_freqfree:
@@ -843,6 +883,7 @@ static int acpi_cpufreq_cpu_exit(struct
pr_debug("acpi_cpufreq_cpu_exit\n");

if (data) {
+ policy->fast_switch_possible = false;
policy->driver_data = NULL;
acpi_processor_unregister_performance(data->acpi_perf_cpu);
free_cpumask_var(data->freqdomain_cpus);
@@ -876,6 +917,7 @@ static struct freq_attr *acpi_cpufreq_at
static struct cpufreq_driver acpi_cpufreq_driver = {
.verify = cpufreq_generic_frequency_table_verify,
.target_index = acpi_cpufreq_target,
+ .fast_switch = acpi_cpufreq_fast_switch,
.bios_limit = acpi_processor_get_bios_limit,
.init = acpi_cpufreq_cpu_init,
.exit = acpi_cpufreq_cpu_exit,
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -102,6 +102,17 @@ struct cpufreq_policy {
*/
struct rw_semaphore rwsem;

+ /*
+ * Fast switch flags:
+ * - fast_switch_possible should be set by the driver if it can
+ * guarantee that frequency can be changed on any CPU sharing the
+ * policy and that the change will then affect all of the policy CPUs.
+ * - fast_switch_enabled is to be set by governors that support fast
+ * frequency switching with the help of cpufreq_enable_fast_switch().
+ */
+ bool fast_switch_possible;
+ bool fast_switch_enabled;
+
/* Synchronization for frequency transitions */
bool transition_ongoing; /* Tracks transition status */
spinlock_t transition_lock;
@@ -156,6 +167,7 @@ int cpufreq_get_policy(struct cpufreq_po
int cpufreq_update_policy(unsigned int cpu);
bool have_governor_per_policy(void);
struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
+void cpufreq_enable_fast_switch(struct cpufreq_policy *policy);
#else
static inline unsigned int cpufreq_get(unsigned int cpu)
{
@@ -236,6 +248,8 @@ struct cpufreq_driver {
unsigned int relation); /* Deprecated */
int (*target_index)(struct cpufreq_policy *policy,
unsigned int index);
+ unsigned int (*fast_switch)(struct cpufreq_policy *policy,
+ unsigned int target_freq);
/*
* Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION
* unset.
@@ -464,6 +478,8 @@ struct cpufreq_governor {
};

/* Pass a target to the cpufreq driver */
+unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq);
int cpufreq_driver_target(struct cpufreq_policy *policy,
unsigned int target_freq,
unsigned int relation);
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -77,6 +77,7 @@ static inline bool has_target(void)
static int cpufreq_governor(struct cpufreq_policy *policy, unsigned int event);
static unsigned int __cpufreq_get(struct cpufreq_policy *policy);
static int cpufreq_start_governor(struct cpufreq_policy *policy);
+static int cpufreq_exit_governor(struct cpufreq_policy *policy);

/**
* Two notifier lists: the "policy" list is involved in the
@@ -429,6 +430,68 @@ void cpufreq_freq_transition_end(struct
}
EXPORT_SYMBOL_GPL(cpufreq_freq_transition_end);

+/*
+ * Fast frequency switching status count. Positive means "enabled", negative
+ * means "disabled" and 0 means "not decided yet".
+ */
+static int cpufreq_fast_switch_count;
+static DEFINE_MUTEX(cpufreq_fast_switch_lock);
+
+static void cpufreq_list_transition_notifiers(void)
+{
+ struct notifier_block *nb;
+
+ pr_info("Registered transition notifiers:\n");
+
+ mutex_lock(&cpufreq_transition_notifier_list.mutex);
+
+ for (nb = cpufreq_transition_notifier_list.head; nb; nb = nb->next)
+ pr_info("%pF\n", nb->notifier_call);
+
+ mutex_unlock(&cpufreq_transition_notifier_list.mutex);
+}
+
+/**
+ * cpufreq_enable_fast_switch - Enable fast frequency switching for policy.
+ * @policy: cpufreq policy to enable fast frequency switching for.
+ *
+ * Try to enable fast frequency switching for @policy.
+ *
+ * The attempt will fail if there is at least one transition notifier registered
+ * at this point, as fast frequency switching is quite fundamentally at odds
+ * with transition notifiers. Thus if successful, it will make registration of
+ * transition notifiers fail going forward.
+ */
+void cpufreq_enable_fast_switch(struct cpufreq_policy *policy)
+{
+ lockdep_assert_held(&policy->rwsem);
+
+ if (!policy->fast_switch_possible)
+ return;
+
+ mutex_lock(&cpufreq_fast_switch_lock);
+ if (cpufreq_fast_switch_count >= 0) {
+ cpufreq_fast_switch_count++;
+ policy->fast_switch_enabled = true;
+ } else {
+ pr_warn("CPU%u: Fast frequency switching not enabled\n",
+ policy->cpu);
+ cpufreq_list_transition_notifiers();
+ }
+ mutex_unlock(&cpufreq_fast_switch_lock);
+}
+EXPORT_SYMBOL_GPL(cpufreq_enable_fast_switch);
+
+static void cpufreq_disable_fast_switch(struct cpufreq_policy *policy)
+{
+ mutex_lock(&cpufreq_fast_switch_lock);
+ if (policy->fast_switch_enabled) {
+ policy->fast_switch_enabled = false;
+ if (!WARN_ON(cpufreq_fast_switch_count <= 0))
+ cpufreq_fast_switch_count--;
+ }
+ mutex_unlock(&cpufreq_fast_switch_lock);
+}

/*********************************************************************
* SYSFS INTERFACE *
@@ -1319,7 +1382,7 @@ static void cpufreq_offline(unsigned int

/* If cpu is last user of policy, free policy */
if (has_target()) {
- ret = cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
+ ret = cpufreq_exit_governor(policy);
if (ret)
pr_err("%s: Failed to exit governor\n", __func__);
}
@@ -1447,8 +1510,12 @@ static unsigned int __cpufreq_get(struct

ret_freq = cpufreq_driver->get(policy->cpu);

- /* Updating inactive policies is invalid, so avoid doing that. */
- if (unlikely(policy_is_inactive(policy)))
+ /*
+ * Updating inactive policies is invalid, so avoid doing that. Also
+ * if fast frequency switching is used with the given policy, the check
+ * against policy->cur is pointless, so skip it in that case too.
+ */
+ if (unlikely(policy_is_inactive(policy)) || policy->fast_switch_enabled)
return ret_freq;

if (ret_freq && policy->cur &&
@@ -1672,8 +1739,18 @@ int cpufreq_register_notifier(struct not

switch (list) {
case CPUFREQ_TRANSITION_NOTIFIER:
+ mutex_lock(&cpufreq_fast_switch_lock);
+
+ if (cpufreq_fast_switch_count > 0) {
+ mutex_unlock(&cpufreq_fast_switch_lock);
+ return -EBUSY;
+ }
ret = srcu_notifier_chain_register(
&cpufreq_transition_notifier_list, nb);
+ if (!ret)
+ cpufreq_fast_switch_count--;
+
+ mutex_unlock(&cpufreq_fast_switch_lock);
break;
case CPUFREQ_POLICY_NOTIFIER:
ret = blocking_notifier_chain_register(
@@ -1706,8 +1783,14 @@ int cpufreq_unregister_notifier(struct n

switch (list) {
case CPUFREQ_TRANSITION_NOTIFIER:
+ mutex_lock(&cpufreq_fast_switch_lock);
+
ret = srcu_notifier_chain_unregister(
&cpufreq_transition_notifier_list, nb);
+ if (!ret && !WARN_ON(cpufreq_fast_switch_count >= 0))
+ cpufreq_fast_switch_count++;
+
+ mutex_unlock(&cpufreq_fast_switch_lock);
break;
case CPUFREQ_POLICY_NOTIFIER:
ret = blocking_notifier_chain_unregister(
@@ -1726,6 +1809,37 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie
* GOVERNORS *
*********************************************************************/

+/**
+ * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch.
+ * @policy: cpufreq policy to switch the frequency for.
+ * @target_freq: New frequency to set (may be approximate).
+ *
+ * Carry out a fast frequency switch without sleeping.
+ *
+ * The driver's ->fast_switch() callback invoked by this function must be
+ * suitable for being called from within RCU-sched read-side critical sections
+ * and it is expected to select the minimum available frequency greater than or
+ * equal to @target_freq (CPUFREQ_RELATION_L).
+ *
+ * This function must not be called if policy->fast_switch_enabled is unset.
+ *
+ * Governors calling this function must guarantee that it will never be invoked
+ * twice in parallel for the same policy and that it will never be called in
+ * parallel with either ->target() or ->target_index() for the same policy.
+ *
+ * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch()
+ * callback to indicate an error condition, the hardware configuration must be
+ * preserved.
+ */
+unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq)
+{
+ target_freq = clamp_val(target_freq, policy->min, policy->max);
+
+ return cpufreq_driver->fast_switch(policy, target_freq);
+}
+EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch);
+
/* Must set freqs->new to intermediate frequency */
static int __target_intermediate(struct cpufreq_policy *policy,
struct cpufreq_freqs *freqs, int index)
@@ -1946,6 +2060,12 @@ static int cpufreq_start_governor(struct
return ret ? ret : cpufreq_governor(policy, CPUFREQ_GOV_LIMITS);
}

+static int cpufreq_exit_governor(struct cpufreq_policy *policy)
+{
+ cpufreq_disable_fast_switch(policy);
+ return cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
+}
+
int cpufreq_register_governor(struct cpufreq_governor *governor)
{
int err;
@@ -2101,7 +2221,7 @@ static int cpufreq_set_policy(struct cpu
return ret;
}

- ret = cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
+ ret = cpufreq_exit_governor(policy);
if (ret) {
pr_err("%s: Failed to Exit Governor: %s (%d)\n",
__func__, old_gov->name, ret);
@@ -2118,7 +2238,7 @@ static int cpufreq_set_policy(struct cpu
pr_debug("cpufreq: governor change\n");
return 0;
}
- cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
+ cpufreq_exit_governor(policy);
}

/* new governor failed, so re-start old one */

2016-03-30 01:58:08

by Rafael J. Wysocki

Subject: [Update][PATCH v7 7/7] cpufreq: schedutil: New governor based on scheduler utilization data

From: Rafael J. Wysocki <[email protected]>

Add a new cpufreq scaling governor, called "schedutil", that uses
scheduler-provided CPU utilization information as input for making
its decisions.

Doing that is possible after commit 34e2c555f3e1 (cpufreq: Add
mechanism for registering utilization update callbacks) that
introduced cpufreq_update_util() called by the scheduler on
utilization changes (from CFS) and RT/DL task status updates.
In particular, CPU frequency scaling decisions may be based on
the utilization data passed to cpufreq_update_util() by CFS.

The new governor is relatively simple.

The frequency selection formula used by it depends on whether or not
the utilization is frequency-invariant. In the frequency-invariant
case the new CPU frequency is given by

next_freq = 1.25 * max_freq * util / max

where util and max are the last two arguments of cpufreq_update_util().
In turn, if util is not frequency-invariant, the maximum frequency in
the above formula is replaced with the current frequency of the CPU:

next_freq = 1.25 * curr_freq * util / max

The coefficient 1.25 corresponds to the frequency tipping point at
(util / max) = 0.8.

All of the computations are carried out in the utilization update
handlers provided by the new governor. One of those handlers is
used for cpufreq policies shared between multiple CPUs and the other
one is for policies with one CPU only (and therefore it doesn't need
to use any extra synchronization means).

The governor supports fast frequency switching if that is supported
by the cpufreq driver in use and possible for the given policy.
In the fast switching case, all operations of the governor take
place in its utilization update handlers. If fast switching cannot
be used, the frequency switch operations are carried out with the
help of a work item which only calls __cpufreq_driver_target()
(under a mutex) to trigger a frequency update (to a value already
computed beforehand in one of the utilization update handlers).

Currently, the governor treats all of the RT and DL tasks as
"unknown utilization" and sets the frequency to the allowed
maximum when updated from the RT or DL sched classes. That
heavy-handed approach should be replaced with something more
subtle and specifically targeted at RT and DL tasks.

The governor shares some tunables management code with the
"ondemand" and "conservative" governors and uses some common
definitions from cpufreq_governor.h, but apart from that it
is stand-alone.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

Changes from v6:
- Rebased on top of 4.6-rc1.
- Fixed the help text in Kconfig.
- sugov_should_update_freq() sets sg_policy->next_freq to UINT_MAX if
need_freq_update is set to enforce a frequency update (even if the new
frequency would be equal to the previously requested one).
- Dropped the limits check from sugov_update_commit() as
cpufreq_driver_fast_switch() applies the limits to the target frequency now.
- rate_limit_us_store() uses kstrtouint() to get the new tunable value.
- sugov_tunables_alloc() sets global_tunables (if necessary and possible).
- sugov_init() calls cpufreq_enable_fast_switch() and was rearranged a bit.

Changes from v5:
- Fixed sugov_update_commit() to set sg_policy->next_freq properly
in the "work item" branch.
- Used smp_processor_id() in sugov_irq_work() and restored work_in_progress.

Changes from v4:
- Use TICK_NSEC in sugov_next_freq_shared().
- Use schedule_work_on() to schedule work items and replace
work_in_progress with work_cpu (which is used both for scheduling
work items and as a "work in progress" marker).
- Rearrange sugov_update_commit() to only check policy->min/max if
fast switching is enabled.
- Replace util > max checks with util == ULONG_MAX checks to make
it clear that they are about a special case (RT/DL).

Changes from v3:
- The "next frequency" formula based on
http://marc.info/?l=linux-acpi&m=145756618321500&w=4 and
http://marc.info/?l=linux-kernel&m=145760739700716&w=4
- The governor goes into kernel/sched/ (again).

Changes from v2:
- The governor goes into drivers/cpufreq/.
- The "next frequency" formula has an additional 1.1 factor to allow
more util/max values to map onto the top-most frequency in case the
distance between that and the previous one is disproportionately small.
- sugov_update_commit() traces CPU frequency even if the new one is
the same as the previous one (otherwise, if the system is 100% loaded
for long enough, powertop starts to report that all CPUs are 100% idle).

---
drivers/cpufreq/Kconfig | 29 ++
kernel/sched/Makefile | 1
kernel/sched/cpufreq_schedutil.c | 528 +++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 8
4 files changed, 566 insertions(+)

Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -107,6 +107,16 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
Be aware that not all cpufreq drivers support the conservative
governor. If unsure have a look at the help section of the
driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+ bool "schedutil"
+ select CPU_FREQ_GOV_SCHEDUTIL
+ select CPU_FREQ_GOV_PERFORMANCE
+ help
+ Use the 'schedutil' CPUFreq governor by default. If unsure,
+ have a look at the help section of that governor. The fallback
+ governor will be 'performance'.
+
endchoice

config CPU_FREQ_GOV_PERFORMANCE
@@ -188,6 +198,25 @@ config CPU_FREQ_GOV_CONSERVATIVE

If in doubt, say N.

+config CPU_FREQ_GOV_SCHEDUTIL
+ tristate "'schedutil' cpufreq policy governor"
+ depends on CPU_FREQ
+ select CPU_FREQ_GOV_ATTR_SET
+ select IRQ_WORK
+ help
+ This governor makes decisions based on the utilization data provided
+ by the scheduler. It sets the CPU frequency to be proportional to
+ the utilization/capacity ratio coming from the scheduler. If the
+ utilization is frequency-invariant, the new frequency is also
+ proportional to the maximum available frequency. If that is not the
+ case, it is proportional to the current frequency of the CPU with the
+ tipping point at utilization/capacity equal to 80%.
+
+ To compile this driver as a module, choose M here: the module will
+ be called cpufreq_schedutil.
+
+ If in doubt, say N.
+
comment "CPU frequency scaling drivers"

config CPUFREQ_DT
Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- /dev/null
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -0,0 +1,528 @@
+/*
+ * CPUFreq governor based on scheduler-provided CPU utilization data.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <trace/events/power.h>
+
+#include "sched.h"
+
+struct sugov_tunables {
+ struct gov_attr_set attr_set;
+ unsigned int rate_limit_us;
+};
+
+struct sugov_policy {
+ struct cpufreq_policy *policy;
+
+ struct sugov_tunables *tunables;
+ struct list_head tunables_hook;
+
+ raw_spinlock_t update_lock; /* For shared policies */
+ u64 last_freq_update_time;
+ s64 freq_update_delay_ns;
+ unsigned int next_freq;
+
+ /* The next fields are only needed if fast switch cannot be used. */
+ struct irq_work irq_work;
+ struct work_struct work;
+ struct mutex work_lock;
+ bool work_in_progress;
+
+ bool need_freq_update;
+};
+
+struct sugov_cpu {
+ struct update_util_data update_util;
+ struct sugov_policy *sg_policy;
+
+ /* The fields below are only needed when sharing a policy. */
+ unsigned long util;
+ unsigned long max;
+ u64 last_update;
+};
+
+static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
+
+/************************ Governor internals ***********************/
+
+static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
+{
+ u64 delta_ns;
+
+ if (sg_policy->work_in_progress)
+ return false;
+
+ if (unlikely(sg_policy->need_freq_update)) {
+ sg_policy->need_freq_update = false;
+ /*
+ * This happens when limits change, so forget the previous
+ * next_freq value and force an update.
+ */
+ sg_policy->next_freq = UINT_MAX;
+ return true;
+ }
+
+ delta_ns = time - sg_policy->last_freq_update_time;
+ return (s64)delta_ns >= sg_policy->freq_update_delay_ns;
+}
+
+static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
+ unsigned int next_freq)
+{
+ struct cpufreq_policy *policy = sg_policy->policy;
+
+ sg_policy->last_freq_update_time = time;
+
+ if (policy->fast_switch_enabled) {
+ if (sg_policy->next_freq == next_freq) {
+ trace_cpu_frequency(policy->cur, smp_processor_id());
+ return;
+ }
+ sg_policy->next_freq = next_freq;
+ next_freq = cpufreq_driver_fast_switch(policy, next_freq);
+ if (next_freq == CPUFREQ_ENTRY_INVALID)
+ return;
+
+ policy->cur = next_freq;
+ trace_cpu_frequency(next_freq, smp_processor_id());
+ } else if (sg_policy->next_freq != next_freq) {
+ sg_policy->next_freq = next_freq;
+ sg_policy->work_in_progress = true;
+ irq_work_queue(&sg_policy->irq_work);
+ }
+}
+
+/**
+ * get_next_freq - Compute a new frequency for a given cpufreq policy.
+ * @policy: cpufreq policy object to compute the new frequency for.
+ * @util: Current CPU utilization.
+ * @max: CPU capacity.
+ *
+ * If the utilization is frequency-invariant, choose the new frequency to be
+ * proportional to it, that is
+ *
+ * next_freq = C * max_freq * util / max
+ *
+ * Otherwise, approximate the would-be frequency-invariant utilization by
+ * util_raw * (curr_freq / max_freq) which leads to
+ *
+ * next_freq = C * curr_freq * util_raw / max
+ *
+ * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8.
+ */
+static unsigned int get_next_freq(struct cpufreq_policy *policy,
+ unsigned long util, unsigned long max)
+{
+ unsigned int freq = arch_scale_freq_invariant() ?
+ policy->cpuinfo.max_freq : policy->cur;
+
+ return (freq + (freq >> 2)) * util / max;
+}
+
+static void sugov_update_single(struct update_util_data *hook, u64 time,
+ unsigned long util, unsigned long max)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ struct cpufreq_policy *policy = sg_policy->policy;
+ unsigned int next_f;
+
+ if (!sugov_should_update_freq(sg_policy, time))
+ return;
+
+ next_f = util == ULONG_MAX ? policy->cpuinfo.max_freq :
+ get_next_freq(policy, util, max);
+ sugov_update_commit(sg_policy, time, next_f);
+}
+
+static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
+ unsigned long util, unsigned long max)
+{
+ struct cpufreq_policy *policy = sg_policy->policy;
+ unsigned int max_f = policy->cpuinfo.max_freq;
+ u64 last_freq_update_time = sg_policy->last_freq_update_time;
+ unsigned int j;
+
+ if (util == ULONG_MAX)
+ return max_f;
+
+ for_each_cpu(j, policy->cpus) {
+ struct sugov_cpu *j_sg_cpu;
+ unsigned long j_util, j_max;
+ u64 delta_ns;
+
+ if (j == smp_processor_id())
+ continue;
+
+ j_sg_cpu = &per_cpu(sugov_cpu, j);
+ /*
+ * If the CPU utilization was last updated before the previous
+ * frequency update and the time elapsed between the last update
+ * of the CPU utilization and the last frequency update is long
+ * enough, don't take the CPU into account as it probably is
+ * idle now.
+ */
+ delta_ns = last_freq_update_time - j_sg_cpu->last_update;
+ if ((s64)delta_ns > TICK_NSEC)
+ continue;
+
+ j_util = j_sg_cpu->util;
+ if (j_util == ULONG_MAX)
+ return max_f;
+
+ j_max = j_sg_cpu->max;
+ if (j_util * max > j_max * util) {
+ util = j_util;
+ max = j_max;
+ }
+ }
+
+ return get_next_freq(policy, util, max);
+}
+
+static void sugov_update_shared(struct update_util_data *hook, u64 time,
+ unsigned long util, unsigned long max)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ unsigned int next_f;
+
+ raw_spin_lock(&sg_policy->update_lock);
+
+ sg_cpu->util = util;
+ sg_cpu->max = max;
+ sg_cpu->last_update = time;
+
+ if (sugov_should_update_freq(sg_policy, time)) {
+ next_f = sugov_next_freq_shared(sg_policy, util, max);
+ sugov_update_commit(sg_policy, time, next_f);
+ }
+
+ raw_spin_unlock(&sg_policy->update_lock);
+}
+
+static void sugov_work(struct work_struct *work)
+{
+ struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
+
+ mutex_lock(&sg_policy->work_lock);
+ __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
+ CPUFREQ_RELATION_L);
+ mutex_unlock(&sg_policy->work_lock);
+
+ sg_policy->work_in_progress = false;
+}
+
+static void sugov_irq_work(struct irq_work *irq_work)
+{
+ struct sugov_policy *sg_policy;
+
+ sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
+ schedule_work_on(smp_processor_id(), &sg_policy->work);
+}
+
+/************************** sysfs interface ************************/
+
+static struct sugov_tunables *global_tunables;
+static DEFINE_MUTEX(global_tunables_lock);
+
+static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set)
+{
+ return container_of(attr_set, struct sugov_tunables, attr_set);
+}
+
+static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+
+ return sprintf(buf, "%u\n", tunables->rate_limit_us);
+}
+
+static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf,
+ size_t count)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+ struct sugov_policy *sg_policy;
+ unsigned int rate_limit_us;
+
+ if (kstrtouint(buf, 10, &rate_limit_us))
+ return -EINVAL;
+
+ tunables->rate_limit_us = rate_limit_us;
+
+ list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
+ sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
+
+ return count;
+}
+
+static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
+
+static struct attribute *sugov_attributes[] = {
+ &rate_limit_us.attr,
+ NULL
+};
+
+static struct kobj_type sugov_tunables_ktype = {
+ .default_attrs = sugov_attributes,
+ .sysfs_ops = &governor_sysfs_ops,
+};
+
+/********************** cpufreq governor interface *********************/
+
+static struct cpufreq_governor schedutil_gov;
+
+static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy;
+
+ sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
+ if (!sg_policy)
+ return NULL;
+
+ sg_policy->policy = policy;
+ init_irq_work(&sg_policy->irq_work, sugov_irq_work);
+ INIT_WORK(&sg_policy->work, sugov_work);
+ mutex_init(&sg_policy->work_lock);
+ raw_spin_lock_init(&sg_policy->update_lock);
+ return sg_policy;
+}
+
+static void sugov_policy_free(struct sugov_policy *sg_policy)
+{
+ mutex_destroy(&sg_policy->work_lock);
+ kfree(sg_policy);
+}
+
+static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy)
+{
+ struct sugov_tunables *tunables;
+
+ tunables = kzalloc(sizeof(*tunables), GFP_KERNEL);
+ if (tunables) {
+ gov_attr_set_init(&tunables->attr_set, &sg_policy->tunables_hook);
+ if (!have_governor_per_policy())
+ global_tunables = tunables;
+ }
+ return tunables;
+}
+
+static void sugov_tunables_free(struct sugov_tunables *tunables)
+{
+ if (!have_governor_per_policy())
+ global_tunables = NULL;
+
+ kfree(tunables);
+}
+
+static int sugov_init(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy;
+ struct sugov_tunables *tunables;
+ unsigned int lat;
+ int ret = 0;
+
+ /* State should be equivalent to EXIT */
+ if (policy->governor_data)
+ return -EBUSY;
+
+ sg_policy = sugov_policy_alloc(policy);
+ if (!sg_policy)
+ return -ENOMEM;
+
+ mutex_lock(&global_tunables_lock);
+
+ if (global_tunables) {
+ if (WARN_ON(have_governor_per_policy())) {
+ ret = -EINVAL;
+ goto free_sg_policy;
+ }
+ policy->governor_data = sg_policy;
+ sg_policy->tunables = global_tunables;
+
+ gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook);
+ goto out;
+ }
+
+ tunables = sugov_tunables_alloc(sg_policy);
+ if (!tunables) {
+ ret = -ENOMEM;
+ goto free_sg_policy;
+ }
+
+ tunables->rate_limit_us = LATENCY_MULTIPLIER;
+ lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
+ if (lat)
+ tunables->rate_limit_us *= lat;
+
+ policy->governor_data = sg_policy;
+ sg_policy->tunables = tunables;
+
+ ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
+ get_governor_parent_kobj(policy), "%s",
+ schedutil_gov.name);
+ if (ret)
+ goto fail;
+
+ out:
+ mutex_unlock(&global_tunables_lock);
+
+ cpufreq_enable_fast_switch(policy);
+ return 0;
+
+ fail:
+ policy->governor_data = NULL;
+ sugov_tunables_free(tunables);
+
+ free_sg_policy:
+ mutex_unlock(&global_tunables_lock);
+
+ sugov_policy_free(sg_policy);
+ pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
+ return ret;
+}
+
+static int sugov_exit(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ struct sugov_tunables *tunables = sg_policy->tunables;
+ unsigned int count;
+
+ mutex_lock(&global_tunables_lock);
+
+ count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook);
+ policy->governor_data = NULL;
+ if (!count)
+ sugov_tunables_free(tunables);
+
+ mutex_unlock(&global_tunables_lock);
+
+ sugov_policy_free(sg_policy);
+ return 0;
+}
+
+static int sugov_start(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ unsigned int cpu;
+
+ sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+ sg_policy->last_freq_update_time = 0;
+ sg_policy->next_freq = UINT_MAX;
+ sg_policy->work_in_progress = false;
+ sg_policy->need_freq_update = false;
+
+ for_each_cpu(cpu, policy->cpus) {
+ struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
+
+ sg_cpu->sg_policy = sg_policy;
+ if (policy_is_shared(policy)) {
+ sg_cpu->util = ULONG_MAX;
+ sg_cpu->max = 0;
+ sg_cpu->last_update = 0;
+ cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
+ sugov_update_shared);
+ } else {
+ cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
+ sugov_update_single);
+ }
+ }
+ return 0;
+}
+
+static int sugov_stop(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ unsigned int cpu;
+
+ for_each_cpu(cpu, policy->cpus)
+ cpufreq_remove_update_util_hook(cpu);
+
+ synchronize_sched();
+
+ irq_work_sync(&sg_policy->irq_work);
+ cancel_work_sync(&sg_policy->work);
+ return 0;
+}
+
+static int sugov_limits(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+
+ if (!policy->fast_switch_enabled) {
+ mutex_lock(&sg_policy->work_lock);
+
+ if (policy->max < policy->cur)
+ __cpufreq_driver_target(policy, policy->max,
+ CPUFREQ_RELATION_H);
+ else if (policy->min > policy->cur)
+ __cpufreq_driver_target(policy, policy->min,
+ CPUFREQ_RELATION_L);
+
+ mutex_unlock(&sg_policy->work_lock);
+ }
+
+ sg_policy->need_freq_update = true;
+ return 0;
+}
+
+int sugov_governor(struct cpufreq_policy *policy, unsigned int event)
+{
+ if (event == CPUFREQ_GOV_POLICY_INIT) {
+ return sugov_init(policy);
+ } else if (policy->governor_data) {
+ switch (event) {
+ case CPUFREQ_GOV_POLICY_EXIT:
+ return sugov_exit(policy);
+ case CPUFREQ_GOV_START:
+ return sugov_start(policy);
+ case CPUFREQ_GOV_STOP:
+ return sugov_stop(policy);
+ case CPUFREQ_GOV_LIMITS:
+ return sugov_limits(policy);
+ }
+ }
+ return -EINVAL;
+}
+
+static struct cpufreq_governor schedutil_gov = {
+ .name = "schedutil",
+ .governor = sugov_governor,
+ .owner = THIS_MODULE,
+};
+
+static int __init sugov_module_init(void)
+{
+ return cpufreq_register_governor(&schedutil_gov);
+}
+
+static void __exit sugov_module_exit(void)
+{
+ cpufreq_unregister_governor(&schedutil_gov);
+}
+
+MODULE_AUTHOR("Rafael J. Wysocki <[email protected]>");
+MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
+MODULE_LICENSE("GPL");
+
+#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+struct cpufreq_governor *cpufreq_default_governor(void)
+{
+ return &schedutil_gov;
+}
+
+fs_initcall(sugov_module_init);
+#else
+module_init(sugov_module_init);
+#endif
+module_exit(sugov_module_exit);
Index: linux-pm/kernel/sched/Makefile
===================================================================
--- linux-pm.orig/kernel/sched/Makefile
+++ linux-pm/kernel/sched/Makefile
@@ -24,3 +24,4 @@ obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
Index: linux-pm/kernel/sched/sched.h
===================================================================
--- linux-pm.orig/kernel/sched/sched.h
+++ linux-pm/kernel/sched/sched.h
@@ -1842,6 +1842,14 @@ static inline void cpufreq_update_util(u
static inline void cpufreq_trigger_update(u64 time) {}
#endif /* CONFIG_CPU_FREQ */

+#ifdef arch_scale_freq_capacity
+#ifndef arch_scale_freq_invariant
+#define arch_scale_freq_invariant() (true)
+#endif
+#else /* arch_scale_freq_capacity */
+#define arch_scale_freq_invariant() (false)
+#endif
+
static inline void account_reset_rq(struct rq *rq)
{
#ifdef CONFIG_IRQ_TIME_ACCOUNTING

2016-03-30 04:10:17

by Viresh Kumar

Subject: Re: [PATCH v6 7/7][Resend] cpufreq: schedutil: New governor based on scheduler utilization data

On 29-03-16, 14:58, Rafael J. Wysocki wrote:
> On Monday, March 28, 2016 02:33:33 PM Viresh Kumar wrote:
> > Its gonna be same for all policies sharing tunables ..
>
> The value will be the same, but the cacheline won't.

Fair enough. So this information is replicated for each policy for performance
benefits. Would it make sense to add a comment about that? It's not obvious today
why we are keeping this per-policy.

> > > +static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf,
> > > + size_t count)
> > > +{
> > > + struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
> > > + struct sugov_policy *sg_policy;
> > > + unsigned int rate_limit_us;
> > > + int ret;
> > > +
> > > + ret = sscanf(buf, "%u", &rate_limit_us);
> > > + if (ret != 1)
> > > + return -EINVAL;
> > > +
> > > + tunables->rate_limit_us = rate_limit_us;
> > > +
> > > + list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
> > > + sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
> > > +
> > > + return count;
> > > +}
> > > +
> > > +static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
> >
> > Why not reuse gov_attr_rw() ?
>
> Would it work?

Why wouldn't it? I had another look at it and couldn't find a reason for it
not to work. Perhaps I missed something.

> > > + ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
> > > + get_governor_parent_kobj(policy), "%s",
> > > + schedutil_gov.name);
> > > + if (!ret)
> > > + goto out;
> > > +
> > > + /* Failure, so roll back. */
> > > + policy->governor_data = NULL;
> > > + sugov_tunables_free(tunables);
> > > +
> > > + free_sg_policy:
> > > + pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
> > > + sugov_policy_free(sg_policy);
> >
> > I didn't like the way we have mixed success and failure path here, just to save
> > a single line of code (unlock).
>
> I don't follow, sorry. Yes, I can do unlock/return instead of the "goto out",
> but then the goto label is still needed.

Sorry for not being clear earlier, but this is what I was suggesting it should look like:

---
static int sugov_init(struct cpufreq_policy *policy)
{
	struct sugov_policy *sg_policy;
	struct sugov_tunables *tunables;
	unsigned int lat;
	int ret = 0;

	/* State should be equivalent to EXIT */
	if (policy->governor_data)
		return -EBUSY;

	sg_policy = sugov_policy_alloc(policy);
	if (!sg_policy)
		return -ENOMEM;

	mutex_lock(&global_tunables_lock);

	if (global_tunables) {
		if (WARN_ON(have_governor_per_policy())) {
			ret = -EINVAL;
			goto free_sg_policy;
		}
		policy->governor_data = sg_policy;
		sg_policy->tunables = global_tunables;

		gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook);

		mutex_unlock(&global_tunables_lock);
		return 0;
	}

	tunables = sugov_tunables_alloc(sg_policy);
	if (!tunables) {
		ret = -ENOMEM;
		goto free_sg_policy;
	}

	tunables->rate_limit_us = LATENCY_MULTIPLIER;
	lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
	if (lat)
		tunables->rate_limit_us *= lat;

	if (!have_governor_per_policy())
		global_tunables = tunables;

	policy->governor_data = sg_policy;
	sg_policy->tunables = tunables;

	ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
				   get_governor_parent_kobj(policy), "%s",
				   schedutil_gov.name);
	if (!ret) {
		mutex_unlock(&global_tunables_lock);
		return 0;
	}

	/* Failure, so roll back. */
	policy->governor_data = NULL;
	sugov_tunables_free(tunables);

free_sg_policy:
	mutex_unlock(&global_tunables_lock);

	pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
	sugov_policy_free(sg_policy);

	return ret;
}

---

> > Over that it does things, that aren't symmetric anymore. For example, we have
> > called sugov_policy_alloc() without locks
>
> Are you sure?

Yes.

> > and are freeing it from within locks.
>
> Both are under global_tunables_lock.

No, sugov_policy_alloc() isn't called from within locks.

> > > +static int __init sugov_module_init(void)
> > > +{
> > > + return cpufreq_register_governor(&schedutil_gov);
> > > +}
> > > +
> > > +static void __exit sugov_module_exit(void)
> > > +{
> > > + cpufreq_unregister_governor(&schedutil_gov);
> > > +}
> > > +
> > > +MODULE_AUTHOR("Rafael J. Wysocki <[email protected]>");
> > > +MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
> > > +MODULE_LICENSE("GPL");
> >
> > Maybe a MODULE_ALIAS as well ?
>
> Sorry, I don't follow.

Oh, I was just saying that we may also want to add a MODULE_ALIAS() line here
to help auto-loading when it is built as a module.

--
viresh

2016-03-30 05:07:38

by Viresh Kumar

Subject: Re: [Update][PATCH v7 6/7] cpufreq: Support for fast frequency switching

On 30-03-16, 03:47, Rafael J. Wysocki wrote:
> Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
> @@ -843,6 +883,7 @@ static int acpi_cpufreq_cpu_exit(struct
> pr_debug("acpi_cpufreq_cpu_exit\n");
>
> if (data) {
> + policy->fast_switch_possible = false;

Is this done just to keep the code symmetric, or is there a logical advantage
to it? Just for my understanding, not saying that it is wrong.

Otherwise, it looks good

Acked-by: Viresh Kumar <[email protected]>

--
viresh

2016-03-30 05:30:17

by Viresh Kumar

Subject: Re: [Update][PATCH v7 7/7] cpufreq: schedutil: New governor based on scheduler utilization data

On 30-03-16, 04:00, Rafael J. Wysocki wrote:
> +static int sugov_init(struct cpufreq_policy *policy)
> +{
> + struct sugov_policy *sg_policy;
> + struct sugov_tunables *tunables;
> + unsigned int lat;
> + int ret = 0;
> +
> + /* State should be equivalent to EXIT */
> + if (policy->governor_data)
> + return -EBUSY;
> +
> + sg_policy = sugov_policy_alloc(policy);
> + if (!sg_policy)
> + return -ENOMEM;
> +
> + mutex_lock(&global_tunables_lock);
> +
> + if (global_tunables) {
> + if (WARN_ON(have_governor_per_policy())) {
> + ret = -EINVAL;
> + goto free_sg_policy;
> + }
> + policy->governor_data = sg_policy;
> + sg_policy->tunables = global_tunables;
> +
> + gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook);
> + goto out;
> + }
> +
> + tunables = sugov_tunables_alloc(sg_policy);
> + if (!tunables) {
> + ret = -ENOMEM;
> + goto free_sg_policy;
> + }
> +
> + tunables->rate_limit_us = LATENCY_MULTIPLIER;
> + lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
> + if (lat)
> + tunables->rate_limit_us *= lat;
> +
> + policy->governor_data = sg_policy;
> + sg_policy->tunables = tunables;
> +
> + ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
> + get_governor_parent_kobj(policy), "%s",
> + schedutil_gov.name);
> + if (ret)
> + goto fail;
> +
> + out:
> + mutex_unlock(&global_tunables_lock);
> +
> + cpufreq_enable_fast_switch(policy);
> + return 0;
> +
> + fail:
> + policy->governor_data = NULL;
> + sugov_tunables_free(tunables);
> +
> + free_sg_policy:
> + mutex_unlock(&global_tunables_lock);
> +
> + sugov_policy_free(sg_policy);
> + pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
> + return ret;
> +}

The current version of this looks good to me and takes care of all the issues I
raised earlier. Thanks.

> +static int sugov_limits(struct cpufreq_policy *policy)
> +{
> + struct sugov_policy *sg_policy = policy->governor_data;
> +
> + if (!policy->fast_switch_enabled) {
> + mutex_lock(&sg_policy->work_lock);
> +
> + if (policy->max < policy->cur)
> + __cpufreq_driver_target(policy, policy->max,
> + CPUFREQ_RELATION_H);
> + else if (policy->min > policy->cur)
> + __cpufreq_driver_target(policy, policy->min,
> + CPUFREQ_RELATION_L);
> +
> + mutex_unlock(&sg_policy->work_lock);
> + }
> +
> + sg_policy->need_freq_update = true;

I am wondering why we need to do this for !fast_switch_enabled case?

> + return 0;
> +}

Apart from that:

Acked-by: Viresh Kumar <[email protected]>

--
viresh

2016-03-30 11:29:03

by Rafael J. Wysocki

Subject: Re: [Update][PATCH v7 6/7] cpufreq: Support for fast frequency switching

On Wed, Mar 30, 2016 at 7:07 AM, Viresh Kumar <[email protected]> wrote:
> On 30-03-16, 03:47, Rafael J. Wysocki wrote:
>> Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
>> @@ -843,6 +883,7 @@ static int acpi_cpufreq_cpu_exit(struct
>> pr_debug("acpi_cpufreq_cpu_exit\n");
>>
>> if (data) {
>> + policy->fast_switch_possible = false;
>
> Is this done just for keeping code symmetric or is there a logical advantage
> of this? Just for my understanding, not saying that it is wrong.

It is not necessary for correctness today, as schedutil will be the
only governor using fast switch, but generally that prevents leaking
configuration information from one governor to another.

Thanks,
Rafael

2016-03-30 11:31:20

by Rafael J. Wysocki

Subject: Re: [Update][PATCH v7 7/7] cpufreq: schedutil: New governor based on scheduler utilization data

[cut]

> The current version of this looks good to me and takes care of all the issues I
> raised earlier. Thanks.
>
>> +static int sugov_limits(struct cpufreq_policy *policy)
>> +{
>> + struct sugov_policy *sg_policy = policy->governor_data;
>> +
>> + if (!policy->fast_switch_enabled) {
>> + mutex_lock(&sg_policy->work_lock);
>> +
>> + if (policy->max < policy->cur)
>> + __cpufreq_driver_target(policy, policy->max,
>> + CPUFREQ_RELATION_H);
>> + else if (policy->min > policy->cur)
>> + __cpufreq_driver_target(policy, policy->min,
>> + CPUFREQ_RELATION_L);
>> +
>> + mutex_unlock(&sg_policy->work_lock);
>> + }
>> +
>> + sg_policy->need_freq_update = true;
>
> I am wondering why we need to do this for !fast_switch_enabled case?

That will cause the rate limit to be ignored in the utilization update
handler which may be necessary if it is set to a relatively large
value (like 1 s).

>> + return 0;
>> +}
>
> Apart from that:
>
> Acked-by: Viresh Kumar <[email protected]>

Thanks,
Rafael

2016-03-30 17:06:00

by Steve Muckle

Subject: Re: [Update][PATCH v7 7/7] cpufreq: schedutil: New governor based on scheduler utilization data

On 03/30/2016 04:31 AM, Rafael J. Wysocki wrote:
>>> >> +static int sugov_limits(struct cpufreq_policy *policy)
>>> >> +{
>>> >> + struct sugov_policy *sg_policy = policy->governor_data;
>>> >> +
>>> >> + if (!policy->fast_switch_enabled) {
>>> >> + mutex_lock(&sg_policy->work_lock);
>>> >> +
>>> >> + if (policy->max < policy->cur)
>>> >> + __cpufreq_driver_target(policy, policy->max,
>>> >> + CPUFREQ_RELATION_H);
>>> >> + else if (policy->min > policy->cur)
>>> >> + __cpufreq_driver_target(policy, policy->min,
>>> >> + CPUFREQ_RELATION_L);
>>> >> +
>>> >> + mutex_unlock(&sg_policy->work_lock);
>>> >> + }
>>> >> +
>>> >> + sg_policy->need_freq_update = true;
>> >
>> > I am wondering why we need to do this for !fast_switch_enabled case?
>
> That will cause the rate limit to be ignored in the utilization update
> handler which may be necessary if it is set to a relatively large
> value (like 1 s).

But why is that necessary for !fast_switch_enabled? In that case the
frequency has been adjusted to satisfy the new limits here, so ignoring
the rate limit shouldn't be necessary. In other words why not

	} else {
		sg_policy->need_freq_update = true;
	}

2016-03-30 17:24:14

by Rafael J. Wysocki

Subject: Re: [Update][PATCH v7 7/7] cpufreq: schedutil: New governor based on scheduler utilization data

On Wed, Mar 30, 2016 at 7:05 PM, Steve Muckle <[email protected]> wrote:
> On 03/30/2016 04:31 AM, Rafael J. Wysocki wrote:
>>>> >> +static int sugov_limits(struct cpufreq_policy *policy)
>>>> >> +{
>>>> >> + struct sugov_policy *sg_policy = policy->governor_data;
>>>> >> +
>>>> >> + if (!policy->fast_switch_enabled) {
>>>> >> + mutex_lock(&sg_policy->work_lock);
>>>> >> +
>>>> >> + if (policy->max < policy->cur)
>>>> >> + __cpufreq_driver_target(policy, policy->max,
>>>> >> + CPUFREQ_RELATION_H);
>>>> >> + else if (policy->min > policy->cur)
>>>> >> + __cpufreq_driver_target(policy, policy->min,
>>>> >> + CPUFREQ_RELATION_L);
>>>> >> +
>>>> >> + mutex_unlock(&sg_policy->work_lock);
>>>> >> + }
>>>> >> +
>>>> >> + sg_policy->need_freq_update = true;
>>> >
>>> > I am wondering why we need to do this for !fast_switch_enabled case?
>>
>> That will cause the rate limit to be ignored in the utilization update
>> handler which may be necessary if it is set to a relatively large
>> value (like 1 s).
>
> But why is that necessary for !fast_switch_enabled? In that case the
> frequency has been adjusted to satisfy the new limits here, so ignoring
> the rate limit shouldn't be necessary. In other words why not
>
> 	} else {
> 		sg_policy->need_freq_update = true;
> 	}

My thinking here was that the governor might decide to use something
different from the limit enforced here, so it would be good to make it
do so as soon as possible. In particular in the
non-frequency-invariant utilization case in which new frequency
depends on the current one.

That said, I'm not particularly opposed to making that change if that's
preferred.

Thanks,
Rafael

2016-03-31 01:44:06

by Steve Muckle

Subject: Re: [Update][PATCH v7 7/7] cpufreq: schedutil: New governor based on scheduler utilization data

On 03/30/2016 10:24 AM, Rafael J. Wysocki wrote:
> On Wed, Mar 30, 2016 at 7:05 PM, Steve Muckle <[email protected]> wrote:
>> On 03/30/2016 04:31 AM, Rafael J. Wysocki wrote:
>>>>>>> +static int sugov_limits(struct cpufreq_policy *policy)
>>>>>>> +{
>>>>>>> + struct sugov_policy *sg_policy = policy->governor_data;
>>>>>>> +
>>>>>>> + if (!policy->fast_switch_enabled) {
>>>>>>> + mutex_lock(&sg_policy->work_lock);
>>>>>>> +
>>>>>>> + if (policy->max < policy->cur)
>>>>>>> + __cpufreq_driver_target(policy, policy->max,
>>>>>>> + CPUFREQ_RELATION_H);
>>>>>>> + else if (policy->min > policy->cur)
>>>>>>> + __cpufreq_driver_target(policy, policy->min,
>>>>>>> + CPUFREQ_RELATION_L);
>>>>>>> +
>>>>>>> + mutex_unlock(&sg_policy->work_lock);
>>>>>>> + }
>>>>>>> +
>>>>>>> + sg_policy->need_freq_update = true;
>>>>>
>>>>> I am wondering why we need to do this for !fast_switch_enabled case?
>>>
>>> That will cause the rate limit to be ignored in the utilization update
>>> handler which may be necessary if it is set to a relatively large
>>> value (like 1 s).
>>
>> But why is that necessary for !fast_switch_enabled? In that case the
>> frequency has been adjusted to satisfy the new limits here, so ignoring
>> the rate limit shouldn't be necessary. In other words why not
>>
>> } else {
>> sg_policy->need_freq_update = true;
>> }
>
> My thinking here was that the governor might decide to use something
> different from the limit enforced here, so it would be good to make it
> do so as soon as possible. In particular in the
> non-frequency-invariant utilization case in which new frequency
> depends on the current one.
>
> That said, I'm not particularly opposed to making that change if that's
> preferred.

Ah, OK, fair enough. No strong opinion from me...

thanks,
Steve

2016-03-31 12:12:53

by Peter Zijlstra

Subject: Re: [Update][PATCH v7 7/7] cpufreq: schedutil: New governor based on scheduler utilization data


Ingo reminded me that the schedutil governor is part of the scheduler
proper and can access scheduler data because of that.

This allows us to remove the util and max arguments since only the
schedutil governor will use those, which leads to some further text
reduction:

43595 1226 24 44845 af2d defconfig-build/kernel/sched/fair.o.pre
42907 1226 24 44157 ac7d defconfig-build/kernel/sched/fair.o.post

Of course, we get more text in schedutil in return, but the below also
shows how we can benefit from not being tied to those two parameters by
doing a very coarse deadline reservation.

---
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -248,8 +248,7 @@ static void dbs_irq_work(struct irq_work
schedule_work_on(smp_processor_id(), &policy_dbs->work);
}

-static void dbs_update_util_handler(struct update_util_data *data, u64 time,
- unsigned long util, unsigned long max)
+static void dbs_update_util_handler(struct update_util_data *data, u64 time)
{
struct cpu_dbs_info *cdbs = container_of(data, struct cpu_dbs_info, update_util);
struct policy_dbs_info *policy_dbs = cdbs->policy_dbs;
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1032,8 +1032,7 @@ static inline void intel_pstate_adjust_b
get_avg_frequency(cpu));
}

-static void intel_pstate_update_util(struct update_util_data *data, u64 time,
- unsigned long util, unsigned long max)
+static void intel_pstate_update_util(struct update_util_data *data, u64 time)
{
struct cpudata *cpu = container_of(data, struct cpudata, update_util);
u64 delta_ns = time - cpu->sample.time;
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3236,13 +3236,11 @@ static inline unsigned long rlimit_max(u

#ifdef CONFIG_CPU_FREQ
struct update_util_data {
- void (*func)(struct update_util_data *data,
- u64 time, unsigned long util, unsigned long max);
+ void (*func)(struct update_util_data *data, u64 time);
};

void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
- void (*func)(struct update_util_data *data, u64 time,
- unsigned long util, unsigned long max));
+ void (*func)(struct update_util_data *data, u64 time));
void cpufreq_remove_update_util_hook(int cpu);
#endif /* CONFIG_CPU_FREQ */

--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq.c
@@ -32,8 +32,7 @@ DEFINE_PER_CPU(struct update_util_data *
* called or it will WARN() and return with no effect.
*/
void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
- void (*func)(struct update_util_data *data, u64 time,
- unsigned long util, unsigned long max))
+ void (*func)(struct update_util_data *data, u64 time))
{
if (WARN_ON(!data || !func))
return;
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -129,19 +129,55 @@ static unsigned int get_next_freq(struct
return (freq + (freq >> 2)) * util / max;
}

-static void sugov_update_single(struct update_util_data *hook, u64 time,
- unsigned long util, unsigned long max)
+static void sugov_get_util(unsigned long *util, unsigned long *max)
+{
+ unsigned long dl_util, dl_max;
+ unsigned long cfs_util, cfs_max;
+ int cpu = smp_processor_id();
+ struct dl_bw *dl_bw = dl_bw_of(cpu);
+ struct rq *rq = this_rq();
+
+ if (rt_prio(current->prio)) {
+ /*
+ * Punt for now; maybe do something based on sysctl_sched_rt_*.
+ */
+ *util = ULONG_MAX;
+ return;
+ }
+
+ dl_max = dl_bw_cpus(cpu) << 20;
+ dl_util = dl_bw->total_bw;
+
+ cfs_max = rq->cpu_capacity_orig;
+ cfs_util = min(rq->cfs.avg.util_avg, cfs_max);
+
+ if (cfs_util * dl_max > dl_util * cfs_max) {
+ *util = cfs_util;
+ *max = cfs_max;
+ } else {
+ *util = dl_util;
+ *max = dl_max;
+ }
+}
+
+static void sugov_update_single(struct update_util_data *hook, u64 time)
{
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
struct cpufreq_policy *policy = sg_policy->policy;
+ unsigned long util, max;
unsigned int next_f;

if (!sugov_should_update_freq(sg_policy, time))
return;

- next_f = util == ULONG_MAX ? policy->cpuinfo.max_freq :
- get_next_freq(policy, util, max);
+ sugov_get_util(&util, &max);
+
+ if (util == ULONG_MAX)
+ next_f = policy->cpuinfo.max_freq;
+ else
+ next_f = get_next_freq(policy, util, max);
+
sugov_update_commit(sg_policy, time, next_f);
}

@@ -190,13 +226,15 @@ static unsigned int sugov_next_freq_shar
return get_next_freq(policy, util, max);
}

-static void sugov_update_shared(struct update_util_data *hook, u64 time,
- unsigned long util, unsigned long max)
+static void sugov_update_shared(struct update_util_data *hook, u64 time)
{
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ unsigned long util, max;
unsigned int next_f;

+ sugov_get_util(&util, &max);
+
raw_spin_lock(&sg_policy->update_lock);

sg_cpu->util = util;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2823,12 +2823,8 @@ static inline u64 cfs_rq_clock_task(stru

static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
{
- struct rq *rq = rq_of(cfs_rq);
- int cpu = cpu_of(rq);
-
- if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
- unsigned long max = rq->cpu_capacity_orig;
-
+ if (&this_rq()->cfs == cfs_rq) {
+ struct rq *rq = rq_of(cfs_rq);
/*
* There are a few boundary cases this might miss but it should
* get called often enough that that should (hopefully) not be
@@ -2845,8 +2841,7 @@ static inline void cfs_rq_util_change(st
*
* See cpu_util().
*/
- cpufreq_update_util(rq_clock(rq),
- min(cfs_rq->avg.util_avg, max), max);
+ cpufreq_update_util(rq_clock(rq));
}
}

--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -183,6 +183,7 @@ static inline int dl_bandwidth_enabled(v
}

extern struct dl_bw *dl_bw_of(int i);
+extern int dl_bw_cpus(int i);

struct dl_bw {
raw_spinlock_t lock;
@@ -1808,13 +1809,13 @@ DECLARE_PER_CPU(struct update_util_data
*
* It can only be called from RCU-sched read-side critical sections.
*/
-static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+static inline void cpufreq_update_util(u64 time)
{
- struct update_util_data *data;
+ struct update_util_data *data;

- data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
- if (data)
- data->func(data, time, util, max);
+ data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
+ if (data)
+ data->func(data, time);
}

/**
@@ -1835,10 +1836,10 @@ static inline void cpufreq_update_util(u
*/
static inline void cpufreq_trigger_update(u64 time)
{
- cpufreq_update_util(time, ULONG_MAX, 0);
+ cpufreq_update_util(time);
}
#else
-static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) {}
+static inline void cpufreq_update_util(u64 time) {}
static inline void cpufreq_trigger_update(u64 time) {}
#endif /* CONFIG_CPU_FREQ */


2016-03-31 12:18:38

by Rafael J. Wysocki

Subject: Re: [Update][PATCH v7 7/7] cpufreq: schedutil: New governor based on scheduler utilization data

On Thu, Mar 31, 2016 at 2:12 PM, Peter Zijlstra <[email protected]> wrote:
>
> Ingo reminded me that the schedutil governor is part of the scheduler
> proper and can access scheduler data because of that.
>
> This allows us to remove the util and max arguments since only the
> schedutil governor will use those, which leads to some further text
> reduction:
>
> 43595 1226 24 44845 af2d defconfig-build/kernel/sched/fair.o.pre
> 42907 1226 24 44157 ac7d defconfig-build/kernel/sched/fair.o.post
>
> Of course, we get more text in schedutil in return, but the below also
> shows how we can benefit from not being tied to those two parameters by
> doing a very coarse deadline reservation.

OK

Do you want this to go into the series, be folded into the schedutil
patch, or go on top of it?

2016-03-31 12:24:51

by Peter Zijlstra

Subject: Re: [PATCH v6 7/7][Resend] cpufreq: schedutil: New governor based on scheduler utilization data

On Mon, Mar 28, 2016 at 11:17:44AM -0700, Steve Muckle wrote:
> The scenario I'm contemplating is that while a CPU-intensive task is
> running a thermal interrupt goes off. The driver for this thermal
> interrupt responds by capping fmax. If this happens just after the tick,
> it seems possible that we could wait a full tick before changing the
> frequency. Given a 10ms tick it could be rather annoying for thermal
> management algorithms on some platforms (I'm familiar with a few).

So I'm blissfully unaware of all the thermal stuff we have; but it
looks like it's somehow bolted onto cpufreq without feedback.

The thing I worry about is thermal scaling the CPU back past where RT/DL
tasks can still complete in time. It should not be able to do that, or
rather, missing deadlines because of thermal throttling is about as useful
as rebooting the device.

I guess what I'm saying is that the whole cpufreq/thermal 'interface' needs
work anyhow.

2016-03-31 12:28:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 7/7][Resend] cpufreq: schedutil: New governor based on scheduler utilization data

On Wed, Mar 30, 2016 at 03:12:40AM +0200, Rafael J. Wysocki wrote:
> Except that for_each_cpu_and_not is not defined as of today.
>
> I guess I can play with cpumasks, but then I'm not sure that will end
> up actually more efficient.

It will not indeed.

2016-03-31 12:32:49

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v6 7/7][Resend] cpufreq: schedutil: New governor based on scheduler utilization data

On Thu, Mar 31, 2016 at 2:24 PM, Peter Zijlstra <[email protected]> wrote:
> On Mon, Mar 28, 2016 at 11:17:44AM -0700, Steve Muckle wrote:
>> The scenario I'm contemplating is that while a CPU-intensive task is
>> running a thermal interrupt goes off. The driver for this thermal
>> interrupt responds by capping fmax. If this happens just after the tick,
>> it seems possible that we could wait a full tick before changing the
>> frequency. Given a 10ms tick it could be rather annoying for thermal
>> management algorithms on some platforms (I'm familiar with a few).
>
> So I'm blissfully unaware of all the thermal stuff we have, but it
> looks like it's somehow bolted onto cpufreq without feedback.
>
> The thing I worry about is thermal scaling the CPU back past the point
> where RT/DL tasks can still complete in time. It should not be able to
> do that; missing deadlines because of thermal throttling is about as
> useful as rebooting the device.

Right. If thermal throttling kicks in, the game is pretty much over.

That's why ideas float about taking the thermal constraints into
account upfront, but that's a different discussion entirely.

> I guess what I'm saying is that the whole cpufreq/thermal 'interface'
> needs work anyhow.

Yes, it does.

2016-03-31 12:42:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [Update][PATCH v7 7/7] cpufreq: schedutil: New governor based on scheduler utilization data

On Thu, Mar 31, 2016 at 02:18:33PM +0200, Rafael J. Wysocki wrote:
> On Thu, Mar 31, 2016 at 2:12 PM, Peter Zijlstra <[email protected]> wrote:
> >
> > Ingo reminded me that the schedutil governor is part of the scheduler
> > proper and can access scheduler data because of that.
> >
> > This allows us to remove the util and max arguments since only the
> > schedutil governor will use those, which leads to some further text
> > reduction:
> >
> > 43595 1226 24 44845 af2d defconfig-build/kernel/sched/fair.o.pre
> > 42907 1226 24 44157 ac7d defconfig-build/kernel/sched/fair.o.post
> >
> > Of course, we get more text in schedutil in return, but the below also
> > shows how we can benefit from not being tied to those two parameters by
> > doing a very coarse deadline reservation.
>
> OK
>
> Do you want this to go into the series or be folded into the schedutil
> patch or on top of it?

We can do it on top; no need to rush this.

2016-03-31 12:47:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 1/7][Resend] cpufreq: sched: Helpers to add and remove update_util hooks

On Tue, Mar 22, 2016 at 02:46:34AM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> Replace the single helper for adding and removing cpufreq utilization
> update hooks, cpufreq_set_update_util_data(), with a pair of helpers,
> cpufreq_add_update_util_hook() and cpufreq_remove_update_util_hook(),
> and modify the users of cpufreq_set_update_util_data() accordingly.
>
> With the new helpers, the code using them doesn't need to worry
> about the internals of struct update_util_data and in particular
> it doesn't need to worry about populating the func field in it
> properly upfront.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
>
> No changes since v4 (this patch appeared then).
>
> ---
> drivers/cpufreq/cpufreq_governor.c | 76 ++++++++++++++++++-------------------
> drivers/cpufreq/intel_pstate.c | 8 +--
> include/linux/sched.h | 5 +-
> kernel/sched/cpufreq.c | 48 ++++++++++++++++++-----
> 4 files changed, 83 insertions(+), 54 deletions(-)
>

Acked-by: Peter Zijlstra (Intel) <[email protected]>

2016-03-31 12:48:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [Update][PATCH v7 7/7] cpufreq: schedutil: New governor based on scheduler utilization data

On Wed, Mar 30, 2016 at 04:00:24AM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> Add a new cpufreq scaling governor, called "schedutil", that uses
> scheduler-provided CPU utilization information as input for making
> its decisions.
>
> Doing that is possible after commit 34e2c555f3e1 (cpufreq: Add
> mechanism for registering utilization update callbacks) that
> introduced cpufreq_update_util() called by the scheduler on
> utilization changes (from CFS) and RT/DL task status updates.
> In particular, CPU frequency scaling decisions may be based on
> the utilization data passed to cpufreq_update_util() by CFS.
>
> The new governor is relatively simple.
>
> The frequency selection formula used by it depends on whether or not
> the utilization is frequency-invariant. In the frequency-invariant
> case the new CPU frequency is given by
>
> next_freq = 1.25 * max_freq * util / max
>
> where util and max are the last two arguments of cpufreq_update_util().
> In turn, if util is not frequency-invariant, the maximum frequency in
> the above formula is replaced with the current frequency of the CPU:
>
> next_freq = 1.25 * curr_freq * util / max
>
> The coefficient 1.25 corresponds to the frequency tipping point at
> (util / max) = 0.8.
>
> All of the computations are carried out in the utilization update
> handlers provided by the new governor. One of those handlers is
> used for cpufreq policies shared between multiple CPUs and the other
> one is for policies with one CPU only (and therefore it doesn't need
> to use any extra synchronization means).
>
> The governor supports fast frequency switching if that is supported
> by the cpufreq driver in use and possible for the given policy.
> In the fast switching case, all operations of the governor take
> place in its utilization update handlers. If fast switching cannot
> be used, the frequency switch operations are carried out with the
> help of a work item which only calls __cpufreq_driver_target()
> (under a mutex) to trigger a frequency update (to a value already
> computed beforehand in one of the utilization update handlers).
>
> Currently, the governor treats all of the RT and DL tasks as
> "unknown utilization" and sets the frequency to the allowed
> maximum when updated from the RT or DL sched classes. That
> heavy-handed approach should be replaced with something more
> subtle and specifically targeted at RT and DL tasks.
>
> The governor shares some tunables management code with the
> "ondemand" and "conservative" governors and uses some common
> definitions from cpufreq_governor.h, but apart from that it
> is stand-alone.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
> drivers/cpufreq/Kconfig | 29 ++
> kernel/sched/Makefile | 1
> kernel/sched/cpufreq_schedutil.c | 528 +++++++++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 8
> 4 files changed, 566 insertions(+)

I think this is a good first step and we can definitely work from here;
afaict there are no (big) disagreements on the general approach, so

Acked-by: Peter Zijlstra (Intel) <[email protected]>

2016-04-01 17:49:52

by Steve Muckle

[permalink] [raw]
Subject: Re: [Update][PATCH v7 7/7] cpufreq: schedutil: New governor based on scheduler utilization data

On 03/29/2016 07:00 PM, Rafael J. Wysocki wrote:
...
> +config CPU_FREQ_GOV_SCHEDUTIL
> + tristate "'schedutil' cpufreq policy governor"
> + depends on CPU_FREQ
> + select CPU_FREQ_GOV_ATTR_SET
> + select IRQ_WORK
> + help
> + This governor makes decisions based on the utilization data provided
> + by the scheduler. It sets the CPU frequency to be proportional to
> + the utilization/capacity ratio coming from the scheduler. If the
> + utilization is frequency-invariant, the new frequency is also
> + proportional to the maximum available frequency. If that is not the
> + case, it is proportional to the current frequency of the CPU with the
> + tipping point at utilization/capacity equal to 80%.

This help text implies that the tipping point of 80% applies only to
non-frequency-invariant configurations, rather than both. Possible to
rephrase?

...
> +static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
> + unsigned long util, unsigned long max)
> +{
> + struct cpufreq_policy *policy = sg_policy->policy;
> + unsigned int max_f = policy->cpuinfo.max_freq;
> + u64 last_freq_update_time = sg_policy->last_freq_update_time;
> + unsigned int j;
> +
> + if (util == ULONG_MAX)
> + return max_f;
> +
> + for_each_cpu(j, policy->cpus) {
> + struct sugov_cpu *j_sg_cpu;
> + unsigned long j_util, j_max;
> + u64 delta_ns;
> +
> + if (j == smp_processor_id())
> + continue;
> +
> + j_sg_cpu = &per_cpu(sugov_cpu, j);
> + /*
> + * If the CPU utilization was last updated before the previous
> + * frequency update and the time elapsed between the last update
> + * of the CPU utilization and the last frequency update is long
> + * enough, don't take the CPU into account as it probably is
> + * idle now.
> + */
> + delta_ns = last_freq_update_time - j_sg_cpu->last_update;
> + if ((s64)delta_ns > TICK_NSEC)

>> Why not declare delta_ns as an s64 (also in sugov_should_update_freq)
>> and avoid the cast?
>
> I took this from __update_load_avg(), but it shouldn't matter here.

Did you mean to keep these casts?

thanks,
Steve

2016-04-01 18:15:11

by Steve Muckle

[permalink] [raw]
Subject: Re: [PATCH v6 7/7][Resend] cpufreq: schedutil: New governor based on scheduler utilization data

On 03/31/2016 05:32 AM, Rafael J. Wysocki wrote:
> On Thu, Mar 31, 2016 at 2:24 PM, Peter Zijlstra <[email protected]> wrote:
>> On Mon, Mar 28, 2016 at 11:17:44AM -0700, Steve Muckle wrote:
>>> The scenario I'm contemplating is that while a CPU-intensive task is
>>> running a thermal interrupt goes off. The driver for this thermal
>>> interrupt responds by capping fmax. If this happens just after the tick,
>>> it seems possible that we could wait a full tick before changing the
>>> frequency. Given a 10ms tick it could be rather annoying for thermal
>>> management algorithms on some platforms (I'm familiar with a few).
>>
>> So I'm blissfully unaware of all the thermal stuff we have, but it
>> looks like it's somehow bolted onto cpufreq without feedback.
>>
>> The thing I worry about is thermal scaling the CPU back past the point
>> where RT/DL tasks can still complete in time. It should not be able to
>> do that; missing deadlines because of thermal throttling is about as
>> useful as rebooting the device.

I'd agree that impacting RT/DL activity because of throttling may be as
bad as a reset, but that seems like the worst case. There could be some
graceful shutdown or notification/alarm that can be done, or a platform
can simply choose to reset.

Shouldn't we try to give the system designer the option of doing
something in software (by throttling the CPUs as low as necessary to
continue operation) rather than giving up and relying on a hardware reset?

> Right. If thermal throttling kicks in, the game is pretty much over.
>
> That's why ideas float about taking the thermal constraints into
> account upfront, but that's a different discussion entirely.

Current mainstream mobile platforms frequently throttle during normal
operation. I think it's important to have a robust throttling mechanism
at least until the more proactive thermal management scheme is fully
developed and proves to be equally capable (if and when that happens).

>> I guess what I'm saying is that the whole cpufreq/thermal 'interface'
>> needs work anyhow.
>
> Yes, it does.

Agreed!

thanks,
Steve

2016-04-01 19:14:48

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [Update][PATCH v7 7/7] cpufreq: schedutil: New governor based on scheduler utilization data

On Fri, Apr 1, 2016 at 7:49 PM, Steve Muckle <[email protected]> wrote:
> On 03/29/2016 07:00 PM, Rafael J. Wysocki wrote:
> ...
>> +config CPU_FREQ_GOV_SCHEDUTIL
>> + tristate "'schedutil' cpufreq policy governor"
>> + depends on CPU_FREQ
>> + select CPU_FREQ_GOV_ATTR_SET
>> + select IRQ_WORK
>> + help
>> + This governor makes decisions based on the utilization data provided
>> + by the scheduler. It sets the CPU frequency to be proportional to
>> + the utilization/capacity ratio coming from the scheduler. If the
>> + utilization is frequency-invariant, the new frequency is also
>> + proportional to the maximum available frequency. If that is not the
>> + case, it is proportional to the current frequency of the CPU with the
>> + tipping point at utilization/capacity equal to 80%.
>
> This help text implies that the tipping point of 80% applies only to
> non-frequency-invariant configurations, rather than both. Possible to
> rephrase?

Sure.

What about:

"If that is not the case, it is proportional to the current frequency
of the CPU. The frequency tipping point is at utilization/capacity
equal to 80% in both cases."

> ...
>> +static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
>> + unsigned long util, unsigned long max)
>> +{
>> + struct cpufreq_policy *policy = sg_policy->policy;
>> + unsigned int max_f = policy->cpuinfo.max_freq;
>> + u64 last_freq_update_time = sg_policy->last_freq_update_time;
>> + unsigned int j;
>> +
>> + if (util == ULONG_MAX)
>> + return max_f;
>> +
>> + for_each_cpu(j, policy->cpus) {
>> + struct sugov_cpu *j_sg_cpu;
>> + unsigned long j_util, j_max;
>> + u64 delta_ns;
>> +
>> + if (j == smp_processor_id())
>> + continue;
>> +
>> + j_sg_cpu = &per_cpu(sugov_cpu, j);
>> + /*
>> + * If the CPU utilization was last updated before the previous
>> + * frequency update and the time elapsed between the last update
>> + * of the CPU utilization and the last frequency update is long
>> + * enough, don't take the CPU into account as it probably is
>> + * idle now.
>> + */
>> + delta_ns = last_freq_update_time - j_sg_cpu->last_update;
>> + if ((s64)delta_ns > TICK_NSEC)
>
>>> Why not declare delta_ns as an s64 (also in sugov_should_update_freq)
>>> and avoid the cast?
>>
>> I took this from __update_load_avg(), but it shouldn't matter here.
>
> Did you mean to keep these casts?

Not really. I'll fix that up shortly.

2016-04-01 19:23:12

by Steve Muckle

[permalink] [raw]
Subject: Re: [Update][PATCH v7 7/7] cpufreq: schedutil: New governor based on scheduler utilization data

On 04/01/2016 12:14 PM, Rafael J. Wysocki wrote:
> On Fri, Apr 1, 2016 at 7:49 PM, Steve Muckle <[email protected]> wrote:
>> On 03/29/2016 07:00 PM, Rafael J. Wysocki wrote:
>> ...
>>> +config CPU_FREQ_GOV_SCHEDUTIL
>>> + tristate "'schedutil' cpufreq policy governor"
>>> + depends on CPU_FREQ
>>> + select CPU_FREQ_GOV_ATTR_SET
>>> + select IRQ_WORK
>>> + help
>>> + This governor makes decisions based on the utilization data provided
>>> + by the scheduler. It sets the CPU frequency to be proportional to
>>> + the utilization/capacity ratio coming from the scheduler. If the
>>> + utilization is frequency-invariant, the new frequency is also
>>> + proportional to the maximum available frequency. If that is not the
>>> + case, it is proportional to the current frequency of the CPU with the
>>> + tipping point at utilization/capacity equal to 80%.
>>
>> This help text implies that the tipping point of 80% applies only to
>> non-frequency invariant configurations, rather than both. Possible to
>> rephrase?
>
> Sure.
>
> What about:
>
> "If that is not the case, it is proportional to the current frequency
> of the CPU. The frequency tipping point is at utilization/capacity
> equal to 80% in both cases."

LGTM

thanks,
Steve

2016-04-01 23:04:03

by Rafael J. Wysocki

[permalink] [raw]
Subject: [Update][PATCH v8 7/7] cpufreq: schedutil: New governor based on scheduler utilization data

From: Rafael J. Wysocki <[email protected]>

Add a new cpufreq scaling governor, called "schedutil", that uses
scheduler-provided CPU utilization information as input for making
its decisions.

Doing that is possible after commit 34e2c555f3e1 (cpufreq: Add
mechanism for registering utilization update callbacks) that
introduced cpufreq_update_util() called by the scheduler on
utilization changes (from CFS) and RT/DL task status updates.
In particular, CPU frequency scaling decisions may be based on
the utilization data passed to cpufreq_update_util() by CFS.

The new governor is relatively simple.

The frequency selection formula used by it depends on whether or not
the utilization is frequency-invariant. In the frequency-invariant
case the new CPU frequency is given by

next_freq = 1.25 * max_freq * util / max

where util and max are the last two arguments of cpufreq_update_util().
In turn, if util is not frequency-invariant, the maximum frequency in
the above formula is replaced with the current frequency of the CPU:

next_freq = 1.25 * curr_freq * util / max

The coefficient 1.25 corresponds to the frequency tipping point at
(util / max) = 0.8.

All of the computations are carried out in the utilization update
handlers provided by the new governor. One of those handlers is
used for cpufreq policies shared between multiple CPUs and the other
one is for policies with one CPU only (and therefore it doesn't need
to use any extra synchronization means).

The governor supports fast frequency switching if that is supported
by the cpufreq driver in use and possible for the given policy.
In the fast switching case, all operations of the governor take
place in its utilization update handlers. If fast switching cannot
be used, the frequency switch operations are carried out with the
help of a work item which only calls __cpufreq_driver_target()
(under a mutex) to trigger a frequency update (to a value already
computed beforehand in one of the utilization update handlers).

Currently, the governor treats all of the RT and DL tasks as
"unknown utilization" and sets the frequency to the allowed
maximum when updated from the RT or DL sched classes. That
heavy-handed approach should be replaced with something more
subtle and specifically targeted at RT and DL tasks.

The governor shares some tunables management code with the
"ondemand" and "conservative" governors and uses some common
definitions from cpufreq_governor.h, but apart from that it
is stand-alone.

Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---

Addressing Steve's comments and a build failure in linux-next.

Changes from v7:
- Update Kconfig help text.
- Change the data type of delta_ns to s64 and drop explicit casts to that
in sugov_should_update_freq() and sugov_next_freq_shared().
- Add cpu_frequency to the list of exported power tracepoints which is
needed for schedutil to be built as a module.

Changes from v6:
- Rebased on top of 4.6-rc1.
- Fixed the help text in Kconfig.
- sugov_should_update_freq() sets sg_policy->next_freq to UINT_MAX if
need_freq_update is set to enforce a frequency update (even if the new
frequency would be equal to the previously requested one).
- Dropped the limits check from sugov_update_commit() as
cpufreq_driver_fast_switch() applies the limits to the target frequency now.
- rate_limit_us_store() uses kstrtouint() to get the new tunable value.
- sugov_tunables_alloc() sets global_tunables (if necessary and possible).
- sugov_init() calls cpufreq_enable_fast_switch() and was rearranged a bit.

Changes from v5:
- Fixed sugov_update_commit() to set sg_policy->next_freq properly
in the "work item" branch.
- Used smp_processor_id() in sugov_irq_work() and restored work_in_progress.

Changes from v4:
- Use TICK_NSEC in sugov_next_freq_shared().
- Use schedule_work_on() to schedule work items and replace
work_in_progress with work_cpu (which is used both for scheduling
work items and as a "work in progress" marker).
- Rearrange sugov_update_commit() to only check policy->min/max if
fast switching is enabled.
- Replace util > max checks with util == ULONG_MAX checks to make
it clear that they are about a special case (RT/DL).

Changes from v3:
- The "next frequency" formula based on
http://marc.info/?l=linux-acpi&m=145756618321500&w=4 and
http://marc.info/?l=linux-kernel&m=145760739700716&w=4
- The governor goes into kernel/sched/ (again).

Changes from v2:
- The governor goes into drivers/cpufreq/.
- The "next frequency" formula has an additional 1.1 factor to allow
more util/max values to map onto the top-most frequency in case the
distance between that and the previous one is disproportionately small.
- sugov_update_commit() traces CPU frequency even if the new one is
the same as the previous one (otherwise, if the system is 100% loaded
for long enough, powertop starts to report that all CPUs are 100% idle).

---
drivers/cpufreq/Kconfig | 30 ++
kernel/sched/Makefile | 1
kernel/sched/cpufreq_schedutil.c | 528 +++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 8
kernel/trace/power-traces.c | 1
5 files changed, 568 insertions(+)

Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -107,6 +107,16 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
Be aware that not all cpufreq drivers support the conservative
governor. If unsure have a look at the help section of the
driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+ bool "schedutil"
+ select CPU_FREQ_GOV_SCHEDUTIL
+ select CPU_FREQ_GOV_PERFORMANCE
+ help
+ Use the 'schedutil' CPUFreq governor by default. If unsure,
+ have a look at the help section of that governor. The fallback
+ governor will be 'performance'.
+
endchoice

config CPU_FREQ_GOV_PERFORMANCE
@@ -188,6 +198,26 @@ config CPU_FREQ_GOV_CONSERVATIVE

If in doubt, say N.

+config CPU_FREQ_GOV_SCHEDUTIL
+ tristate "'schedutil' cpufreq policy governor"
+ depends on CPU_FREQ
+ select CPU_FREQ_GOV_ATTR_SET
+ select IRQ_WORK
+ help
+ This governor makes decisions based on the utilization data provided
+ by the scheduler. It sets the CPU frequency to be proportional to
+ the utilization/capacity ratio coming from the scheduler. If the
+ utilization is frequency-invariant, the new frequency is also
+ proportional to the maximum available frequency. If that is not the
+ case, it is proportional to the current frequency of the CPU. The
+ frequency tipping point is at utilization/capacity equal to 80% in
+ both cases.
+
+ To compile this driver as a module, choose M here: the module will
+ be called cpufreq_schedutil.
+
+ If in doubt, say N.
+
comment "CPU frequency scaling drivers"

config CPUFREQ_DT
Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- /dev/null
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -0,0 +1,528 @@
+/*
+ * CPUFreq governor based on scheduler-provided CPU utilization data.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <trace/events/power.h>
+
+#include "sched.h"
+
+struct sugov_tunables {
+ struct gov_attr_set attr_set;
+ unsigned int rate_limit_us;
+};
+
+struct sugov_policy {
+ struct cpufreq_policy *policy;
+
+ struct sugov_tunables *tunables;
+ struct list_head tunables_hook;
+
+ raw_spinlock_t update_lock; /* For shared policies */
+ u64 last_freq_update_time;
+ s64 freq_update_delay_ns;
+ unsigned int next_freq;
+
+ /* The next fields are only needed if fast switch cannot be used. */
+ struct irq_work irq_work;
+ struct work_struct work;
+ struct mutex work_lock;
+ bool work_in_progress;
+
+ bool need_freq_update;
+};
+
+struct sugov_cpu {
+ struct update_util_data update_util;
+ struct sugov_policy *sg_policy;
+
+ /* The fields below are only needed when sharing a policy. */
+ unsigned long util;
+ unsigned long max;
+ u64 last_update;
+};
+
+static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
+
+/************************ Governor internals ***********************/
+
+static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
+{
+ s64 delta_ns;
+
+ if (sg_policy->work_in_progress)
+ return false;
+
+ if (unlikely(sg_policy->need_freq_update)) {
+ sg_policy->need_freq_update = false;
+ /*
+ * This happens when limits change, so forget the previous
+ * next_freq value and force an update.
+ */
+ sg_policy->next_freq = UINT_MAX;
+ return true;
+ }
+
+ delta_ns = time - sg_policy->last_freq_update_time;
+ return delta_ns >= sg_policy->freq_update_delay_ns;
+}
+
+static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
+ unsigned int next_freq)
+{
+ struct cpufreq_policy *policy = sg_policy->policy;
+
+ sg_policy->last_freq_update_time = time;
+
+ if (policy->fast_switch_enabled) {
+ if (sg_policy->next_freq == next_freq) {
+ trace_cpu_frequency(policy->cur, smp_processor_id());
+ return;
+ }
+ sg_policy->next_freq = next_freq;
+ next_freq = cpufreq_driver_fast_switch(policy, next_freq);
+ if (next_freq == CPUFREQ_ENTRY_INVALID)
+ return;
+
+ policy->cur = next_freq;
+ trace_cpu_frequency(next_freq, smp_processor_id());
+ } else if (sg_policy->next_freq != next_freq) {
+ sg_policy->next_freq = next_freq;
+ sg_policy->work_in_progress = true;
+ irq_work_queue(&sg_policy->irq_work);
+ }
+}
+
+/**
+ * get_next_freq - Compute a new frequency for a given cpufreq policy.
+ * @policy: cpufreq policy object to compute the new frequency for.
+ * @util: Current CPU utilization.
+ * @max: CPU capacity.
+ *
+ * If the utilization is frequency-invariant, choose the new frequency to be
+ * proportional to it, that is
+ *
+ * next_freq = C * max_freq * util / max
+ *
+ * Otherwise, approximate the would-be frequency-invariant utilization by
+ * util_raw * (curr_freq / max_freq) which leads to
+ *
+ * next_freq = C * curr_freq * util_raw / max
+ *
+ * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8.
+ */
+static unsigned int get_next_freq(struct cpufreq_policy *policy,
+ unsigned long util, unsigned long max)
+{
+ unsigned int freq = arch_scale_freq_invariant() ?
+ policy->cpuinfo.max_freq : policy->cur;
+
+ return (freq + (freq >> 2)) * util / max;
+}
+
+static void sugov_update_single(struct update_util_data *hook, u64 time,
+ unsigned long util, unsigned long max)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ struct cpufreq_policy *policy = sg_policy->policy;
+ unsigned int next_f;
+
+ if (!sugov_should_update_freq(sg_policy, time))
+ return;
+
+ next_f = util == ULONG_MAX ? policy->cpuinfo.max_freq :
+ get_next_freq(policy, util, max);
+ sugov_update_commit(sg_policy, time, next_f);
+}
+
+static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
+ unsigned long util, unsigned long max)
+{
+ struct cpufreq_policy *policy = sg_policy->policy;
+ unsigned int max_f = policy->cpuinfo.max_freq;
+ u64 last_freq_update_time = sg_policy->last_freq_update_time;
+ unsigned int j;
+
+ if (util == ULONG_MAX)
+ return max_f;
+
+ for_each_cpu(j, policy->cpus) {
+ struct sugov_cpu *j_sg_cpu;
+ unsigned long j_util, j_max;
+ s64 delta_ns;
+
+ if (j == smp_processor_id())
+ continue;
+
+ j_sg_cpu = &per_cpu(sugov_cpu, j);
+ /*
+ * If the CPU utilization was last updated before the previous
+ * frequency update and the time elapsed between the last update
+ * of the CPU utilization and the last frequency update is long
+ * enough, don't take the CPU into account as it probably is
+ * idle now.
+ */
+ delta_ns = last_freq_update_time - j_sg_cpu->last_update;
+ if (delta_ns > TICK_NSEC)
+ continue;
+
+ j_util = j_sg_cpu->util;
+ if (j_util == ULONG_MAX)
+ return max_f;
+
+ j_max = j_sg_cpu->max;
+ if (j_util * max > j_max * util) {
+ util = j_util;
+ max = j_max;
+ }
+ }
+
+ return get_next_freq(policy, util, max);
+}
+
+static void sugov_update_shared(struct update_util_data *hook, u64 time,
+ unsigned long util, unsigned long max)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ unsigned int next_f;
+
+ raw_spin_lock(&sg_policy->update_lock);
+
+ sg_cpu->util = util;
+ sg_cpu->max = max;
+ sg_cpu->last_update = time;
+
+ if (sugov_should_update_freq(sg_policy, time)) {
+ next_f = sugov_next_freq_shared(sg_policy, util, max);
+ sugov_update_commit(sg_policy, time, next_f);
+ }
+
+ raw_spin_unlock(&sg_policy->update_lock);
+}
+
+static void sugov_work(struct work_struct *work)
+{
+ struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
+
+ mutex_lock(&sg_policy->work_lock);
+ __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
+ CPUFREQ_RELATION_L);
+ mutex_unlock(&sg_policy->work_lock);
+
+ sg_policy->work_in_progress = false;
+}
+
+static void sugov_irq_work(struct irq_work *irq_work)
+{
+ struct sugov_policy *sg_policy;
+
+ sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
+ schedule_work_on(smp_processor_id(), &sg_policy->work);
+}
+
+/************************** sysfs interface ************************/
+
+static struct sugov_tunables *global_tunables;
+static DEFINE_MUTEX(global_tunables_lock);
+
+static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set)
+{
+ return container_of(attr_set, struct sugov_tunables, attr_set);
+}
+
+static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+
+ return sprintf(buf, "%u\n", tunables->rate_limit_us);
+}
+
+static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf,
+ size_t count)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+ struct sugov_policy *sg_policy;
+ unsigned int rate_limit_us;
+
+ if (kstrtouint(buf, 10, &rate_limit_us))
+ return -EINVAL;
+
+ tunables->rate_limit_us = rate_limit_us;
+
+ list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
+ sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
+
+ return count;
+}
+
+static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
+
+static struct attribute *sugov_attributes[] = {
+ &rate_limit_us.attr,
+ NULL
+};
+
+static struct kobj_type sugov_tunables_ktype = {
+ .default_attrs = sugov_attributes,
+ .sysfs_ops = &governor_sysfs_ops,
+};
+
+/********************** cpufreq governor interface *********************/
+
+static struct cpufreq_governor schedutil_gov;
+
+static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy;
+
+ sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
+ if (!sg_policy)
+ return NULL;
+
+ sg_policy->policy = policy;
+ init_irq_work(&sg_policy->irq_work, sugov_irq_work);
+ INIT_WORK(&sg_policy->work, sugov_work);
+ mutex_init(&sg_policy->work_lock);
+ raw_spin_lock_init(&sg_policy->update_lock);
+ return sg_policy;
+}
+
+static void sugov_policy_free(struct sugov_policy *sg_policy)
+{
+ mutex_destroy(&sg_policy->work_lock);
+ kfree(sg_policy);
+}
+
+static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy)
+{
+ struct sugov_tunables *tunables;
+
+ tunables = kzalloc(sizeof(*tunables), GFP_KERNEL);
+ if (tunables) {
+ gov_attr_set_init(&tunables->attr_set, &sg_policy->tunables_hook);
+ if (!have_governor_per_policy())
+ global_tunables = tunables;
+ }
+ return tunables;
+}
+
+static void sugov_tunables_free(struct sugov_tunables *tunables)
+{
+ if (!have_governor_per_policy())
+ global_tunables = NULL;
+
+ kfree(tunables);
+}
+
+static int sugov_init(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy;
+ struct sugov_tunables *tunables;
+ unsigned int lat;
+ int ret = 0;
+
+ /* State should be equivalent to EXIT */
+ if (policy->governor_data)
+ return -EBUSY;
+
+ sg_policy = sugov_policy_alloc(policy);
+ if (!sg_policy)
+ return -ENOMEM;
+
+ mutex_lock(&global_tunables_lock);
+
+ if (global_tunables) {
+ if (WARN_ON(have_governor_per_policy())) {
+ ret = -EINVAL;
+ goto free_sg_policy;
+ }
+ policy->governor_data = sg_policy;
+ sg_policy->tunables = global_tunables;
+
+ gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook);
+ goto out;
+ }
+
+ tunables = sugov_tunables_alloc(sg_policy);
+ if (!tunables) {
+ ret = -ENOMEM;
+ goto free_sg_policy;
+ }
+
+ tunables->rate_limit_us = LATENCY_MULTIPLIER;
+ lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
+ if (lat)
+ tunables->rate_limit_us *= lat;
+
+ policy->governor_data = sg_policy;
+ sg_policy->tunables = tunables;
+
+ ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
+ get_governor_parent_kobj(policy), "%s",
+ schedutil_gov.name);
+ if (ret)
+ goto fail;
+
+ out:
+ mutex_unlock(&global_tunables_lock);
+
+ cpufreq_enable_fast_switch(policy);
+ return 0;
+
+ fail:
+ policy->governor_data = NULL;
+ sugov_tunables_free(tunables);
+
+ free_sg_policy:
+ mutex_unlock(&global_tunables_lock);
+
+ sugov_policy_free(sg_policy);
+ pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
+ return ret;
+}
+
+static int sugov_exit(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ struct sugov_tunables *tunables = sg_policy->tunables;
+ unsigned int count;
+
+ mutex_lock(&global_tunables_lock);
+
+ count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook);
+ policy->governor_data = NULL;
+ if (!count)
+ sugov_tunables_free(tunables);
+
+ mutex_unlock(&global_tunables_lock);
+
+ sugov_policy_free(sg_policy);
+ return 0;
+}
+
+static int sugov_start(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ unsigned int cpu;
+
+ sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+ sg_policy->last_freq_update_time = 0;
+ sg_policy->next_freq = UINT_MAX;
+ sg_policy->work_in_progress = false;
+ sg_policy->need_freq_update = false;
+
+ for_each_cpu(cpu, policy->cpus) {
+ struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
+
+ sg_cpu->sg_policy = sg_policy;
+ if (policy_is_shared(policy)) {
+ sg_cpu->util = ULONG_MAX;
+ sg_cpu->max = 0;
+ sg_cpu->last_update = 0;
+ cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
+ sugov_update_shared);
+ } else {
+ cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
+ sugov_update_single);
+ }
+ }
+ return 0;
+}
+
+static int sugov_stop(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ unsigned int cpu;
+
+ for_each_cpu(cpu, policy->cpus)
+ cpufreq_remove_update_util_hook(cpu);
+
+ synchronize_sched();
+
+ irq_work_sync(&sg_policy->irq_work);
+ cancel_work_sync(&sg_policy->work);
+ return 0;
+}
+
+static int sugov_limits(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+
+ if (!policy->fast_switch_enabled) {
+ mutex_lock(&sg_policy->work_lock);
+
+ if (policy->max < policy->cur)
+ __cpufreq_driver_target(policy, policy->max,
+ CPUFREQ_RELATION_H);
+ else if (policy->min > policy->cur)
+ __cpufreq_driver_target(policy, policy->min,
+ CPUFREQ_RELATION_L);
+
+ mutex_unlock(&sg_policy->work_lock);
+ }
+
+ sg_policy->need_freq_update = true;
+ return 0;
+}
+
+int sugov_governor(struct cpufreq_policy *policy, unsigned int event)
+{
+ if (event == CPUFREQ_GOV_POLICY_INIT) {
+ return sugov_init(policy);
+ } else if (policy->governor_data) {
+ switch (event) {
+ case CPUFREQ_GOV_POLICY_EXIT:
+ return sugov_exit(policy);
+ case CPUFREQ_GOV_START:
+ return sugov_start(policy);
+ case CPUFREQ_GOV_STOP:
+ return sugov_stop(policy);
+ case CPUFREQ_GOV_LIMITS:
+ return sugov_limits(policy);
+ }
+ }
+ return -EINVAL;
+}
+
+static struct cpufreq_governor schedutil_gov = {
+ .name = "schedutil",
+ .governor = sugov_governor,
+ .owner = THIS_MODULE,
+};
+
+static int __init sugov_module_init(void)
+{
+ return cpufreq_register_governor(&schedutil_gov);
+}
+
+static void __exit sugov_module_exit(void)
+{
+ cpufreq_unregister_governor(&schedutil_gov);
+}
+
+MODULE_AUTHOR("Rafael J. Wysocki <[email protected]>");
+MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
+MODULE_LICENSE("GPL");
+
+#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+struct cpufreq_governor *cpufreq_default_governor(void)
+{
+ return &schedutil_gov;
+}
+
+fs_initcall(sugov_module_init);
+#else
+module_init(sugov_module_init);
+#endif
+module_exit(sugov_module_exit);
Index: linux-pm/kernel/sched/Makefile
===================================================================
--- linux-pm.orig/kernel/sched/Makefile
+++ linux-pm/kernel/sched/Makefile
@@ -24,3 +24,4 @@ obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
Index: linux-pm/kernel/sched/sched.h
===================================================================
--- linux-pm.orig/kernel/sched/sched.h
+++ linux-pm/kernel/sched/sched.h
@@ -1842,6 +1842,14 @@ static inline void cpufreq_update_util(u
static inline void cpufreq_trigger_update(u64 time) {}
#endif /* CONFIG_CPU_FREQ */

+#ifdef arch_scale_freq_capacity
+#ifndef arch_scale_freq_invariant
+#define arch_scale_freq_invariant() (true)
+#endif
+#else /* arch_scale_freq_capacity */
+#define arch_scale_freq_invariant() (false)
+#endif
+
static inline void account_reset_rq(struct rq *rq)
{
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
Index: linux-pm/kernel/trace/power-traces.c
===================================================================
--- linux-pm.orig/kernel/trace/power-traces.c
+++ linux-pm/kernel/trace/power-traces.c
@@ -15,5 +15,6 @@

EXPORT_TRACEPOINT_SYMBOL_GPL(suspend_resume);
EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
+EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_frequency);
EXPORT_TRACEPOINT_SYMBOL_GPL(powernv_throttle);