Hi,
My previous intro message still applies somewhat, so here's a link:
http://marc.info/?l=linux-pm&m=145609673008122&w=2
The executive summary of the motivation is that I wanted to do two things:
use the utilization data from the scheduler (it's passed to the governor
as arguments of update callbacks anyway) and make it possible to set
CPU frequency without involving process context (fast frequency switching).
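For reference, the shape of those update callbacks (this is just a sketch of
the hook introduced by the earlier cpufreq_update_util() work and used by the
last patch in this series) is roughly:

	struct update_util_data {
		void (*func)(struct update_util_data *data, u64 time,
			     unsigned long util, unsigned long max);
	};

so a governor hooking into it gets the time of the update along with the
util/max pair straight from the scheduler.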
Both have been prototyped in the previous RFCs:
https://patchwork.kernel.org/patch/8426691/
https://patchwork.kernel.org/patch/8426741/
but in the meantime I found a couple of issues in there.
First off, the common governor code relied on by the previous version reset
the sample delay to 0 in order to force an immediate frequency update. That
doesn't work with the new governor, though, because it computes the frequency
to set in a cpufreq_update_util() callback and (when fast switching is not
used) passes that to a work item which sets the frequency and then restores
the sample delay. Thus, if a sysfs write resets the sample delay to 0 while
work_in_progress is set, the work item will overwrite that 0 when it restores
the sample delay, so the forced update is effectively discarded.
When using fast switching, the previous version would update the sample delay
from a scheduler path, but on a 32-bit system that might clash with a concurrent
update from sysfs, because the 64-bit sample delay value cannot be updated
atomically there, and the result might be completely off. The resulting value
would be less than the correct sample delay (I think), so in practice that
shouldn't matter much, but it still isn't nice.
The above means that schedutil cannot really share as much code as I thought it
could with "ondemand" and "conservative".
Moreover, I wanted to have a "rate_limit" tunable (instead of the sampling rate,
whose name doesn't mean what it suggests in schedutil), but that would be the
only tunable used by schedutil. So I ended up having to define a new struct,
pointed to from struct dbs_data, just to hold that single value, and I would
have needed to define ->init() and ->exit() callbacks for the governor for that
reason alone (while the common tunables in struct dbs_data wouldn't be used at all).
Not to mention the fact that the majority of the common governor code is not
really used by schedutil anyway.
Taking the above into account, I decided to decouple schedutil from the other
governors, but I wanted to avoid duplicating some of the tunables manipulation
code. Hence patches [3-4/6], which move that code into a separate file so that
schedutil can use it too without pulling in the rest of the common "ondemand"
and "conservative" code.
Patch [5/6] adds support for fast switching to the core and the ACPI driver,
but doesn't hook it up to anything useful. That is done in the last patch
that actually adds the new governor.
That depends on two patches I sent previously: [1/6], which makes
cpufreq_update_util() use RCU-sched (one change from the previous version,
as requested by Peter), and [2/6], which reworks acpi-cpufreq so that the
fast switching (added later in patch [5/6]) can work with all of the
frequency-setting methods the driver may use.
Comments welcome.
Thanks,
Rafael
From: Rafael J. Wysocki <[email protected]>
Setting a new CPU frequency and reading the current request value
in the ACPI cpufreq driver each involve at least two switch
statements (there are more if the policy is shared). One of
them is in drv_read/write(), which prepares a command structure,
and the other happens in the subsequent do_drv_read/write() when
that structure is interpreted. However, all of those switches
may be avoided by using function pointers.
To that end, add two function pointers to struct acpi_cpufreq_data
to represent read and write operations on the frequency register
and set them up during policy initialization to point to the pair
of routines suitable for the given processor (Intel/AMD MSR access
or I/O port access). Then, use those pointers in do_drv_read/write()
and modify drv_read/write() to prepare the command structure for
them without any checks.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
drivers/cpufreq/acpi-cpufreq.c | 208 ++++++++++++++++++-----------------------
1 file changed, 95 insertions(+), 113 deletions(-)
Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
+++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
@@ -70,6 +70,8 @@ struct acpi_cpufreq_data {
unsigned int cpu_feature;
unsigned int acpi_perf_cpu;
cpumask_var_t freqdomain_cpus;
+ void (*cpu_freq_write)(struct acpi_pct_register *reg, u32 val);
+ u32 (*cpu_freq_read)(struct acpi_pct_register *reg);
};
/* acpi_perf_data is a pointer to percpu data. */
@@ -243,125 +245,119 @@ static unsigned extract_freq(u32 val, st
}
}
-struct msr_addr {
- u32 reg;
-};
+u32 cpu_freq_read_intel(struct acpi_pct_register *not_used)
+{
+ u32 val, dummy;
-struct io_addr {
- u16 port;
- u8 bit_width;
-};
+ rdmsr(MSR_IA32_PERF_CTL, val, dummy);
+ return val;
+}
+
+void cpu_freq_write_intel(struct acpi_pct_register *not_used, u32 val)
+{
+ u32 lo, hi;
+
+ rdmsr(MSR_IA32_PERF_CTL, lo, hi);
+ lo = (lo & ~INTEL_MSR_RANGE) | (val & INTEL_MSR_RANGE);
+ wrmsr(MSR_IA32_PERF_CTL, lo, hi);
+}
+
+u32 cpu_freq_read_amd(struct acpi_pct_register *not_used)
+{
+ u32 val, dummy;
+
+ rdmsr(MSR_AMD_PERF_CTL, val, dummy);
+ return val;
+}
+
+void cpu_freq_write_amd(struct acpi_pct_register *not_used, u32 val)
+{
+ wrmsr(MSR_AMD_PERF_CTL, val, 0);
+}
+
+u32 cpu_freq_read_io(struct acpi_pct_register *reg)
+{
+ u32 val;
+
+ acpi_os_read_port(reg->address, &val, reg->bit_width);
+ return val;
+}
+
+void cpu_freq_write_io(struct acpi_pct_register *reg, u32 val)
+{
+ acpi_os_write_port(reg->address, val, reg->bit_width);
+}
struct drv_cmd {
- unsigned int type;
- const struct cpumask *mask;
- union {
- struct msr_addr msr;
- struct io_addr io;
- } addr;
+ struct acpi_pct_register *reg;
u32 val;
+ union {
+ void (*write)(struct acpi_pct_register *reg, u32 val);
+ u32 (*read)(struct acpi_pct_register *reg);
+ } func;
};
/* Called via smp_call_function_single(), on the target CPU */
static void do_drv_read(void *_cmd)
{
struct drv_cmd *cmd = _cmd;
- u32 h;
- switch (cmd->type) {
- case SYSTEM_INTEL_MSR_CAPABLE:
- case SYSTEM_AMD_MSR_CAPABLE:
- rdmsr(cmd->addr.msr.reg, cmd->val, h);
- break;
- case SYSTEM_IO_CAPABLE:
- acpi_os_read_port((acpi_io_address)cmd->addr.io.port,
- &cmd->val,
- (u32)cmd->addr.io.bit_width);
- break;
- default:
- break;
- }
+ cmd->val = cmd->func.read(cmd->reg);
}
-/* Called via smp_call_function_many(), on the target CPUs */
-static void do_drv_write(void *_cmd)
+static u32 drv_read(struct acpi_cpufreq_data *data, const struct cpumask *mask)
{
- struct drv_cmd *cmd = _cmd;
- u32 lo, hi;
+ struct acpi_processor_performance *perf = to_perf_data(data);
+ struct drv_cmd cmd = {
+ .reg = &perf->control_register,
+ .func.read = data->cpu_freq_read,
+ };
+ int err;
- switch (cmd->type) {
- case SYSTEM_INTEL_MSR_CAPABLE:
- rdmsr(cmd->addr.msr.reg, lo, hi);
- lo = (lo & ~INTEL_MSR_RANGE) | (cmd->val & INTEL_MSR_RANGE);
- wrmsr(cmd->addr.msr.reg, lo, hi);
- break;
- case SYSTEM_AMD_MSR_CAPABLE:
- wrmsr(cmd->addr.msr.reg, cmd->val, 0);
- break;
- case SYSTEM_IO_CAPABLE:
- acpi_os_write_port((acpi_io_address)cmd->addr.io.port,
- cmd->val,
- (u32)cmd->addr.io.bit_width);
- break;
- default:
- break;
- }
+ err = smp_call_function_any(mask, do_drv_read, &cmd, 1);
+ WARN_ON_ONCE(err); /* smp_call_function_any() was buggy? */
+ return cmd.val;
}
-static void drv_read(struct drv_cmd *cmd)
+/* Called via smp_call_function_many(), on the target CPUs */
+static void do_drv_write(void *_cmd)
{
- int err;
- cmd->val = 0;
+ struct drv_cmd *cmd = _cmd;
- err = smp_call_function_any(cmd->mask, do_drv_read, cmd, 1);
- WARN_ON_ONCE(err); /* smp_call_function_any() was buggy? */
+ cmd->func.write(cmd->reg, cmd->val);
}
-static void drv_write(struct drv_cmd *cmd)
+static void drv_write(struct acpi_cpufreq_data *data,
+ const struct cpumask *mask, u32 val)
{
+ struct acpi_processor_performance *perf = to_perf_data(data);
+ struct drv_cmd cmd = {
+ .reg = &perf->control_register,
+ .val = val,
+ .func.write = data->cpu_freq_write,
+ };
int this_cpu;
this_cpu = get_cpu();
- if (cpumask_test_cpu(this_cpu, cmd->mask))
- do_drv_write(cmd);
- smp_call_function_many(cmd->mask, do_drv_write, cmd, 1);
+ if (cpumask_test_cpu(this_cpu, mask))
+ do_drv_write(&cmd);
+
+ smp_call_function_many(mask, do_drv_write, &cmd, 1);
put_cpu();
}
-static u32
-get_cur_val(const struct cpumask *mask, struct acpi_cpufreq_data *data)
+static u32 get_cur_val(const struct cpumask *mask, struct acpi_cpufreq_data *data)
{
- struct acpi_processor_performance *perf;
- struct drv_cmd cmd;
+ u32 val;
if (unlikely(cpumask_empty(mask)))
return 0;
- switch (data->cpu_feature) {
- case SYSTEM_INTEL_MSR_CAPABLE:
- cmd.type = SYSTEM_INTEL_MSR_CAPABLE;
- cmd.addr.msr.reg = MSR_IA32_PERF_CTL;
- break;
- case SYSTEM_AMD_MSR_CAPABLE:
- cmd.type = SYSTEM_AMD_MSR_CAPABLE;
- cmd.addr.msr.reg = MSR_AMD_PERF_CTL;
- break;
- case SYSTEM_IO_CAPABLE:
- cmd.type = SYSTEM_IO_CAPABLE;
- perf = to_perf_data(data);
- cmd.addr.io.port = perf->control_register.address;
- cmd.addr.io.bit_width = perf->control_register.bit_width;
- break;
- default:
- return 0;
- }
-
- cmd.mask = mask;
- drv_read(&cmd);
+ val = drv_read(data, mask);
- pr_debug("get_cur_val = %u\n", cmd.val);
+ pr_debug("get_cur_val = %u\n", val);
- return cmd.val;
+ return val;
}
static unsigned int get_cur_freq_on_cpu(unsigned int cpu)
@@ -416,7 +412,7 @@ static int acpi_cpufreq_target(struct cp
{
struct acpi_cpufreq_data *data = policy->driver_data;
struct acpi_processor_performance *perf;
- struct drv_cmd cmd;
+ const struct cpumask *mask;
unsigned int next_perf_state = 0; /* Index into perf table */
int result = 0;
@@ -438,37 +434,17 @@ static int acpi_cpufreq_target(struct cp
}
}
- switch (data->cpu_feature) {
- case SYSTEM_INTEL_MSR_CAPABLE:
- cmd.type = SYSTEM_INTEL_MSR_CAPABLE;
- cmd.addr.msr.reg = MSR_IA32_PERF_CTL;
- cmd.val = (u32) perf->states[next_perf_state].control;
- break;
- case SYSTEM_AMD_MSR_CAPABLE:
- cmd.type = SYSTEM_AMD_MSR_CAPABLE;
- cmd.addr.msr.reg = MSR_AMD_PERF_CTL;
- cmd.val = (u32) perf->states[next_perf_state].control;
- break;
- case SYSTEM_IO_CAPABLE:
- cmd.type = SYSTEM_IO_CAPABLE;
- cmd.addr.io.port = perf->control_register.address;
- cmd.addr.io.bit_width = perf->control_register.bit_width;
- cmd.val = (u32) perf->states[next_perf_state].control;
- break;
- default:
- return -ENODEV;
- }
-
- /* cpufreq holds the hotplug lock, so we are safe from here on */
- if (policy->shared_type != CPUFREQ_SHARED_TYPE_ANY)
- cmd.mask = policy->cpus;
- else
- cmd.mask = cpumask_of(policy->cpu);
+ /*
+ * The core won't allow CPUs to go away until the governor has been
+ * stopped, so we can rely on the stability of policy->cpus.
+ */
+ mask = policy->shared_type == CPUFREQ_SHARED_TYPE_ANY ?
+ cpumask_of(policy->cpu) : policy->cpus;
- drv_write(&cmd);
+ drv_write(data, mask, perf->states[next_perf_state].control);
if (acpi_pstate_strict) {
- if (!check_freqs(cmd.mask, data->freq_table[index].frequency,
+ if (!check_freqs(mask, data->freq_table[index].frequency,
data)) {
pr_debug("acpi_cpufreq_target failed (%d)\n",
policy->cpu);
@@ -738,15 +714,21 @@ static int acpi_cpufreq_cpu_init(struct
}
pr_debug("SYSTEM IO addr space\n");
data->cpu_feature = SYSTEM_IO_CAPABLE;
+ data->cpu_freq_read = cpu_freq_read_io;
+ data->cpu_freq_write = cpu_freq_write_io;
break;
case ACPI_ADR_SPACE_FIXED_HARDWARE:
pr_debug("HARDWARE addr space\n");
if (check_est_cpu(cpu)) {
data->cpu_feature = SYSTEM_INTEL_MSR_CAPABLE;
+ data->cpu_freq_read = cpu_freq_read_intel;
+ data->cpu_freq_write = cpu_freq_write_intel;
break;
}
if (check_amd_hwpstate_cpu(cpu)) {
data->cpu_feature = SYSTEM_AMD_MSR_CAPABLE;
+ data->cpu_freq_read = cpu_freq_read_amd;
+ data->cpu_freq_write = cpu_freq_write_amd;
break;
}
result = -ENODEV;
From: Rafael J. Wysocki <[email protected]>
In addition to fields representing governor tunables, struct dbs_data
contains some fields needed for the management of objects of that
type. As it turns out, that part of struct dbs_data may be shared
with (future) governors that won't use the common code used by
"ondemand" and "conservative", so move it to a separate struct type
and modify the code using struct dbs_data accordingly.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
drivers/cpufreq/cpufreq_conservative.c | 15 +++--
drivers/cpufreq/cpufreq_governor.c | 90 ++++++++++++++++++++-------------
drivers/cpufreq/cpufreq_governor.h | 36 +++++++------
drivers/cpufreq/cpufreq_ondemand.c | 19 ++++--
4 files changed, 97 insertions(+), 63 deletions(-)
Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -41,6 +41,13 @@
/* Ondemand Sampling types */
enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};
+struct gov_tunables {
+ struct kobject kobj;
+ struct list_head policy_list;
+ struct mutex update_lock;
+ int usage_count;
+};
+
/*
* Abbreviations:
* dbs: used as a shortform for demand based switching It helps to keep variable
@@ -52,7 +59,7 @@ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};
/* Governor demand based switching data (per-policy or global). */
struct dbs_data {
- int usage_count;
+ struct gov_tunables gt;
void *tuners;
unsigned int min_sampling_rate;
unsigned int ignore_nice_load;
@@ -60,37 +67,34 @@ struct dbs_data {
unsigned int sampling_down_factor;
unsigned int up_threshold;
unsigned int io_is_busy;
-
- struct kobject kobj;
- struct list_head policy_dbs_list;
- /*
- * Protect concurrent updates to governor tunables from sysfs,
- * policy_dbs_list and usage_count.
- */
- struct mutex mutex;
};
+static inline struct dbs_data *to_dbs_data(struct gov_tunables *gt)
+{
+ return container_of(gt, struct dbs_data, gt);
+}
+
/* Governor's specific attributes */
-struct dbs_data;
struct governor_attr {
struct attribute attr;
- ssize_t (*show)(struct dbs_data *dbs_data, char *buf);
- ssize_t (*store)(struct dbs_data *dbs_data, const char *buf,
- size_t count);
+ ssize_t (*show)(struct gov_tunables *gt, char *buf);
+ ssize_t (*store)(struct gov_tunables *gt, const char *buf, size_t count);
};
#define gov_show_one(_gov, file_name) \
static ssize_t show_##file_name \
-(struct dbs_data *dbs_data, char *buf) \
+(struct gov_tunables *gt, char *buf) \
{ \
+ struct dbs_data *dbs_data = to_dbs_data(gt); \
struct _gov##_dbs_tuners *tuners = dbs_data->tuners; \
return sprintf(buf, "%u\n", tuners->file_name); \
}
#define gov_show_one_common(file_name) \
static ssize_t show_##file_name \
-(struct dbs_data *dbs_data, char *buf) \
+(struct gov_tunables *gt, char *buf) \
{ \
+ struct dbs_data *dbs_data = to_dbs_data(gt); \
return sprintf(buf, "%u\n", dbs_data->file_name); \
}
@@ -184,7 +188,7 @@ void od_register_powersave_bias_handler(
(struct cpufreq_policy *, unsigned int, unsigned int),
unsigned int powersave_bias);
void od_unregister_powersave_bias_handler(void);
-ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf,
+ssize_t store_sampling_rate(struct gov_tunables *gt, const char *buf,
size_t count);
void gov_update_cpu_data(struct dbs_data *dbs_data);
#endif /* _CPUFREQ_GOVERNOR_H */
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -42,9 +42,10 @@ static DEFINE_MUTEX(gov_dbs_data_mutex);
* This must be called with dbs_data->mutex held, otherwise traversing
* policy_dbs_list isn't safe.
*/
-ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf,
+ssize_t store_sampling_rate(struct gov_tunables *gt, const char *buf,
size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(gt);
struct policy_dbs_info *policy_dbs;
unsigned int rate;
int ret;
@@ -58,7 +59,7 @@ ssize_t store_sampling_rate(struct dbs_d
* We are operating under dbs_data->mutex and so the list and its
* entries can't be freed concurrently.
*/
- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) {
+ list_for_each_entry(policy_dbs, &gt->policy_list, list) {
mutex_lock(&policy_dbs->timer_mutex);
/*
* On 32-bit architectures this may race with the
@@ -95,7 +96,7 @@ void gov_update_cpu_data(struct dbs_data
{
struct policy_dbs_info *policy_dbs;
- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) {
+ list_for_each_entry(policy_dbs, &dbs_data->gt.policy_list, list) {
unsigned int j;
for_each_cpu(j, policy_dbs->policy->cpus) {
@@ -110,9 +111,9 @@ void gov_update_cpu_data(struct dbs_data
}
EXPORT_SYMBOL_GPL(gov_update_cpu_data);
-static inline struct dbs_data *to_dbs_data(struct kobject *kobj)
+static inline struct gov_tunables *to_gov_tunables(struct kobject *kobj)
{
- return container_of(kobj, struct dbs_data, kobj);
+ return container_of(kobj, struct gov_tunables, kobj);
}
static inline struct governor_attr *to_gov_attr(struct attribute *attr)
@@ -123,25 +124,24 @@ static inline struct governor_attr *to_g
static ssize_t governor_show(struct kobject *kobj, struct attribute *attr,
char *buf)
{
- struct dbs_data *dbs_data = to_dbs_data(kobj);
struct governor_attr *gattr = to_gov_attr(attr);
- return gattr->show(dbs_data, buf);
+ return gattr->show(to_gov_tunables(kobj), buf);
}
static ssize_t governor_store(struct kobject *kobj, struct attribute *attr,
const char *buf, size_t count)
{
- struct dbs_data *dbs_data = to_dbs_data(kobj);
+ struct gov_tunables *gt = to_gov_tunables(kobj);
struct governor_attr *gattr = to_gov_attr(attr);
int ret = -EBUSY;
- mutex_lock(&dbs_data->mutex);
+ mutex_lock(&gt->update_lock);
- if (dbs_data->usage_count)
- ret = gattr->store(dbs_data, buf, count);
+ if (gt->usage_count)
+ ret = gattr->store(gt, buf, count);
- mutex_unlock(&dbs_data->mutex);
+ mutex_unlock(&gt->update_lock);
return ret;
}
@@ -424,6 +424,41 @@ static void free_policy_dbs_info(struct
gov->free(policy_dbs);
}
+static void gov_tunables_init(struct gov_tunables *gt,
+ struct list_head *list_node)
+{
+ INIT_LIST_HEAD(&gt->policy_list);
+ mutex_init(&gt->update_lock);
+ gt->usage_count = 1;
+ list_add(list_node, &gt->policy_list);
+}
+
+static void gov_tunables_get(struct gov_tunables *gt,
+ struct list_head *list_node)
+{
+ mutex_lock(&gt->update_lock);
+ gt->usage_count++;
+ list_add(list_node, &gt->policy_list);
+ mutex_unlock(&gt->update_lock);
+}
+
+static unsigned int gov_tunables_put(struct gov_tunables *gt,
+ struct list_head *list_node)
+{
+ unsigned int count;
+
+ mutex_lock(&gt->update_lock);
+ list_del(list_node);
+ count = --gt->usage_count;
+ mutex_unlock(&gt->update_lock);
+ if (count)
+ return count;
+
+ kobject_put(&gt->kobj);
+ mutex_destroy(&gt->update_lock);
+ return 0;
+}
+
static int cpufreq_governor_init(struct cpufreq_policy *policy)
{
struct dbs_governor *gov = dbs_governor_of(policy);
@@ -452,10 +487,7 @@ static int cpufreq_governor_init(struct
policy_dbs->dbs_data = dbs_data;
policy->governor_data = policy_dbs;
- mutex_lock(&dbs_data->mutex);
- dbs_data->usage_count++;
- list_add(&policy_dbs->list, &dbs_data->policy_dbs_list);
- mutex_unlock(&dbs_data->mutex);
+ gov_tunables_get(&dbs_data->gt, &policy_dbs->list);
goto out;
}
@@ -465,8 +497,7 @@ static int cpufreq_governor_init(struct
goto free_policy_dbs_info;
}
- INIT_LIST_HEAD(&dbs_data->policy_dbs_list);
- mutex_init(&dbs_data->mutex);
+ gov_tunables_init(&dbs_data->gt, &policy_dbs->list);
ret = gov->init(dbs_data, !policy->governor->initialized);
if (ret)
@@ -486,14 +517,11 @@ static int cpufreq_governor_init(struct
if (!have_governor_per_policy())
gov->gdbs_data = dbs_data;
- policy->governor_data = policy_dbs;
-
policy_dbs->dbs_data = dbs_data;
- dbs_data->usage_count = 1;
- list_add(&policy_dbs->list, &dbs_data->policy_dbs_list);
+ policy->governor_data = policy_dbs;
gov->kobj_type.sysfs_ops = &governor_sysfs_ops;
- ret = kobject_init_and_add(&dbs_data->kobj, &gov->kobj_type,
+ ret = kobject_init_and_add(&dbs_data->gt.kobj, &gov->kobj_type,
get_governor_parent_kobj(policy),
"%s", gov->gov.name);
if (!ret)
@@ -522,29 +550,21 @@ static int cpufreq_governor_exit(struct
struct dbs_governor *gov = dbs_governor_of(policy);
struct policy_dbs_info *policy_dbs = policy->governor_data;
struct dbs_data *dbs_data = policy_dbs->dbs_data;
- int count;
+ unsigned int count;
/* Protect gov->gdbs_data against concurrent updates. */
mutex_lock(&gov_dbs_data_mutex);
- mutex_lock(&dbs_data->mutex);
- list_del(&policy_dbs->list);
- count = --dbs_data->usage_count;
- mutex_unlock(&dbs_data->mutex);
+ count = gov_tunables_put(&dbs_data->gt, &policy_dbs->list);
- if (!count) {
- kobject_put(&dbs_data->kobj);
-
- policy->governor_data = NULL;
+ policy->governor_data = NULL;
+ if (!count) {
if (!have_governor_per_policy())
gov->gdbs_data = NULL;
gov->exit(dbs_data, policy->governor->initialized == 1);
- mutex_destroy(&dbs_data->mutex);
kfree(dbs_data);
- } else {
- policy->governor_data = NULL;
}
free_policy_dbs_info(policy_dbs, gov);
Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
+++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
@@ -207,9 +207,10 @@ static unsigned int od_dbs_timer(struct
/************************** sysfs interface ************************/
static struct dbs_governor od_dbs_gov;
-static ssize_t store_io_is_busy(struct dbs_data *dbs_data, const char *buf,
+static ssize_t store_io_is_busy(struct gov_tunables *gt, const char *buf,
size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(gt);
unsigned int input;
int ret;
@@ -224,9 +225,10 @@ static ssize_t store_io_is_busy(struct d
return count;
}
-static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf,
+static ssize_t store_up_threshold(struct gov_tunables *gt, const char *buf,
size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(gt);
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
@@ -240,9 +242,10 @@ static ssize_t store_up_threshold(struct
return count;
}
-static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data,
+static ssize_t store_sampling_down_factor(struct gov_tunables *gt,
const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(gt);
struct policy_dbs_info *policy_dbs;
unsigned int input;
int ret;
@@ -254,7 +257,7 @@ static ssize_t store_sampling_down_facto
dbs_data->sampling_down_factor = input;
/* Reset down sampling multiplier in case it was active */
- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) {
+ list_for_each_entry(policy_dbs, &gt->policy_list, list) {
/*
* Doing this without locking might lead to using different
* rate_mult values in od_update() and od_dbs_timer().
@@ -267,9 +270,10 @@ static ssize_t store_sampling_down_facto
return count;
}
-static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data,
+static ssize_t store_ignore_nice_load(struct gov_tunables *gt,
const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(gt);
unsigned int input;
int ret;
@@ -291,9 +295,10 @@ static ssize_t store_ignore_nice_load(st
return count;
}
-static ssize_t store_powersave_bias(struct dbs_data *dbs_data, const char *buf,
+static ssize_t store_powersave_bias(struct gov_tunables *gt, const char *buf,
size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(gt);
struct od_dbs_tuners *od_tuners = dbs_data->tuners;
struct policy_dbs_info *policy_dbs;
unsigned int input;
@@ -308,7 +313,7 @@ static ssize_t store_powersave_bias(stru
od_tuners->powersave_bias = input;
- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list)
+ list_for_each_entry(policy_dbs, &gt->policy_list, list)
ondemand_powersave_bias_init(policy_dbs->policy);
return count;
Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c
+++ linux-pm/drivers/cpufreq/cpufreq_conservative.c
@@ -129,9 +129,10 @@ static struct notifier_block cs_cpufreq_
/************************** sysfs interface ************************/
static struct dbs_governor cs_dbs_gov;
-static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data,
+static ssize_t store_sampling_down_factor(struct gov_tunables *gt,
const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(gt);
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
@@ -143,9 +144,10 @@ static ssize_t store_sampling_down_facto
return count;
}
-static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf,
+static ssize_t store_up_threshold(struct gov_tunables *gt, const char *buf,
size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(gt);
struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input;
int ret;
@@ -158,9 +160,10 @@ static ssize_t store_up_threshold(struct
return count;
}
-static ssize_t store_down_threshold(struct dbs_data *dbs_data, const char *buf,
+static ssize_t store_down_threshold(struct gov_tunables *gt, const char *buf,
size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(gt);
struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input;
int ret;
@@ -175,9 +178,10 @@ static ssize_t store_down_threshold(stru
return count;
}
-static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data,
+static ssize_t store_ignore_nice_load(struct gov_tunables *gt,
const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(gt);
unsigned int input;
int ret;
@@ -199,9 +203,10 @@ static ssize_t store_ignore_nice_load(st
return count;
}
-static ssize_t store_freq_step(struct dbs_data *dbs_data, const char *buf,
+static ssize_t store_freq_step(struct gov_tunables *gt, const char *buf,
size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(gt);
struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input;
int ret;
From: Rafael J. Wysocki <[email protected]>
Move abstract code related to struct gov_tunables to a separate (new)
file so it can be shared with (future) governors that won't share
more code with "ondemand" and "conservative".
No intentional functional changes.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
drivers/cpufreq/Kconfig | 4 +
drivers/cpufreq/Makefile | 1
drivers/cpufreq/cpufreq_governor.c | 82 ---------------------------
drivers/cpufreq/cpufreq_governor.h | 6 ++
drivers/cpufreq/cpufreq_governor_tunables.c | 84 ++++++++++++++++++++++++++++
5 files changed, 95 insertions(+), 82 deletions(-)
Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -18,7 +18,11 @@ config CPU_FREQ
if CPU_FREQ
+config CPU_FREQ_GOV_TUNABLES
+ bool
+
config CPU_FREQ_GOV_COMMON
+ select CPU_FREQ_GOV_TUNABLES
select IRQ_WORK
bool
Index: linux-pm/drivers/cpufreq/Makefile
===================================================================
--- linux-pm.orig/drivers/cpufreq/Makefile
+++ linux-pm/drivers/cpufreq/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE) +=
obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND) += cpufreq_ondemand.o
obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE) += cpufreq_conservative.o
obj-$(CONFIG_CPU_FREQ_GOV_COMMON) += cpufreq_governor.o
+obj-$(CONFIG_CPU_FREQ_GOV_TUNABLES) += cpufreq_governor_tunables.o
obj-$(CONFIG_CPUFREQ_DT) += cpufreq-dt.o
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -111,53 +111,6 @@ void gov_update_cpu_data(struct dbs_data
}
EXPORT_SYMBOL_GPL(gov_update_cpu_data);
-static inline struct gov_tunables *to_gov_tunables(struct kobject *kobj)
-{
- return container_of(kobj, struct gov_tunables, kobj);
-}
-
-static inline struct governor_attr *to_gov_attr(struct attribute *attr)
-{
- return container_of(attr, struct governor_attr, attr);
-}
-
-static ssize_t governor_show(struct kobject *kobj, struct attribute *attr,
- char *buf)
-{
- struct governor_attr *gattr = to_gov_attr(attr);
-
- return gattr->show(to_gov_tunables(kobj), buf);
-}
-
-static ssize_t governor_store(struct kobject *kobj, struct attribute *attr,
- const char *buf, size_t count)
-{
- struct gov_tunables *gt = to_gov_tunables(kobj);
- struct governor_attr *gattr = to_gov_attr(attr);
- int ret = -EBUSY;
-
- mutex_lock(&gt->update_lock);
-
- if (gt->usage_count)
- ret = gattr->store(gt, buf, count);
-
- mutex_unlock(&gt->update_lock);
-
- return ret;
-}
-
-/*
- * Sysfs Ops for accessing governor attributes.
- *
- * All show/store invocations for governor specific sysfs attributes, will first
- * call the below show/store callbacks and the attribute specific callback will
- * be called from within it.
- */
-static const struct sysfs_ops governor_sysfs_ops = {
- .show = governor_show,
- .store = governor_store,
-};
-
unsigned int dbs_update(struct cpufreq_policy *policy)
{
struct policy_dbs_info *policy_dbs = policy->governor_data;
@@ -424,41 +377,6 @@ static void free_policy_dbs_info(struct
gov->free(policy_dbs);
}
-static void gov_tunables_init(struct gov_tunables *gt,
- struct list_head *list_node)
-{
- INIT_LIST_HEAD(&gt->policy_list);
- mutex_init(&gt->update_lock);
- gt->usage_count = 1;
- list_add(list_node, &gt->policy_list);
-}
-
-static void gov_tunables_get(struct gov_tunables *gt,
- struct list_head *list_node)
-{
- mutex_lock(&gt->update_lock);
- gt->usage_count++;
- list_add(list_node, &gt->policy_list);
- mutex_unlock(&gt->update_lock);
-}
-
-static unsigned int gov_tunables_put(struct gov_tunables *gt,
- struct list_head *list_node)
-{
- unsigned int count;
-
- mutex_lock(&gt->update_lock);
- list_del(list_node);
- count = --gt->usage_count;
- mutex_unlock(&gt->update_lock);
- if (count)
- return count;
-
- kobject_put(&gt->kobj);
- mutex_destroy(&gt->update_lock);
- return 0;
-}
-
static int cpufreq_governor_init(struct cpufreq_policy *policy)
{
struct dbs_governor *gov = dbs_governor_of(policy);
Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -48,6 +48,12 @@ struct gov_tunables {
int usage_count;
};
+extern const struct sysfs_ops governor_sysfs_ops;
+
+void gov_tunables_init(struct gov_tunables *gt, struct list_head *list_node);
+void gov_tunables_get(struct gov_tunables *gt, struct list_head *list_node);
+unsigned int gov_tunables_put(struct gov_tunables *gt, struct list_head *list_node);
+
/*
* Abbreviations:
* dbs: used as a shortform for demand based switching It helps to keep variable
Index: linux-pm/drivers/cpufreq/cpufreq_governor_tunables.c
===================================================================
--- /dev/null
+++ linux-pm/drivers/cpufreq/cpufreq_governor_tunables.c
@@ -0,0 +1,84 @@
+/*
+ * Abstract code for CPUFreq governor tunables handling.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "cpufreq_governor.h"
+
+static inline struct gov_tunables *to_gov_tunables(struct kobject *kobj)
+{
+ return container_of(kobj, struct gov_tunables, kobj);
+}
+
+static inline struct governor_attr *to_gov_attr(struct attribute *attr)
+{
+ return container_of(attr, struct governor_attr, attr);
+}
+
+static ssize_t governor_show(struct kobject *kobj, struct attribute *attr,
+ char *buf)
+{
+ struct governor_attr *gattr = to_gov_attr(attr);
+
+ return gattr->show(to_gov_tunables(kobj), buf);
+}
+
+static ssize_t governor_store(struct kobject *kobj, struct attribute *attr,
+ const char *buf, size_t count)
+{
+ struct gov_tunables *gt = to_gov_tunables(kobj);
+ struct governor_attr *gattr = to_gov_attr(attr);
+ int ret;
+
+ mutex_lock(&gt->update_lock);
+ ret = gt->usage_count ? gattr->store(gt, buf, count) : -EBUSY;
+ mutex_unlock(&gt->update_lock);
+ return ret;
+}
+
+const struct sysfs_ops governor_sysfs_ops = {
+ .show = governor_show,
+ .store = governor_store,
+};
+EXPORT_SYMBOL_GPL(governor_sysfs_ops);
+
+void gov_tunables_init(struct gov_tunables *gt, struct list_head *list_node)
+{
+ INIT_LIST_HEAD(&gt->policy_list);
+ mutex_init(&gt->update_lock);
+ gt->usage_count = 1;
+ list_add(list_node, &gt->policy_list);
+}
+EXPORT_SYMBOL_GPL(gov_tunables_init);
+
+void gov_tunables_get(struct gov_tunables *gt, struct list_head *list_node)
+{
+ mutex_lock(&gt->update_lock);
+ gt->usage_count++;
+ list_add(list_node, &gt->policy_list);
+ mutex_unlock(&gt->update_lock);
+}
+EXPORT_SYMBOL_GPL(gov_tunables_get);
+
+unsigned int gov_tunables_put(struct gov_tunables *gt, struct list_head *list_node)
+{
+ unsigned int count;
+
+ mutex_lock(&gt->update_lock);
+ list_del(list_node);
+ count = --gt->usage_count;
+ mutex_unlock(&gt->update_lock);
+ if (count)
+ return count;
+
+ kobject_put(&gt->kobj);
+ mutex_destroy(&gt->update_lock);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(gov_tunables_put);
From: Rafael J. Wysocki <[email protected]>
Add a new cpufreq scaling governor, called "schedutil", that uses
scheduler-provided CPU utilization information as input for making
its decisions.
Doing that is possible after commit fe7034338ba0 (cpufreq: Add
mechanism for registering utilization update callbacks) that
introduced cpufreq_update_util() called by the scheduler on
utilization changes (from CFS) and RT/DL task status updates.
In particular, CPU frequency scaling decisions may be based on
the utilization data passed to cpufreq_update_util() by CFS.
The new governor is relatively simple.
The frequency selection formula it uses is essentially the same as
the one used by the "ondemand" governor (although it doesn't use the
additional up_threshold parameter), but instead of computing the load
as the "non-idle CPU time" to "total CPU time" ratio, it takes the
utilization data provided by CFS as input. More specifically,
it represents "load" as the util/max ratio, where util and max
are the utilization and CPU capacity coming from CFS.
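As an illustration (the numbers are made up, the formula itself is in the
patch below): with cpuinfo.min_freq = 800 MHz, cpuinfo.max_freq = 2400 MHz,
util = 512 and max = 1024, the governor requests

	next_f = min_f + util * (max_f - min_f) / max
	       = 800 MHz + 512 * (2400 MHz - 800 MHz) / 1024 = 1600 MHz

and the result is subsequently clamped to the policy->min/max limits.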
All of the computations are carried out in the utilization update
handlers provided by the new governor. One of those handlers is
used for cpufreq policies shared between multiple CPUs and the other
one is for policies with one CPU only (and therefore it doesn't need
to use any extra synchronization means).
The governor supports fast frequency switching if that is supported
by the cpufreq driver in use and possible for the given policy.
In the fast switching case, all operations of the governor take
place in its utilization update handlers. If fast switching cannot
be used, the frequency switch operations are carried out with the
help of a work item which only calls __cpufreq_driver_target()
(under a mutex) to trigger a frequency update (to a value already
computed beforehand in one of the utilization update handlers).
Currently, the governor treats all of the RT and DL tasks as
"unknown utilization" and sets the frequency to the allowed
maximum when updated from the RT or DL sched classes. That
heavy-handed approach should be replaced with something more
subtle and specifically targeted at RT and DL tasks.
The governor shares some tunables management code with the
"ondemand" and "conservative" governors and uses some common
definitions from cpufreq_governor.h, but apart from that it
is stand-alone.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
In addition to the changes mentioned in the intro message [0/6] this also
tweaks the frequency selection formula in a couple of ways.
First off, it uses min and max frequencies correctly (the formula from
"ondemand" is applied to cpuinfo.min/max_freq like the original and
policy->min/max are applied to the result later).
Second, RELATION_L is used most of the time except for the bottom 1/4
of the available frequency range (but also note that DL tasks are
treated in the same way as RT ones, meaning f_max is always used for
them).
Finally, the condition for discarding idle policy CPUs was modified
to also work if the rate limit is below the scheduling rate.
The code in sugov_init/exit/stop() and the irq_work handler look
very similar to the analogous code in cpufreq_governor.c, but it
is different enough that trying to avoid that duplication was not
practical.
Thanks,
Rafael
---
drivers/cpufreq/Kconfig | 26 +
drivers/cpufreq/Makefile | 1
drivers/cpufreq/cpufreq_schedutil.c | 501 ++++++++++++++++++++++++++++++++++++
3 files changed, 528 insertions(+)
Index: linux-pm/drivers/cpufreq/cpufreq_schedutil.c
===================================================================
--- /dev/null
+++ linux-pm/drivers/cpufreq/cpufreq_schedutil.c
@@ -0,0 +1,501 @@
+/*
+ * CPUFreq governor based on scheduler-provided CPU utilization data.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/percpu-defs.h>
+#include <linux/slab.h>
+
+#include "cpufreq_governor.h"
+
+struct sugov_tunables {
+ struct gov_tunables gt;
+ unsigned int rate_limit_us;
+};
+
+struct sugov_policy {
+ struct cpufreq_policy *policy;
+
+ struct sugov_tunables *tunables;
+ struct list_head tunables_hook;
+
+ raw_spinlock_t update_lock; /* For shared policies */
+ u64 last_freq_update_time;
+ s64 freq_update_delay_ns;
+ unsigned int next_freq;
+
+ /* The next fields are only needed if fast switch cannot be used. */
+ unsigned int relation;
+ struct irq_work irq_work;
+ struct work_struct work;
+ struct mutex work_lock;
+ bool work_in_progress;
+
+ bool need_freq_update;
+};
+
+struct sugov_cpu {
+ struct update_util_data update_util;
+ struct sugov_policy *sg_policy;
+
+ /* The fields below are only needed when sharing a policy. */
+ unsigned long util;
+ unsigned long max;
+ u64 last_update;
+};
+
+static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
+
+/************************ Governor internals ***********************/
+
+static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
+{
+ u64 delta_ns;
+
+ if (sg_policy->work_in_progress)
+ return false;
+
+ if (unlikely(sg_policy->need_freq_update)) {
+ sg_policy->need_freq_update = false;
+ return true;
+ }
+
+ delta_ns = time - sg_policy->last_freq_update_time;
+ return (s64)delta_ns >= sg_policy->freq_update_delay_ns;
+}
+
+static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
+ unsigned long util, unsigned long max,
+ unsigned int next_freq)
+{
+ struct cpufreq_policy *policy = sg_policy->policy;
+ unsigned int rel;
+
+ if (next_freq > policy->max)
+ next_freq = policy->max;
+ else if (next_freq < policy->min)
+ next_freq = policy->min;
+
+ sg_policy->last_freq_update_time = time;
+ if (sg_policy->next_freq == next_freq)
+ return;
+
+ sg_policy->next_freq = next_freq;
+ /*
+ * If utilization is less than max / 4, use RELATION_C to allow the
+ * minimum frequency to be selected more often in case the distance from
+ * it to the next available frequency in the table is significant.
+ */
+ rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L;
+ if (policy->fast_switch_possible) {
+ cpufreq_driver_fast_switch(policy, next_freq, rel);
+ } else {
+ sg_policy->relation = rel;
+ sg_policy->work_in_progress = true;
+ irq_work_queue(&sg_policy->irq_work);
+ }
+}
+
+static void sugov_update_single(struct update_util_data *data, u64 time,
+ unsigned long util, unsigned long max)
+{
+ struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ unsigned int min_f, max_f, next_f;
+
+ if (!sugov_should_update_freq(sg_policy, time))
+ return;
+
+ min_f = sg_policy->policy->cpuinfo.min_freq;
+ max_f = sg_policy->policy->cpuinfo.max_freq;
+ next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
+
+ sugov_update_commit(sg_policy, time, util, max, next_f);
+}
+
+static unsigned int sugov_next_freq(struct sugov_policy *sg_policy,
+ unsigned long util, unsigned long max)
+{
+ struct cpufreq_policy *policy = sg_policy->policy;
+ unsigned int min_f = policy->cpuinfo.min_freq;
+ unsigned int max_f = policy->cpuinfo.max_freq;
+ u64 last_freq_update_time = sg_policy->last_freq_update_time;
+ unsigned int j;
+
+ if (util > max)
+ return max_f;
+
+ for_each_cpu(j, policy->cpus) {
+ struct sugov_cpu *j_sg_cpu;
+ unsigned long j_util, j_max;
+ u64 delta_ns;
+
+ if (j == smp_processor_id())
+ continue;
+
+ j_sg_cpu = &per_cpu(sugov_cpu, j);
+ /*
+ * If the CPU utilization was last updated before the previous
+ * frequency update and the time elapsed between the last update
+ * of the CPU utilization and the last frequency update is long
+ * enough, don't take the CPU into account as it probably is
+ * idle now.
+ */
+ delta_ns = last_freq_update_time - j_sg_cpu->last_update;
+ if ((s64)delta_ns > NSEC_PER_SEC / HZ)
+ continue;
+
+ j_util = j_sg_cpu->util;
+ j_max = j_sg_cpu->max;
+ if (j_util > j_max)
+ return max_f;
+
+ if (j_util * max > j_max * util) {
+ util = j_util;
+ max = j_max;
+ }
+ }
+
+ return min_f + util * (max_f - min_f) / max;
+}
+
+static void sugov_update_shared(struct update_util_data *data, u64 time,
+ unsigned long util, unsigned long max)
+{
+ struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ unsigned int next_f;
+
+ raw_spin_lock(&sg_policy->update_lock);
+
+ sg_cpu->util = util;
+ sg_cpu->max = max;
+ sg_cpu->last_update = time;
+
+ if (sugov_should_update_freq(sg_policy, time)) {
+ next_f = sugov_next_freq(sg_policy, util, max);
+ sugov_update_commit(sg_policy, time, util, max, next_f);
+ }
+
+ raw_spin_unlock(&sg_policy->update_lock);
+}
+
+static void sugov_work(struct work_struct *work)
+{
+ struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
+
+ mutex_lock(&sg_policy->work_lock);
+ __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
+ sg_policy->relation);
+ mutex_unlock(&sg_policy->work_lock);
+
+ sg_policy->work_in_progress = false;
+}
+
+static void sugov_irq_work(struct irq_work *irq_work)
+{
+ struct sugov_policy *sg_policy;
+
+ sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
+ schedule_work(&sg_policy->work);
+}
+
+/************************** sysfs interface ************************/
+
+static struct sugov_tunables *global_tunables;
+static DEFINE_MUTEX(global_tunables_lock);
+
+static inline struct sugov_tunables *to_sugov_tunables(struct gov_tunables *gt)
+{
+ return container_of(gt, struct sugov_tunables, gt);
+}
+
+static ssize_t rate_limit_us_show(struct gov_tunables *gt, char *buf)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(gt);
+
+ return sprintf(buf, "%u\n", tunables->rate_limit_us);
+}
+
+static ssize_t rate_limit_us_store(struct gov_tunables *gt, const char *buf,
+ size_t count)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(gt);
+ struct sugov_policy *sg_policy;
+ unsigned int rate_limit_us;
+ int ret;
+
+ ret = sscanf(buf, "%u", &rate_limit_us);
+ if (ret != 1)
+ return -EINVAL;
+
+ tunables->rate_limit_us = rate_limit_us;
+
+ list_for_each_entry(sg_policy, &gt->policy_list, tunables_hook)
+ sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
+
+ return count;
+}
+
+static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
+
+static struct attribute *sugov_attributes[] = {
+ &rate_limit_us.attr,
+ NULL
+};
+
+static struct kobj_type sugov_tunables_ktype = {
+ .default_attrs = sugov_attributes,
+ .sysfs_ops = &governor_sysfs_ops,
+};
+
+/********************** cpufreq governor interface *********************/
+
+static struct cpufreq_governor schedutil_gov;
+
+static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy;
+
+ sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
+ if (!sg_policy)
+ return NULL;
+
+ sg_policy->policy = policy;
+ init_irq_work(&sg_policy->irq_work, sugov_irq_work);
+ INIT_WORK(&sg_policy->work, sugov_work);
+ mutex_init(&sg_policy->work_lock);
+ raw_spin_lock_init(&sg_policy->update_lock);
+ return sg_policy;
+}
+
+static void sugov_policy_free(struct sugov_policy *sg_policy)
+{
+ mutex_destroy(&sg_policy->work_lock);
+ kfree(sg_policy);
+}
+
+static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy)
+{
+ struct sugov_tunables *tunables;
+
+ tunables = kzalloc(sizeof(*tunables), GFP_KERNEL);
+ if (tunables)
+ gov_tunables_init(&tunables->gt, &sg_policy->tunables_hook);
+
+ return tunables;
+}
+
+static void sugov_tunables_free(struct sugov_tunables *tunables)
+{
+ if (!have_governor_per_policy())
+ global_tunables = NULL;
+
+ kfree(tunables);
+}
+
+static int sugov_init(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy;
+ struct sugov_tunables *tunables;
+ unsigned int lat;
+ int ret = 0;
+
+ /* State should be equivalent to EXIT */
+ if (policy->governor_data)
+ return -EBUSY;
+
+ sg_policy = sugov_policy_alloc(policy);
+ if (!sg_policy)
+ return -ENOMEM;
+
+ mutex_lock(&global_tunables_lock);
+
+ if (global_tunables) {
+ if (WARN_ON(have_governor_per_policy())) {
+ ret = -EINVAL;
+ goto free_sg_policy;
+ }
+ policy->governor_data = sg_policy;
+ sg_policy->tunables = global_tunables;
+
+ gov_tunables_get(&global_tunables->gt, &sg_policy->tunables_hook);
+ goto out;
+ }
+
+ tunables = sugov_tunables_alloc(sg_policy);
+ if (!tunables) {
+ ret = -ENOMEM;
+ goto free_sg_policy;
+ }
+
+ tunables->rate_limit_us = LATENCY_MULTIPLIER;
+ lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
+ if (lat)
+ tunables->rate_limit_us *= lat;
+
+ if (!have_governor_per_policy())
+ global_tunables = tunables;
+
+ policy->governor_data = sg_policy;
+ sg_policy->tunables = tunables;
+
+ ret = kobject_init_and_add(&tunables->gt.kobj, &sugov_tunables_ktype,
+ get_governor_parent_kobj(policy), "%s",
+ schedutil_gov.name);
+ if (!ret)
+ goto out;
+
+ /* Failure, so roll back. */
+ policy->governor_data = NULL;
+ sugov_tunables_free(tunables);
+
+ free_sg_policy:
+ pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
+ sugov_policy_free(sg_policy);
+
+ out:
+ mutex_unlock(&global_tunables_lock);
+ return ret;
+}
+
+static int sugov_exit(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ struct sugov_tunables *tunables = sg_policy->tunables;
+ unsigned int count;
+
+ mutex_lock(&global_tunables_lock);
+
+ count = gov_tunables_put(&tunables->gt, &sg_policy->tunables_hook);
+ policy->governor_data = NULL;
+ if (!count)
+ sugov_tunables_free(tunables);
+
+ mutex_unlock(&global_tunables_lock);
+
+ sugov_policy_free(sg_policy);
+ return 0;
+}
+
+static int sugov_start(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ unsigned int cpu;
+
+ sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+ sg_policy->last_freq_update_time = 0;
+ sg_policy->next_freq = UINT_MAX;
+ sg_policy->work_in_progress = false;
+ sg_policy->need_freq_update = false;
+
+ for_each_cpu(cpu, policy->cpus) {
+ struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
+
+ sg_cpu->sg_policy = sg_policy;
+ if (policy_is_shared(policy)) {
+ sg_cpu->util = ULONG_MAX;
+ sg_cpu->max = 0;
+ sg_cpu->last_update = 0;
+ sg_cpu->update_util.func = sugov_update_shared;
+ } else {
+ sg_cpu->update_util.func = sugov_update_single;
+ }
+ cpufreq_set_update_util_data(cpu, &sg_cpu->update_util);
+ }
+ return 0;
+}
+
+static int sugov_stop(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ unsigned int cpu;
+
+ for_each_cpu(cpu, policy->cpus)
+ cpufreq_set_update_util_data(cpu, NULL);
+
+ synchronize_sched();
+
+ irq_work_sync(&sg_policy->irq_work);
+ cancel_work_sync(&sg_policy->work);
+ return 0;
+}
+
+static int sugov_limits(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+
+ if (!policy->fast_switch_possible) {
+ mutex_lock(&sg_policy->work_lock);
+
+ if (policy->max < policy->cur)
+ __cpufreq_driver_target(policy, policy->max,
+ CPUFREQ_RELATION_H);
+ else if (policy->min > policy->cur)
+ __cpufreq_driver_target(policy, policy->min,
+ CPUFREQ_RELATION_L);
+
+ mutex_unlock(&sg_policy->work_lock);
+ }
+
+ sg_policy->need_freq_update = true;
+ return 0;
+}
+
+int sugov_governor(struct cpufreq_policy *policy, unsigned int event)
+{
+ if (event == CPUFREQ_GOV_POLICY_INIT) {
+ return sugov_init(policy);
+ } else if (policy->governor_data) {
+ switch (event) {
+ case CPUFREQ_GOV_POLICY_EXIT:
+ return sugov_exit(policy);
+ case CPUFREQ_GOV_START:
+ return sugov_start(policy);
+ case CPUFREQ_GOV_STOP:
+ return sugov_stop(policy);
+ case CPUFREQ_GOV_LIMITS:
+ return sugov_limits(policy);
+ }
+ }
+ return -EINVAL;
+}
+
+static struct cpufreq_governor schedutil_gov = {
+ .name = "schedutil",
+ .governor = sugov_governor,
+ .max_transition_latency = TRANSITION_LATENCY_LIMIT,
+ .owner = THIS_MODULE,
+};
+
+static int __init sugov_module_init(void)
+{
+ return cpufreq_register_governor(&schedutil_gov);
+}
+
+static void __exit sugov_module_exit(void)
+{
+ cpufreq_unregister_governor(&schedutil_gov);
+}
+
+MODULE_AUTHOR("Rafael J. Wysocki <[email protected]>");
+MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
+MODULE_LICENSE("GPL");
+
+#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+struct cpufreq_governor *cpufreq_default_governor(void)
+{
+ return &schedutil_gov;
+}
+
+fs_initcall(sugov_module_init);
+#else
+module_init(sugov_module_init);
+#endif
+module_exit(sugov_module_exit);
Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -107,6 +107,16 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
Be aware that not all cpufreq drivers support the conservative
governor. If unsure have a look at the help section of the
driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+ bool "schedutil"
+ select CPU_FREQ_GOV_SCHEDUTIL
+ select CPU_FREQ_GOV_PERFORMANCE
+ help
+ Use the 'schedutil' CPUFreq governor by default. If unsure,
+ have a look at the help section of that governor. The fallback
+ governor will be 'performance'.
+
endchoice
config CPU_FREQ_GOV_PERFORMANCE
@@ -188,6 +198,22 @@ config CPU_FREQ_GOV_CONSERVATIVE
If in doubt, say N.
+config CPU_FREQ_GOV_SCHEDUTIL
+ tristate "'schedutil' cpufreq policy governor"
+ depends on CPU_FREQ
+ select CPU_FREQ_GOV_TUNABLES
+ select IRQ_WORK
+ help
+ The frequency selection formula used by this governor is analogous
+ to the one used by 'ondemand', but instead of computing CPU load
+ as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
+ utilization data provided by the scheduler as input.
+
+ To compile this driver as a module, choose M here: the
+ module will be called cpufreq_schedutil.
+
+ If in doubt, say N.
+
comment "CPU frequency scaling drivers"
config CPUFREQ_DT
Index: linux-pm/drivers/cpufreq/Makefile
===================================================================
--- linux-pm.orig/drivers/cpufreq/Makefile
+++ linux-pm/drivers/cpufreq/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_POWERSAVE) +=
obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE) += cpufreq_userspace.o
obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND) += cpufreq_ondemand.o
obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE) += cpufreq_conservative.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
obj-$(CONFIG_CPU_FREQ_GOV_COMMON) += cpufreq_governor.o
obj-$(CONFIG_CPU_FREQ_GOV_TUNABLES) += cpufreq_governor_tunables.o
From: Rafael J. Wysocki <[email protected]>
Modify the ACPI cpufreq driver to provide a method for switching
CPU frequencies from interrupt context and update the cpufreq core
to support that method if available.
Introduce a new cpufreq driver callback, ->fast_switch, to be
invoked for frequency switching from interrupt context via a
new helper function, cpufreq_driver_fast_switch(). Add a new
policy flag, fast_switch_possible, to be set if fast frequency
switching can be used for the given policy.
Implement the ->fast_switch callback in the ACPI cpufreq driver
and make it set fast_switch_possible during policy initialization
as appropriate.
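For illustration, the expected usage from a governor's utilization update
path is roughly the following (a sketch only; the schedutil governor added
later in this series does essentially this in sugov_update_commit()):

	if (policy->fast_switch_possible)
		cpufreq_driver_fast_switch(policy, next_freq, CPUFREQ_RELATION_L);
	else
		irq_work_queue(&sg_policy->irq_work);	/* defer to process context */

where the irq_work leads to a regular __cpufreq_driver_target() call when the
driver cannot switch frequencies from interrupt context.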
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
The most important change from the previous version is that the
->fast_switch() callback takes an additional "relation" argument
and now the governor can use it to choose a selection method.
---
drivers/cpufreq/acpi-cpufreq.c | 53 +++++++++++++++++++++++++++++++++++++++++
drivers/cpufreq/cpufreq.c | 33 +++++++++++++++++++++++++
include/linux/cpufreq.h | 6 ++++
3 files changed, 92 insertions(+)
Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
+++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
@@ -458,6 +458,55 @@ static int acpi_cpufreq_target(struct cp
return result;
}
+unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq,
+ unsigned int relation)
+{
+ struct acpi_cpufreq_data *data = policy->driver_data;
+ struct acpi_processor_performance *perf;
+ struct cpufreq_frequency_table *entry, *found;
+ unsigned int next_perf_state, next_freq, freq;
+
+ /*
+ * Find the closest frequency above target_freq or equal to it.
+ *
+ * The table is sorted in the reverse order with respect to the
+ * frequency and all of the entries are valid (see the initialization).
+ */
+ entry = data->freq_table;
+ do {
+ entry++;
+ freq = entry->frequency;
+ } while (freq >= target_freq && freq != CPUFREQ_TABLE_END);
+ found = entry - 1;
+ /*
+ * Use the one found or the previous one, depending on the relation.
+ * CPUFREQ_RELATION_H is not taken into account here, but it is not
+ * expected to be passed to this function anyway.
+ */
+ next_freq = found->frequency;
+ if (freq == CPUFREQ_TABLE_END || relation != CPUFREQ_RELATION_C ||
+ target_freq - freq >= next_freq - target_freq) {
+ next_perf_state = found->driver_data;
+ } else {
+ next_freq = freq;
+ next_perf_state = entry->driver_data;
+ }
+
+ perf = to_perf_data(data);
+ if (perf->state == next_perf_state) {
+ if (unlikely(data->resume))
+ data->resume = 0;
+ else
+ return next_freq;
+ }
+
+ data->cpu_freq_write(&perf->control_register,
+ perf->states[next_perf_state].control);
+ perf->state = next_perf_state;
+ return next_freq;
+}
+
static unsigned long
acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu)
{
@@ -740,6 +789,9 @@ static int acpi_cpufreq_cpu_init(struct
goto err_unreg;
}
+ policy->fast_switch_possible = !acpi_pstate_strict &&
+ !(policy_is_shared(policy) && policy->shared_type != CPUFREQ_SHARED_TYPE_ANY);
+
data->freq_table = kzalloc(sizeof(*data->freq_table) *
(perf->state_count+1), GFP_KERNEL);
if (!data->freq_table) {
@@ -874,6 +926,7 @@ static struct freq_attr *acpi_cpufreq_at
static struct cpufreq_driver acpi_cpufreq_driver = {
.verify = cpufreq_generic_frequency_table_verify,
.target_index = acpi_cpufreq_target,
+ .fast_switch = acpi_cpufreq_fast_switch,
.bios_limit = acpi_processor_get_bios_limit,
.init = acpi_cpufreq_cpu_init,
.exit = acpi_cpufreq_cpu_exit,
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -1772,6 +1772,39 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie
* GOVERNORS *
*********************************************************************/
+/**
+ * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch.
+ * @policy: cpufreq policy to switch the frequency for.
+ * @target_freq: New frequency to set (may be approximate).
+ * @relation: Relation to use for frequency selection.
+ *
+ * Carry out a fast frequency switch from interrupt context.
+ *
+ * This function must not be called if policy->fast_switch_possible is unset.
+ *
+ * Governors calling this function must guarantee that it will never be invoked
+ * twice in parallel for the same policy and that it will never be called in
+ * parallel with either ->target() or ->target_index() for the same policy.
+ *
+ * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch()
+ * callback, the hardware configuration must be preserved.
+ */
+void cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq, unsigned int relation)
+{
+ unsigned int freq;
+
+ if (target_freq == policy->cur)
+ return;
+
+ freq = cpufreq_driver->fast_switch(policy, target_freq, relation);
+ if (freq != CPUFREQ_ENTRY_INVALID) {
+ policy->cur = freq;
+ trace_cpu_frequency(freq, smp_processor_id());
+ }
+}
+EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch);
+
/* Must set freqs->new to intermediate frequency */
static int __target_intermediate(struct cpufreq_policy *policy,
struct cpufreq_freqs *freqs, int index)
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -81,6 +81,7 @@ struct cpufreq_policy {
struct cpufreq_governor *governor; /* see below */
void *governor_data;
char last_governor[CPUFREQ_NAME_LEN]; /* last governor used */
+ bool fast_switch_possible;
struct work_struct update; /* if update_policy() needs to be
* called, but you're in IRQ context */
@@ -270,6 +271,9 @@ struct cpufreq_driver {
unsigned int relation); /* Deprecated */
int (*target_index)(struct cpufreq_policy *policy,
unsigned int index);
+ unsigned int (*fast_switch)(struct cpufreq_policy *policy,
+ unsigned int target_freq,
+ unsigned int relation);
/*
* Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION
* unset.
@@ -484,6 +488,8 @@ struct cpufreq_governor {
};
/* Pass a target to the cpufreq driver */
+void cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq, unsigned int relation);
int cpufreq_driver_target(struct cpufreq_policy *policy,
unsigned int target_freq,
unsigned int relation);
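For illustration, a minimal sketch of how a governor driven by the
cpufreq_update_util() callbacks might use the fast switch path added
here, falling back to an irq_work when fast switching is not possible.
The my_gov_* names and structure are hypothetical and not part of the
posted series:

static void my_gov_update(struct update_util_data *data, u64 time,
			  unsigned long util, unsigned long max)
{
	struct my_gov_policy *g = container_of(data, struct my_gov_policy,
					       update_util);
	unsigned int next_freq = my_gov_compute_next_freq(g, util, max);

	if (g->policy->fast_switch_possible) {
		/* Set the frequency directly from this (interrupt) context. */
		cpufreq_driver_fast_switch(g->policy, next_freq,
					   CPUFREQ_RELATION_L);
	} else {
		/* Defer the frequency change to process context. */
		g->next_freq = next_freq;
		irq_work_queue(&g->irq_work);
	}
}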
From: Rafael J. Wysocki <[email protected]>
Use the observation that cpufreq_update_util() is only called
by the scheduler with rq->lock held, so the callers of
cpufreq_set_update_util_data() can use synchronize_sched()
instead of synchronize_rcu() to wait for cpufreq_update_util()
to complete. Moreover, if they are updated to do that,
rcu_read_(un)lock() calls in cpufreq_update_util() might be
replaced with rcu_read_(un)lock_sched(), respectively, but
those aren't really necessary, because the scheduler calls
that function from RCU-sched read-side critical sections
already.
In addition to that, if cpufreq_set_update_util_data() checks
the func field in the struct update_util_data before setting
the per-CPU pointer to it, the data->func check may be dropped
from cpufreq_update_util() as well.
Make the above changes to reduce the overhead from
cpufreq_update_util() in the scheduler paths invoking it
and to make the cleanup after removing its callbacks
somewhat less heavy-weight.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
Supersedes https://patchwork.kernel.org/patch/8443191/
---
drivers/cpufreq/cpufreq.c | 23 ++++++++++++++++-------
drivers/cpufreq/cpufreq_governor.c | 2 +-
drivers/cpufreq/intel_pstate.c | 4 ++--
3 files changed, 19 insertions(+), 10 deletions(-)
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -77,12 +77,15 @@ static DEFINE_PER_CPU(struct update_util
* to call from cpufreq_update_util(). That function will be called from an RCU
* read-side critical section, so it must not sleep.
*
- * Callers must use RCU callbacks to free any memory that might be accessed
- * via the old update_util_data pointer or invoke synchronize_rcu() right after
- * this function to avoid use-after-free.
+ * Callers must use RCU-sched callbacks to free any memory that might be
+ * accessed via the old update_util_data pointer or invoke synchronize_sched()
+ * right after this function to avoid use-after-free.
*/
void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
{
+ if (WARN_ON(data && !data->func))
+ return;
+
rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
}
EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
@@ -95,18 +98,24 @@ EXPORT_SYMBOL_GPL(cpufreq_set_update_uti
*
* This function is called by the scheduler on every invocation of
* update_load_avg() on the CPU whose utilization is being updated.
+ *
+ * It can only be called from RCU-sched read-side critical sections.
*/
void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
{
struct update_util_data *data;
- rcu_read_lock();
+#ifdef CONFIG_LOCKDEP
+ WARN_ON(debug_locks && !rcu_read_lock_sched_held());
+#endif
data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
- if (data && data->func)
+ /*
+ * If this isn't inside of an RCU-sched read-side critical section, data
+ * may become NULL after the check below.
+ */
+ if (data)
data->func(data, time, util, max);
-
- rcu_read_unlock();
}
/* Flag to suspend/resume CPUFreq governors */
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -280,7 +280,7 @@ static inline void gov_clear_update_util
for_each_cpu(i, policy->cpus)
cpufreq_set_update_util_data(i, NULL);
- synchronize_rcu();
+ synchronize_sched();
}
static void gov_cancel_work(struct cpufreq_policy *policy)
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -1174,7 +1174,7 @@ static void intel_pstate_stop_cpu(struct
pr_debug("intel_pstate: CPU %d exiting\n", cpu_num);
cpufreq_set_update_util_data(cpu_num, NULL);
- synchronize_rcu();
+ synchronize_sched();
if (hwp_active)
return;
@@ -1442,7 +1442,7 @@ out:
for_each_online_cpu(cpu) {
if (all_cpu_data[cpu]) {
cpufreq_set_update_util_data(cpu, NULL);
- synchronize_rcu();
+ synchronize_sched();
kfree(all_cpu_data[cpu]);
}
}
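For reference, a rough sketch of the register/unregister pattern this
change assumes (my_data and my_update_callback are hypothetical names);
the point is the synchronize_sched() between clearing the pointer and
freeing anything cpufreq_update_util() might still dereference:

	/* Registration: ->func must be set before the pointer is published. */
	my_data->func = my_update_callback;
	cpufreq_set_update_util_data(cpu, my_data);

	/* Removal: clear the pointer, then wait for in-flight callbacks. */
	cpufreq_set_update_util_data(cpu, NULL);
	synchronize_sched();
	kfree(my_data);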
Hi Rafael,
On 2 March 2016 at 03:27, Rafael J. Wysocki <[email protected]> wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> Add a new cpufreq scaling governor, called "schedutil", that uses
> scheduler-provided CPU utilization information as input for making
> its decisions.
>
> Doing that is possible after commit fe7034338ba0 (cpufreq: Add
> mechanism for registering utilization update callbacks) that
> introduced cpufreq_update_util() called by the scheduler on
> utilization changes (from CFS) and RT/DL task status updates.
> In particular, CPU frequency scaling decisions may be based on
> the utilization data passed to cpufreq_update_util() by CFS.
>
> The new governor is relatively simple.
>
> The frequency selection formula used by it is essentially the same
> as the one used by the "ondemand" governor, although it doesn't use
> the additional up_threshold parameter, but instead of computing the
> load as the "non-idle CPU time" to "total CPU time" ratio, it takes
> the utilization data provided by CFS as input. More specifically,
> it represents "load" as the util/max ratio, where util and max
> are the utilization and CPU capacity coming from CFS.
>
[snip]
> +
> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
> + unsigned long util, unsigned long max,
> + unsigned int next_freq)
> +{
> + struct cpufreq_policy *policy = sg_policy->policy;
> + unsigned int rel;
> +
> + if (next_freq > policy->max)
> + next_freq = policy->max;
> + else if (next_freq < policy->min)
> + next_freq = policy->min;
> +
> + sg_policy->last_freq_update_time = time;
> + if (sg_policy->next_freq == next_freq)
> + return;
> +
> + sg_policy->next_freq = next_freq;
> + /*
> + * If utilization is less than max / 4, use RELATION_C to allow the
> + * minimum frequency to be selected more often in case the distance from
> + * it to the next available frequency in the table is significant.
> + */
> + rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L;
> + if (policy->fast_switch_possible) {
> + cpufreq_driver_fast_switch(policy, next_freq, rel);
> + } else {
> + sg_policy->relation = rel;
> + sg_policy->work_in_progress = true;
> + irq_work_queue(&sg_policy->irq_work);
> + }
> +}
> +
> +static void sugov_update_single(struct update_util_data *data, u64 time,
> + unsigned long util, unsigned long max)
> +{
> + struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
> + struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> + unsigned int min_f, max_f, next_f;
> +
> + if (!sugov_should_update_freq(sg_policy, time))
> + return;
> +
> + min_f = sg_policy->policy->cpuinfo.min_freq;
> + max_f = sg_policy->policy->cpuinfo.max_freq;
> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
I think it has been pointed out in another email thread already, but you
should change the way next_f is computed. util reflects the utilization
of a CPU from 0 up to its max compute capacity, whereas ondemand was
using the load at the current frequency during the last time window. I
understand that you want to keep the same formula as ondemand as a
starting point, but you use a different input to calculate the next
frequency, so I don't see the rationale for keeping this formula. That
said, even the simpler formula next_f = util > max ? max_f : util *
max_f / max will not work properly if frequency invariance is enabled,
because the utilization becomes capped by the current compute capacity,
so next_f will never be higher than the current frequency (unless a
task moves onto the rq). That was one reason for using a threshold in
the sched-freq proposal (and there is ongoing development to try to
solve this limitation).
IIUC, frequency invariance is not enabled on your platform, so you have
not seen that problem, but you have probably seen that the selection of
next_f was not really stable. Without frequency invariance, the
utilization will be overestimated when running at a lower frequency, so
the governor will probably select a frequency that is higher than
necessary, but then the utilization will decrease at that higher
frequency, so the governor will probably decrease the frequency, and so
on until it finds the right frequency that generates the right
utilization value.
Regards,
Vincent
> +
> + sugov_update_commit(sg_policy, time, util, max, next_f);
> +}
> +
> +static unsigned int sugov_next_freq(struct sugov_policy *sg_policy,
> + unsigned long util, unsigned long max)
> +{
> + struct cpufreq_policy *policy = sg_policy->policy;
> + unsigned int min_f = policy->cpuinfo.min_freq;
> + unsigned int max_f = policy->cpuinfo.max_freq;
> + u64 last_freq_update_time = sg_policy->last_freq_update_time;
> + unsigned int j;
> +
> + if (util > max)
> + return max_f;
> +
> + for_each_cpu(j, policy->cpus) {
> + struct sugov_cpu *j_sg_cpu;
> + unsigned long j_util, j_max;
> + u64 delta_ns;
> +
> + if (j == smp_processor_id())
> + continue;
> +
> + j_sg_cpu = &per_cpu(sugov_cpu, j);
> + /*
> + * If the CPU utilization was last updated before the previous
> + * frequency update and the time elapsed between the last update
> + * of the CPU utilization and the last frequency update is long
> + * enough, don't take the CPU into account as it probably is
> + * idle now.
> + */
> + delta_ns = last_freq_update_time - j_sg_cpu->last_update;
> + if ((s64)delta_ns > NSEC_PER_SEC / HZ)
> + continue;
> +
> + j_util = j_sg_cpu->util;
> + j_max = j_sg_cpu->max;
> + if (j_util > j_max)
> + return max_f;
> +
> + if (j_util * max > j_max * util) {
> + util = j_util;
> + max = j_max;
> + }
> + }
> +
> + return min_f + util * (max_f - min_f) / max;
> +}
> +
[snip]
On Wed, Mar 2, 2016 at 6:10 PM, Vincent Guittot
<[email protected]> wrote:
> Hi Rafael,
>
>
> On 2 March 2016 at 03:27, Rafael J. Wysocki <[email protected]> wrote:
>> From: Rafael J. Wysocki <[email protected]>
>>
>> Add a new cpufreq scaling governor, called "schedutil", that uses
>> scheduler-provided CPU utilization information as input for making
>> its decisions.
>>
>> Doing that is possible after commit fe7034338ba0 (cpufreq: Add
>> mechanism for registering utilization update callbacks) that
>> introduced cpufreq_update_util() called by the scheduler on
>> utilization changes (from CFS) and RT/DL task status updates.
>> In particular, CPU frequency scaling decisions may be based on
>> the utilization data passed to cpufreq_update_util() by CFS.
>>
>> The new governor is relatively simple.
>>
>> The frequency selection formula used by it is essentially the same
>> as the one used by the "ondemand" governor, although it doesn't use
>> the additional up_threshold parameter, but instead of computing the
>> load as the "non-idle CPU time" to "total CPU time" ratio, it takes
>> the utilization data provided by CFS as input. More specifically,
>> it represents "load" as the util/max ratio, where util and max
>> are the utilization and CPU capacity coming from CFS.
>>
>
> [snip]
>
>> +
>> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>> + unsigned long util, unsigned long max,
>> + unsigned int next_freq)
>> +{
>> + struct cpufreq_policy *policy = sg_policy->policy;
>> + unsigned int rel;
>> +
>> + if (next_freq > policy->max)
>> + next_freq = policy->max;
>> + else if (next_freq < policy->min)
>> + next_freq = policy->min;
>> +
>> + sg_policy->last_freq_update_time = time;
>> + if (sg_policy->next_freq == next_freq)
>> + return;
>> +
>> + sg_policy->next_freq = next_freq;
>> + /*
>> + * If utilization is less than max / 4, use RELATION_C to allow the
>> + * minimum frequency to be selected more often in case the distance from
>> + * it to the next available frequency in the table is significant.
>> + */
>> + rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L;
>> + if (policy->fast_switch_possible) {
>> + cpufreq_driver_fast_switch(policy, next_freq, rel);
>> + } else {
>> + sg_policy->relation = rel;
>> + sg_policy->work_in_progress = true;
>> + irq_work_queue(&sg_policy->irq_work);
>> + }
>> +}
>> +
>> +static void sugov_update_single(struct update_util_data *data, u64 time,
>> + unsigned long util, unsigned long max)
>> +{
>> + struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
>> + struct sugov_policy *sg_policy = sg_cpu->sg_policy;
>> + unsigned int min_f, max_f, next_f;
>> +
>> + if (!sugov_should_update_freq(sg_policy, time))
>> + return;
>> +
>> + min_f = sg_policy->policy->cpuinfo.min_freq;
>> + max_f = sg_policy->policy->cpuinfo.max_freq;
>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
>
> I think it has been pointed out in another email's thread but you
> should change the way the next_f is computed. util reflects the
> utilization of a CPU from 0 to its max compute capacity whereas
> ondemand was using the load at the current frequency during the last
> time window. I have understood that you want to keep same formula than
> ondemand as a starting point but you use a different input to
> calculate the next frequency so i don't see the rational of keeping
> this formula.
It is a formula that causes the entire available frequency range to be
utilized proportionally to the utilization as reported by the
scheduler (modulo the policy->min/max limits). Its (significant IMO)
advantage is that it doesn't require any additional factors that would
need to be determined somehow.
> Saying that, even the simple formula next_f = util > max
> ? max_f : util * (max_f) / max will not work properly if the frequency
> invariance is enable because the utilization becomes capped by the
> current compute capacity so next_f will never be higher than current
> freq (unless a task move on the rq). That was one reason of using a
> threshold in sched-freq proposal (and there are on going dev to try to
> solve this limitation).
Well, a different formula will have to be used along with frequency
invariance, then.
> IIIUC, frequency invariance is not enable on your platform so you have
> not seen the problem but you have probably see that selection of your
> next_f was not really stable. Without frequency invariance, the
> utilization will be overestimated when running at lower frequency so
> the governor will probably select a frequency that is higher than
> necessary but then the utilization will decrease at this higher
> frequency so the governor will probably decrease the frequency and so
> on until you found the right frequency that will generate the right
> utilisation value
I don't have any problems with that to be honest and if you aim at
selecting the perfect frequency at the first attempt, then good luck
with that anyway.
Now, I'm not saying that the formula used in this patch cannot be
improved or similar. It very well may be possible to improve it. I'm
only saying that it is good enough to start with, because of the
reasons mentioned above.
Still, if you can suggest to me what other formula specifically should
be used here, I'll consider using it. Which will probably mean
comparing the two and seeing which one leads to better results.
Thanks,
Rafael
On Wed, Mar 2, 2016 at 6:58 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Wed, Mar 2, 2016 at 6:10 PM, Vincent Guittot
> <[email protected]> wrote:
>> Hi Rafael,
>>
>>
>> On 2 March 2016 at 03:27, Rafael J. Wysocki <[email protected]> wrote:
>>> From: Rafael J. Wysocki <[email protected]>
>>>
>>> Add a new cpufreq scaling governor, called "schedutil", that uses
>>> scheduler-provided CPU utilization information as input for making
>>> its decisions.
>>>
>>> Doing that is possible after commit fe7034338ba0 (cpufreq: Add
>>> mechanism for registering utilization update callbacks) that
>>> introduced cpufreq_update_util() called by the scheduler on
>>> utilization changes (from CFS) and RT/DL task status updates.
>>> In particular, CPU frequency scaling decisions may be based on
>>> the utilization data passed to cpufreq_update_util() by CFS.
>>>
>>> The new governor is relatively simple.
>>>
>>> The frequency selection formula used by it is essentially the same
>>> as the one used by the "ondemand" governor, although it doesn't use
>>> the additional up_threshold parameter, but instead of computing the
>>> load as the "non-idle CPU time" to "total CPU time" ratio, it takes
>>> the utilization data provided by CFS as input. More specifically,
>>> it represents "load" as the util/max ratio, where util and max
>>> are the utilization and CPU capacity coming from CFS.
>>>
>>
>> [snip]
>>
>>> +
>>> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>>> + unsigned long util, unsigned long max,
>>> + unsigned int next_freq)
>>> +{
>>> + struct cpufreq_policy *policy = sg_policy->policy;
>>> + unsigned int rel;
>>> +
>>> + if (next_freq > policy->max)
>>> + next_freq = policy->max;
>>> + else if (next_freq < policy->min)
>>> + next_freq = policy->min;
>>> +
>>> + sg_policy->last_freq_update_time = time;
>>> + if (sg_policy->next_freq == next_freq)
>>> + return;
>>> +
>>> + sg_policy->next_freq = next_freq;
>>> + /*
>>> + * If utilization is less than max / 4, use RELATION_C to allow the
>>> + * minimum frequency to be selected more often in case the distance from
>>> + * it to the next available frequency in the table is significant.
>>> + */
>>> + rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L;
>>> + if (policy->fast_switch_possible) {
>>> + cpufreq_driver_fast_switch(policy, next_freq, rel);
>>> + } else {
>>> + sg_policy->relation = rel;
>>> + sg_policy->work_in_progress = true;
>>> + irq_work_queue(&sg_policy->irq_work);
>>> + }
>>> +}
>>> +
>>> +static void sugov_update_single(struct update_util_data *data, u64 time,
>>> + unsigned long util, unsigned long max)
>>> +{
>>> + struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
>>> + struct sugov_policy *sg_policy = sg_cpu->sg_policy;
>>> + unsigned int min_f, max_f, next_f;
>>> +
>>> + if (!sugov_should_update_freq(sg_policy, time))
>>> + return;
>>> +
>>> + min_f = sg_policy->policy->cpuinfo.min_freq;
>>> + max_f = sg_policy->policy->cpuinfo.max_freq;
>>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
>>
>> I think it has been pointed out in another email's thread but you
>> should change the way the next_f is computed. util reflects the
>> utilization of a CPU from 0 to its max compute capacity whereas
>> ondemand was using the load at the current frequency during the last
>> time window. I have understood that you want to keep same formula than
>> ondemand as a starting point but you use a different input to
>> calculate the next frequency so i don't see the rational of keeping
>> this formula.
>
> It is a formula that causes the entire available frequency range to be
> utilized proportionally to the utilization as reported by the
> scheduler (modulo the policy->min/max limits). Its (significant IMO)
> advantage is that it doesn't require any additional factors that would
> need to be determined somehow.
In case a more formal derivation of this formula is needed, it is
based on the following 3 assumptions:
(1) Performance is a linear function of frequency.
(2) Required performance is a linear function of the utilization ratio
x = util/max as provided by the scheduler (0 <= x <= 1).
(3) The minimum possible frequency (min_freq) corresponds to x = 0 and
the maximum possible frequency (max_freq) corresponds to x = 1.
(1) and (2) combined imply that
f = a * x + b
(f - frequency, a, b - constants to be determined) and then (3) quite
trivially leads to b = min_freq and a = max_freq - min_freq.
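As a rough sketch, plugging those constants into C gives the next_f
expression used in the patch (linear_map_freq is a hypothetical helper,
shown only to spell the mapping out):

static unsigned int linear_map_freq(unsigned long util, unsigned long max,
				    unsigned int min_f, unsigned int max_f)
{
	/* f = a * x + b with x = util / max, a = max_f - min_f, b = min_f */
	if (util > max)
		return max_f;

	return min_f + util * (max_f - min_f) / max;
}

For example, min_f = 500, max_f = 1000 and util/max = 1/2 map to 750,
the midpoint of the range.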
Now, of course, the linearity assumptions may be questioned, but then
it's just the first approximation. If you go any further, though, you
end up with an expansion series like this:
f(x) = c_0 + c_1 * x + c_2 * x^2 + c_3 * x^3 + ...
where all of the c_j need to be determined in principle. With luck,
if you can guess what kind of a function f(x) may be, it may be
possible to reduce the number of coefficients to determine, but the
question is whether or not that is going to work universally for all
systems.
Thanks,
Rafael
On 02-03-16, 03:04, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> Use the observation that cpufreq_update_util() is only called
> by the scheduler with rq->lock held, so the callers of
> cpufreq_set_update_util_data() can use synchronize_sched()
> instead of synchronize_rcu() to wait for cpufreq_update_util()
> to complete. Moreover, if they are updated to do that,
> rcu_read_(un)lock() calls in cpufreq_update_util() might be
> replaced with rcu_read_(un)lock_sched(), respectively, but
> those aren't really necessary, because the scheduler calls
> that function from RCU-sched read-side critical sections
> already.
>
> In addition to that, if cpufreq_set_update_util_data() checks
> the func field in the struct update_util_data before setting
> the per-CPU pointer to it, the data->func check may be dropped
> from cpufreq_update_util() as well.
>
> Make the above changes to reduce the overhead from
> cpufreq_update_util() in the scheduler paths invoking it
> and to make the cleanup after removing its callbacks less
> heavy-weight somewhat.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
>
> Supersedes https://patchwork.kernel.org/patch/8443191/
Acked-by: Viresh Kumar <[email protected]>
--
viresh
On 02-03-16, 03:08, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> In addition to fields representing governor tunables, struct dbs_data
> contains some fields needed for the management of objects of that
> type. As it turns out, that part of struct dbs_data may be shared
> with (future) governors that won't use the common code used by
> "ondemand" and "conservative", so move it to a separate struct type
> and modify the code using struct dbs_data to follow.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
> drivers/cpufreq/cpufreq_conservative.c | 15 +++--
> drivers/cpufreq/cpufreq_governor.c | 90 ++++++++++++++++++++-------------
> drivers/cpufreq/cpufreq_governor.h | 36 +++++++------
> drivers/cpufreq/cpufreq_ondemand.c | 19 ++++--
> 4 files changed, 97 insertions(+), 63 deletions(-)
>
> Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
> +++ linux-pm/drivers/cpufreq/cpufreq_governor.h
> @@ -41,6 +41,13 @@
> /* Ondemand Sampling types */
> enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};
>
> +struct gov_tunables {
> + struct kobject kobj;
> + struct list_head policy_list;
> + struct mutex update_lock;
> + int usage_count;
> +};
Everything else looks fine, but I don't think you have named this
properly. Everything else present in struct dbs_data is a tunable,
but this isn't, and so gov_tunables doesn't suit at all here.
--
viresh
On 02-03-16, 03:12, Rafael J. Wysocki wrote:
> Index: linux-pm/drivers/cpufreq/cpufreq.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/cpufreq.c
> +++ linux-pm/drivers/cpufreq/cpufreq.c
> @@ -1772,6 +1772,39 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie
> * GOVERNORS *
> *********************************************************************/
>
> +/**
> + * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch.
> + * @policy: cpufreq policy to switch the frequency for.
> + * @target_freq: New frequency to set (may be approximate).
> + * @relation: Relation to use for frequency selection.
> + *
> + * Carry out a fast frequency switch from interrupt context.
> + *
> + * This function must not be called if policy->fast_switch_possible is unset.
> + *
> + * Governors calling this function must guarantee that it will never be invoked
> + * twice in parallel for the same policy and that it will never be called in
> + * parallel with either ->target() or ->target_index() for the same policy.
> + *
> + * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch()
> + * callback, the hardware configuration must be preserved.
> + */
> +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
> + unsigned int target_freq, unsigned int relation)
> +{
> + unsigned int freq;
> +
> + if (target_freq == policy->cur)
Maybe an unlikely() here ?
> + return;
> +
> + freq = cpufreq_driver->fast_switch(policy, target_freq, relation);
> + if (freq != CPUFREQ_ENTRY_INVALID) {
> + policy->cur = freq;
Hmm.. what will happen to the code relying on the cpufreq notifiers
now?
> + trace_cpu_frequency(freq, smp_processor_id());
> + }
> +}
> +EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch);
--
viresh
On 02-03-16, 03:10, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> Move abstract code related to struct gov_tunables to a separate (new)
> file so it can be shared with (future) governors that won't share
> more code with "ondemand" and "conservative".
>
> No intentional functional changes.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
> drivers/cpufreq/Kconfig | 4 +
> drivers/cpufreq/Makefile | 1
> drivers/cpufreq/cpufreq_governor.c | 82 ---------------------------
> drivers/cpufreq/cpufreq_governor.h | 6 ++
> drivers/cpufreq/cpufreq_governor_tunables.c | 84 ++++++++++++++++++++++++++++
These aren't governor tunables, are they? The tunables are the fields
that can be tuned, but this is something else.
--
viresh
On Wed, Mar 02, 2016 at 03:12:33AM +0100, Rafael J. Wysocki wrote:
> The most important change from the previous version is that the
> ->fast_switch() callback takes an additional "relation" argument
> and now the governor can use it to choose a selection method.
> +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
> + unsigned int target_freq,
> + unsigned int relation)
Would it make sense to replace the {target_freq, relation} pair with
something like the CPPC {min_freq, max_freq} pair?
Then you could use the closest frequency to max provided it is larger
than min.
This communicates more actual information in the same number of
parameters and would thereby allow for a more flexible (better)
frequency selection.
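For concreteness, that could look something like the hypothetical
prototype below (just a sketch of the idea, not an API from the posted
series):

	unsigned int (*fast_switch)(struct cpufreq_policy *policy,
				    unsigned int min_freq,
				    unsigned int max_freq);
	/*
	 * Pick the available frequency closest to max_freq, provided it
	 * is not below min_freq, and return the frequency actually set.
	 */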
On Wed, Mar 02, 2016 at 03:12:33AM +0100, Rafael J. Wysocki wrote:
> +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
> + unsigned int target_freq, unsigned int relation)
> +{
> + unsigned int freq;
> +
> + if (target_freq == policy->cur)
> + return;
But what if relation is different from last time? ;-)
Hi,
On 02/03/16 03:04, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
[...]
> @@ -95,18 +98,24 @@ EXPORT_SYMBOL_GPL(cpufreq_set_update_uti
> *
> * This function is called by the scheduler on every invocation of
> * update_load_avg() on the CPU whose utilization is being updated.
> + *
> + * It can only be called from RCU-sched read-side critical sections.
> */
> void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
> {
> struct update_util_data *data;
>
> - rcu_read_lock();
> +#ifdef CONFIG_LOCKDEP
> + WARN_ON(debug_locks && !rcu_read_lock_sched_held());
> +#endif
>
> data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
I think you need to s/rcu_dereference/rcu_dereference_sched/ here or
RCU will complain:
[ 0.106313] ===============================
[ 0.106322] [ INFO: suspicious RCU usage. ]
[ 0.106334] 4.5.0-rc6+ #93 Not tainted
[ 0.106342] -------------------------------
[ 0.106353] /media/hdd1tb/work/integration/kernel/drivers/cpufreq/cpufreq.c:113 suspicious rcu_dereference_check() usage!
[ 0.106361]
[ 0.106361] other info that might help us debug this:
[ 0.106361]
[ 0.106375]
[ 0.106375] rcu_scheduler_active = 1, debug_locks = 1
[ 0.106387] 1 lock held by swapper/0/0:
[ 0.106395] #0: (&rq->lock){-.....}, at: [<ffffffc000743204>] __schedule+0xec/0xadc
[ 0.106436]
[ 0.106436] stack backtrace:
[ 0.106450] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.5.0-rc6+ #93
[ 0.106459] Hardware name: ARM Juno development board (r2) (DT)
[ 0.106468] Call trace:
[ 0.106483] [<ffffffc00008a8a8>] dump_backtrace+0x0/0x210
[ 0.106496] [<ffffffc00008aad8>] show_stack+0x20/0x28
[ 0.106511] [<ffffffc0004261a4>] dump_stack+0xa8/0xe0
[ 0.106526] [<ffffffc000120e9c>] lockdep_rcu_suspicious+0xd4/0x114
[ 0.106540] [<ffffffc0005d8180>] cpufreq_update_util+0xd4/0xd8
[ 0.106554] [<ffffffc000105b9c>] set_next_entity+0x540/0xf7c
[ 0.106569] [<ffffffc00010f78c>] pick_next_task_fair+0x9c/0x754
[ 0.106580] [<ffffffc00074351c>] __schedule+0x404/0xadc
[ 0.106592] [<ffffffc000743de0>] schedule+0x40/0xa0
[ 0.106603] [<ffffffc000744094>] schedule_preempt_disabled+0x1c/0x2c
[ 0.106617] [<ffffffc000741190>] rest_init+0x14c/0x164
[ 0.106631] [<ffffffc0009f9990>] start_kernel+0x3c0/0x3d4
[ 0.106642] [<ffffffc0000811b4>] 0xffffffc0000811b4
Best,
- Juri
On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote:
> >>> + min_f = sg_policy->policy->cpuinfo.min_freq;
> >>> + max_f = sg_policy->policy->cpuinfo.max_freq;
> >>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
> In case a more formal derivation of this formula is needed, it is
> based on the following 3 assumptions:
>
> (1) Performance is a linear function of frequency.
> (2) Required performance is a linear function of the utilization ratio
> x = util/max as provided by the scheduler (0 <= x <= 1).
> (3) The minimum possible frequency (min_freq) corresponds to x = 0 and
> the maximum possible frequency (max_freq) corresponds to x = 1.
>
> (1) and (2) combined imply that
>
> f = a * x + b
>
> (f - frequency, a, b - constants to be determined) and then (3) quite
> trivially leads to b = min_freq and a = max_freq - min_freq.
3 is the problem, that just doesn't make sense and is probably the
reason why you see very little selection of the min freq.
Suppose a machine with the following frequencies:
500, 750, 1000
And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) =
700 make any sense? Per your point 1, it should be asking for
0.4 * 1000 = 400.
Because, per 1, at 500 it runs exactly half as fast as at 1000, and we
only need 0.4 times as much. Therefore 500 is more than sufficient.
Note. we all know that 1 is a 'broken' assumption, but lacking anything
better I think its a reasonable one to make.
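Spelled out on that example, assuming CPUFREQ_RELATION_L rounding to
the lowest table frequency at or above the request:

	posted formula:       500 + 0.4 * (1000 - 500) = 700  ->  750 is selected
	proportional formula: 0.4 * 1000               = 400  ->  500 is selected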
On 03/03/16 13:20, Peter Zijlstra wrote:
> On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote:
> > >>> + min_f = sg_policy->policy->cpuinfo.min_freq;
> > >>> + max_f = sg_policy->policy->cpuinfo.max_freq;
> > >>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
>
> > In case a more formal derivation of this formula is needed, it is
> > based on the following 3 assumptions:
> >
> > (1) Performance is a linear function of frequency.
> > (2) Required performance is a linear function of the utilization ratio
> > x = util/max as provided by the scheduler (0 <= x <= 1).
>
> > (3) The minimum possible frequency (min_freq) corresponds to x = 0 and
> > the maximum possible frequency (max_freq) corresponds to x = 1.
> >
> > (1) and (2) combined imply that
> >
> > f = a * x + b
> >
> > (f - frequency, a, b - constants to be determined) and then (3) quite
> > trivially leads to b = min_freq and a = max_freq - min_freq.
>
> 3 is the problem, that just doesn't make sense and is probably the
> reason why you see very little selection of the min freq.
>
> Suppose a machine with the following frequencies:
>
> 500, 750, 1000
>
> And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) =
> 700 make any sense? Per your point 1, it should should be asking for
> 0.4 * 1000 = 400.
>
> Because, per 1, at 500 it runs exactly half as fast as at 1000, and we
> only need 0.4 times as much. Therefore 500 is more than sufficient.
>
Oh, and that is probably also why the governor can reach max OPP with
freq invariance enabled (the point Vincent was making). When we run at
500, the util signal is capped at that capacity, but the formula makes
us request more, so we can jump to the next step, and so on.
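Plugging in the same 500/750/1000 numbers (a rough sketch, assuming a
CPU-bound task whose util is capped at the current capacity): at 500
the capped ratio is 0.5, so the formula asks for 500 + 0.5 * 500 = 750;
at 750 the cap is 0.75, giving 500 + 0.75 * 500 = 875, which RELATION_L
rounds up to 1000. Each request lands above the current OPP, so the
governor keeps stepping up despite the capped utilization.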
On Thu, Mar 03, 2016 at 11:47:01AM +0000, Juri Lelli wrote:
> > +#ifdef CONFIG_LOCKDEP
> > + WARN_ON(debug_locks && !rcu_read_lock_sched_held());
> > +#endif
> >
> > data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
>
> I think you need to s/rcu_dereference/rcu_dereference_sched/ here or
> RCU will complain:
Ah, indeed ;-)
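That is, the line would presumably become (applying Juri's
s/rcu_dereference/rcu_dereference_sched/):

	data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));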
On 2 March 2016 at 18:58, Rafael J. Wysocki <[email protected]> wrote:
> On Wed, Mar 2, 2016 at 6:10 PM, Vincent Guittot
> <[email protected]> wrote:
>> Hi Rafael,
>>
>>
>> On 2 March 2016 at 03:27, Rafael J. Wysocki <[email protected]> wrote:
>>> From: Rafael J. Wysocki <[email protected]>
>>>
>>> Add a new cpufreq scaling governor, called "schedutil", that uses
>>> scheduler-provided CPU utilization information as input for making
>>> its decisions.
>>>
>>> Doing that is possible after commit fe7034338ba0 (cpufreq: Add
>>> mechanism for registering utilization update callbacks) that
>>> introduced cpufreq_update_util() called by the scheduler on
>>> utilization changes (from CFS) and RT/DL task status updates.
>>> In particular, CPU frequency scaling decisions may be based on
>>> the utilization data passed to cpufreq_update_util() by CFS.
>>>
>>> The new governor is relatively simple.
>>>
>>> The frequency selection formula used by it is essentially the same
>>> as the one used by the "ondemand" governor, although it doesn't use
>>> the additional up_threshold parameter, but instead of computing the
>>> load as the "non-idle CPU time" to "total CPU time" ratio, it takes
>>> the utilization data provided by CFS as input. More specifically,
>>> it represents "load" as the util/max ratio, where util and max
>>> are the utilization and CPU capacity coming from CFS.
>>>
>>
>> [snip]
>>
>>> +
>>> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>>> + unsigned long util, unsigned long max,
>>> + unsigned int next_freq)
>>> +{
>>> + struct cpufreq_policy *policy = sg_policy->policy;
>>> + unsigned int rel;
>>> +
>>> + if (next_freq > policy->max)
>>> + next_freq = policy->max;
>>> + else if (next_freq < policy->min)
>>> + next_freq = policy->min;
>>> +
>>> + sg_policy->last_freq_update_time = time;
>>> + if (sg_policy->next_freq == next_freq)
>>> + return;
>>> +
>>> + sg_policy->next_freq = next_freq;
>>> + /*
>>> + * If utilization is less than max / 4, use RELATION_C to allow the
>>> + * minimum frequency to be selected more often in case the distance from
>>> + * it to the next available frequency in the table is significant.
>>> + */
>>> + rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L;
>>> + if (policy->fast_switch_possible) {
>>> + cpufreq_driver_fast_switch(policy, next_freq, rel);
>>> + } else {
>>> + sg_policy->relation = rel;
>>> + sg_policy->work_in_progress = true;
>>> + irq_work_queue(&sg_policy->irq_work);
>>> + }
>>> +}
>>> +
>>> +static void sugov_update_single(struct update_util_data *data, u64 time,
>>> + unsigned long util, unsigned long max)
>>> +{
>>> + struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
>>> + struct sugov_policy *sg_policy = sg_cpu->sg_policy;
>>> + unsigned int min_f, max_f, next_f;
>>> +
>>> + if (!sugov_should_update_freq(sg_policy, time))
>>> + return;
>>> +
>>> + min_f = sg_policy->policy->cpuinfo.min_freq;
>>> + max_f = sg_policy->policy->cpuinfo.max_freq;
>>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
>>
>> I think it has been pointed out in another email's thread but you
>> should change the way the next_f is computed. util reflects the
>> utilization of a CPU from 0 to its max compute capacity whereas
>> ondemand was using the load at the current frequency during the last
>> time window. I have understood that you want to keep same formula than
>> ondemand as a starting point but you use a different input to
>> calculate the next frequency so i don't see the rational of keeping
>> this formula.
>
> It is a formula that causes the entire available frequency range to be
> utilized proportionally to the utilization as reported by the
> scheduler (modulo the policy->min/max limits). Its (significant IMO)
> advantage is that it doesn't require any additional factors that would
> need to be determined somehow.
>
>> Saying that, even the simple formula next_f = util > max
>> ? max_f : util * (max_f) / max will not work properly if the frequency
>> invariance is enable because the utilization becomes capped by the
>> current compute capacity so next_f will never be higher than current
>> freq (unless a task move on the rq). That was one reason of using a
>> threshold in sched-freq proposal (and there are on going dev to try to
>> solve this limitation).
>
> Well, a different formula will have to be used along with frequency
> invariance, then.
>
>> IIIUC, frequency invariance is not enable on your platform so you have
>> not seen the problem but you have probably see that selection of your
>> next_f was not really stable. Without frequency invariance, the
>> utilization will be overestimated when running at lower frequency so
>> the governor will probably select a frequency that is higher than
>> necessary but then the utilization will decrease at this higher
>> frequency so the governor will probably decrease the frequency and so
>> on until you found the right frequency that will generate the right
>> utilisation value
>
> I don't have any problems with that to be honest and if you aim at
> selecting the perfect frequency at the first attempt, then good luck
> with that anyway.
I mainly want to prevent useless periodic frequency switches caused by
a utilization that changes with the current frequency (if frequency
invariance is not used), which can make the formula select a frequency
other than the current one. That is what I can see when testing it.
Sorry for the late reply, I was trying to do some tests on my board but
was facing a crash issue (not linked to your patchset). So I have done
some tests and I can see such unstable behavior. I have generated a
load of 33% at max frequency (3 ms running every 9 ms) and I can see
the frequency toggling without any good reason. That said, I can see
similar behavior with ondemand.
Vincent
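(For reference, a minimal sketch of that kind of periodic load, as a
hypothetical userspace snippet rather than the actual test used above:
busy for roughly 3 ms out of every 9 ms, i.e. ~33% load at max
frequency.)

#include <time.h>

static void burn_ns(long ns)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000000000L +
		 (now.tv_nsec - start.tv_nsec) < ns);
}

int main(void)
{
	const struct timespec sleep_ts = { 0, 6 * 1000 * 1000 };

	for (;;) {
		burn_ns(3 * 1000 * 1000);	/* ~3 ms of busy work */
		nanosleep(&sleep_ts, NULL);	/* ~6 ms of sleep */
	}
}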
>
> Now, I'm not saying that the formula used in this patch cannot be
> improved or similar. It very well may be possible to improve it. I'm
> only saying that it is good enough to start with, because of the
> reasons mentioned above.
>
> Still, if you can suggest to me what other formula specifically should
> be used here, I'll consider using it. Which will probably mean
> comparing the two and seeing which one leads to better results.
>
> Thanks,
> Rafael
On 2 March 2016 at 23:49, Rafael J. Wysocki <[email protected]> wrote:
> On Wed, Mar 2, 2016 at 6:58 PM, Rafael J. Wysocki <[email protected]> wrote:
>> On Wed, Mar 2, 2016 at 6:10 PM, Vincent Guittot
>> <[email protected]> wrote:
>>> Hi Rafael,
>>>
>>>
>>> On 2 March 2016 at 03:27, Rafael J. Wysocki <[email protected]> wrote:
>>>> From: Rafael J. Wysocki <[email protected]>
>>>>
>>>> Add a new cpufreq scaling governor, called "schedutil", that uses
>>>> scheduler-provided CPU utilization information as input for making
>>>> its decisions.
>>>>
>>>> Doing that is possible after commit fe7034338ba0 (cpufreq: Add
>>>> mechanism for registering utilization update callbacks) that
>>>> introduced cpufreq_update_util() called by the scheduler on
>>>> utilization changes (from CFS) and RT/DL task status updates.
>>>> In particular, CPU frequency scaling decisions may be based on
>>>> the utilization data passed to cpufreq_update_util() by CFS.
>>>>
>>>> The new governor is relatively simple.
>>>>
>>>> The frequency selection formula used by it is essentially the same
>>>> as the one used by the "ondemand" governor, although it doesn't use
>>>> the additional up_threshold parameter, but instead of computing the
>>>> load as the "non-idle CPU time" to "total CPU time" ratio, it takes
>>>> the utilization data provided by CFS as input. More specifically,
>>>> it represents "load" as the util/max ratio, where util and max
>>>> are the utilization and CPU capacity coming from CFS.
>>>>
>>>
>>> [snip]
>>>
>>>> +
>>>> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>>>> + unsigned long util, unsigned long max,
>>>> + unsigned int next_freq)
>>>> +{
>>>> + struct cpufreq_policy *policy = sg_policy->policy;
>>>> + unsigned int rel;
>>>> +
>>>> + if (next_freq > policy->max)
>>>> + next_freq = policy->max;
>>>> + else if (next_freq < policy->min)
>>>> + next_freq = policy->min;
>>>> +
>>>> + sg_policy->last_freq_update_time = time;
>>>> + if (sg_policy->next_freq == next_freq)
>>>> + return;
>>>> +
>>>> + sg_policy->next_freq = next_freq;
>>>> + /*
>>>> + * If utilization is less than max / 4, use RELATION_C to allow the
>>>> + * minimum frequency to be selected more often in case the distance from
>>>> + * it to the next available frequency in the table is significant.
>>>> + */
>>>> + rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L;
>>>> + if (policy->fast_switch_possible) {
>>>> + cpufreq_driver_fast_switch(policy, next_freq, rel);
>>>> + } else {
>>>> + sg_policy->relation = rel;
>>>> + sg_policy->work_in_progress = true;
>>>> + irq_work_queue(&sg_policy->irq_work);
>>>> + }
>>>> +}
>>>> +
>>>> +static void sugov_update_single(struct update_util_data *data, u64 time,
>>>> + unsigned long util, unsigned long max)
>>>> +{
>>>> + struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
>>>> + struct sugov_policy *sg_policy = sg_cpu->sg_policy;
>>>> + unsigned int min_f, max_f, next_f;
>>>> +
>>>> + if (!sugov_should_update_freq(sg_policy, time))
>>>> + return;
>>>> +
>>>> + min_f = sg_policy->policy->cpuinfo.min_freq;
>>>> + max_f = sg_policy->policy->cpuinfo.max_freq;
>>>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
>>>
>>> I think it has been pointed out in another email's thread but you
>>> should change the way the next_f is computed. util reflects the
>>> utilization of a CPU from 0 to its max compute capacity whereas
>>> ondemand was using the load at the current frequency during the last
>>> time window. I have understood that you want to keep same formula than
>>> ondemand as a starting point but you use a different input to
>>> calculate the next frequency so i don't see the rational of keeping
>>> this formula.
>>
>> It is a formula that causes the entire available frequency range to be
>> utilized proportionally to the utilization as reported by the
>> scheduler (modulo the policy->min/max limits). Its (significant IMO)
>> advantage is that it doesn't require any additional factors that would
>> need to be determined somehow.
>
> In case a more formal derivation of this formula is needed, it is
> based on the following 3 assumptions:
>
> (1) Performance is a linear function of frequency.
> (2) Required performance is a linear function of the utilization ratio
> x = util/max as provided by the scheduler (0 <= x <= 1).
Just to mention that the utilization you are using varies with the
frequency, which adds another variable to your equation
> (3) The minimum possible frequency (min_freq) corresponds to x = 0 and
> the maximum possible frequency (max_freq) corresponds to x = 1.
>
> (1) and (2) combined imply that
>
> f = a * x + b
>
> (f - frequency, a, b - constants to be determined) and then (3) quite
> trivially leads to b = min_freq and a = max_freq - min_freq.
>
> Now, of course, the linearity assumptions may be questioned, but then
> it's just the first approximation. If you go any further, though, you
> end up with an expansion series like this:
>
> f(x) = c_0 + c_1 * x + c_2 * x^2 + c_3 * x^3 + ...
>
> where all of the c_j need to be determined in principle. With luck,
> if you can guess what kind of a function f(x) may be, it may be
> possible to reduce the number of coefficients to determine, but
> question is whether or not that is going to work universally for all
> systems.
>
> Thanks,
> Rafael
On Thu, Mar 03, 2016 at 03:01:15PM +0100, Vincent Guittot wrote:
> > In case a more formal derivation of this formula is needed, it is
> > based on the following 3 assumptions:
> >
> > (1) Performance is a linear function of frequency.
> > (2) Required performance is a linear function of the utilization ratio
> > x = util/max as provided by the scheduler (0 <= x <= 1).
>
> Just to mention that the utilization that you are using, varies with
> the frequency which add another variable in your equation
Right, x86 hasn't implemented arch_scale_freq_capacity(), so the
utilization values we use are all over the map. If we lower freq, the
util will go up, which would result in us bumping the freq again, etc..
On Thu, Mar 3, 2016 at 1:20 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote:
>> >>> + min_f = sg_policy->policy->cpuinfo.min_freq;
>> >>> + max_f = sg_policy->policy->cpuinfo.max_freq;
>> >>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
>
>> In case a more formal derivation of this formula is needed, it is
>> based on the following 3 assumptions:
>>
>> (1) Performance is a linear function of frequency.
>> (2) Required performance is a linear function of the utilization ratio
>> x = util/max as provided by the scheduler (0 <= x <= 1).
>
>> (3) The minimum possible frequency (min_freq) corresponds to x = 0 and
>> the maximum possible frequency (max_freq) corresponds to x = 1.
>>
>> (1) and (2) combined imply that
>>
>> f = a * x + b
>>
>> (f - frequency, a, b - constants to be determined) and then (3) quite
>> trivially leads to b = min_freq and a = max_freq - min_freq.
>
> 3 is the problem, that just doesn't make sense and is probably the
> reason why you see very little selection of the min freq.
It is about mapping the entire [0,1] interval to the available frequency range.
It will overprovision things (the smaller x, the more), but then it may
help race-to-idle a bit, in theory.
> Suppose a machine with the following frequencies:
>
> 500, 750, 1000
>
> And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) =
> 700 make any sense? Per your point 1, it should should be asking for
> 0.4 * 1000 = 400.
>
> Because, per 1, at 500 it runs exactly half as fast as at 1000, and we
> only need 0.4 times as much. Therefore 500 is more than sufficient.
OK, but then I don't see why this reasoning only applies to the lower
bound of the frequency range. Is there any reason why x = 1 should be
the only point mapping to max_freq?
If not, then I think it's reasonable to map the middle of the
available frequency range to x = 0.5 and then we have b = 0 and a =
(max_freq + min_freq) / 2.
I'll try that and see how it goes.
> Note. we all know that 1 is a 'broken' assumption, but lacking anything
> better I think its a reasonable one to make.
Right.
On Thu, Mar 03, 2016 at 04:38:17PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 03, 2016 at 03:01:15PM +0100, Vincent Guittot wrote:
> > > In case a more formal derivation of this formula is needed, it is
> > > based on the following 3 assumptions:
> > >
> > > (1) Performance is a linear function of frequency.
> > > (2) Required performance is a linear function of the utilization ratio
> > > x = util/max as provided by the scheduler (0 <= x <= 1).
> >
> > Just to mention that the utilization that you are using, varies with
> > the frequency which add another variable in your equation
>
> Right, x86 hasn't implemented arch_scale_freq_capacity(), so the
> utilization values we use are all over the map. If we lower freq, the
> util will go up, which would result in us bumping the freq again, etc..
Something like the completely untested below should maybe work.
Rafael?
---
arch/x86/include/asm/topology.h | 19 +++++++++++++++++++
arch/x86/kernel/smpboot.c | 24 ++++++++++++++++++++++++
kernel/sched/core.c | 1 +
kernel/sched/sched.h | 7 +++++++
4 files changed, 51 insertions(+)
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 7f991bd5031b..af7b7259db94 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -146,4 +146,23 @@ struct pci_bus;
int x86_pci_root_bus_node(int bus);
void x86_pci_root_bus_resources(int bus, struct list_head *resources);
+#ifdef CONFIG_SMP
+
+#define arch_scale_freq_tick arch_scale_freq_tick
+#define arch_scale_freq_capacity arch_scale_freq_capacity
+
+DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
+
+static inline unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+{
+ if (static_cpu_has(X86_FEATURE_APERFMPERF))
+ return per_cpu(arch_cpu_freq, cpu);
+ else
+ return SCHED_CAPACITY_SCALE;
+}
+
+extern void arch_scale_freq_tick(void);
+
+#endif
+
#endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 3bf1e0b5f827..7d459577ee44 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1647,3 +1647,27 @@ void native_play_dead(void)
}
#endif
+
+static DEFINE_PER_CPU(u64, arch_prev_aperf);
+static DEFINE_PER_CPU(u64, arch_prev_mperf);
+DEFINE_PER_CPU(unsigned long, arch_cpu_freq);
+
+void arch_scale_freq_tick(void)
+{
+ u64 aperf, mperf;
+ u64 acnt, mcnt;
+
+ if (!static_cpu_has(X86_FEATURE_APERFMPERF))
+ return;
+
+ aperf = rdmsrl(MSR_IA32_APERF);
+ mperf = rdmsrl(MSR_IA32_APERF);
+
+ acnt = aperf - this_cpu_read(arch_prev_aperf);
+ mcnt = mperf - this_cpu_read(arch_prev_mperf);
+
+ this_cpu_write(arch_prev_aperf, aperf);
+ this_cpu_write(arch_prev_mperf, mperf);
+
+ this_cpu_write(arch_cpu_freq, div64_u64(acnt * SCHED_CAPACITY_SCALE, mcnt));
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 96e323b26ea9..35dbf909afb2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2901,6 +2901,7 @@ void scheduler_tick(void)
struct rq *rq = cpu_rq(cpu);
struct task_struct *curr = rq->curr;
+ arch_scale_freq_tick();
sched_clock_tick();
raw_spin_lock(&rq->lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index baa32075f98e..c3825c920e3f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1408,6 +1408,13 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
}
#endif
+#ifndef arch_scale_freq_tick
+static __always_inline
+void arch_scale_freq_tick(void)
+{
+}
+#endif
+
#ifndef arch_scale_cpu_capacity
static __always_inline
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
On Thu, Mar 03, 2016 at 05:24:32PM +0100, Rafael J. Wysocki wrote:
> On Thu, Mar 3, 2016 at 1:20 PM, Peter Zijlstra <[email protected]> wrote:
> > On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote:
> >> >>> + min_f = sg_policy->policy->cpuinfo.min_freq;
> >> >>> + max_f = sg_policy->policy->cpuinfo.max_freq;
> >> >>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
> >
> >> In case a more formal derivation of this formula is needed, it is
> >> based on the following 3 assumptions:
> >>
> >> (1) Performance is a linear function of frequency.
> >> (2) Required performance is a linear function of the utilization ratio
> >> x = util/max as provided by the scheduler (0 <= x <= 1).
> >
> >> (3) The minimum possible frequency (min_freq) corresponds to x = 0 and
> >> the maximum possible frequency (max_freq) corresponds to x = 1.
> >>
> >> (1) and (2) combined imply that
> >>
> >> f = a * x + b
> >>
> >> (f - frequency, a, b - constants to be determined) and then (3) quite
> >> trivially leads to b = min_freq and a = max_freq - min_freq.
> >
> > 3 is the problem, that just doesn't make sense and is probably the
> > reason why you see very little selection of the min freq.
>
> It is about mapping the entire [0,1] interval to the available frequency range.
Yeah, but I don't see why that makes sense..
> I till overprovision things (the smaller x the more), but then it may
> help the race-to-idle a bit in theory.
So, since we also have the cpuidle information, could we not make a
better guess at race-to-idle?
> > Suppose a machine with the following frequencies:
> >
> > 500, 750, 1000
> >
> > And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) =
> > 700 make any sense? Per your point 1, it should should be asking for
> > 0.4 * 1000 = 400.
> >
> > Because, per 1, at 500 it runs exactly half as fast as at 1000, and we
> > only need 0.4 times as much. Therefore 500 is more than sufficient.
>
> OK, but then I don't see why this reasoning only applies to the lower
> bound of the frequency range. Is there any reason why x = 1 should be
> the only point mapping to max_freq?
Well, everything that goes over the second to last freq would end up at
the last (max) freq.
Take again the 500,750,1000 example, everything that's >750 would end up
at 1000 (for relation_l, >875 for _c).
But given the platform's cpuidle information, maybe coupled with an avg
idle est, we can compute the benefit of race-to-idle and over provision
based on that, right?
> If not, then I think it's reasonable to map the middle of the
> available frequency range to x = 0.5 and then we have b = 0 and a =
> (max_freq + min_freq) / 2.
So I really think that approach falls apart on the low-util bits; you
effectively always run above min speed, even if min is already vastly
overprovisioned.
On Thu, Mar 03, 2016 at 05:28:29PM +0100, Peter Zijlstra wrote:
> +void arch_scale_freq_tick(void)
> +{
> + u64 aperf, mperf;
> + u64 acnt, mcnt;
> +
> + if (!static_cpu_has(X86_FEATURE_APERFMPERF))
> + return;
> +
> + aperf = rdmsrl(MSR_IA32_APERF);
> + mperf = rdmsrl(MSR_IA32_APERF);
Actually reading MPERF increases the chances of this working.
> +
> + acnt = aperf - this_cpu_read(arch_prev_aperf);
> + mcnt = mperf - this_cpu_read(arch_prev_mperf);
> +
> + this_cpu_write(arch_prev_aperf, aperf);
> + this_cpu_write(arch_prev_mperf, mperf);
> +
> + this_cpu_write(arch_cpu_freq, div64_u64(acnt * SCHED_CAPACITY_SCALE, mcnt));
> +}
On Thu, Mar 03, 2016 at 05:37:35PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 03, 2016 at 05:24:32PM +0100, Rafael J. Wysocki wrote:
> > >> f = a * x + b
> > If not, then I think it's reasonable to map the middle of the
> > available frequency range to x = 0.5 and then we have b = 0 and a =
> > (max_freq + min_freq) / 2.
>
> So I really think that approach falls apart on the low util bits, you
> effectively always run above min speed, even if min is already vstly
> over provisioned.
Ah nevermind, I cannot read. Yes that is worth trying I suppose. But the
b=0,a=1 thing seems more natural still.
On Thu, Mar 03, 2016 at 04:55:44PM +0000, Juri Lelli wrote:
> On 03/03/16 17:37, Peter Zijlstra wrote:
> > But given the platform's cpuidle information, maybe coupled with an avg
> > idle est, we can compute the benefit of race-to-idle and over provision
> > based on that, right?
> >
>
> Shouldn't this kind of considerations be a scheduler thing? I'm not
> really getting why we want to put more "intelligence" in a new governor.
> Also, if I understand Ingo's point correctly, I think we want to make
> this kind of policy decisions inside the scheduler.
Well sure, put it in kernel/sched/cpufreq.c or wherever. My point was
more that we don't have to guess/hardcode race-to-idle assumptions but
can actually calculate some of that.
On 03/03/16 17:37, Peter Zijlstra wrote:
> On Thu, Mar 03, 2016 at 05:24:32PM +0100, Rafael J. Wysocki wrote:
> > On Thu, Mar 3, 2016 at 1:20 PM, Peter Zijlstra <[email protected]> wrote:
> > > On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote:
> > >> >>> + min_f = sg_policy->policy->cpuinfo.min_freq;
> > >> >>> + max_f = sg_policy->policy->cpuinfo.max_freq;
> > >> >>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
> > >
> > >> In case a more formal derivation of this formula is needed, it is
> > >> based on the following 3 assumptions:
> > >>
> > >> (1) Performance is a linear function of frequency.
> > >> (2) Required performance is a linear function of the utilization ratio
> > >> x = util/max as provided by the scheduler (0 <= x <= 1).
> > >
> > >> (3) The minimum possible frequency (min_freq) corresponds to x = 0 and
> > >> the maximum possible frequency (max_freq) corresponds to x = 1.
> > >>
> > >> (1) and (2) combined imply that
> > >>
> > >> f = a * x + b
> > >>
> > >> (f - frequency, a, b - constants to be determined) and then (3) quite
> > >> trivially leads to b = min_freq and a = max_freq - min_freq.
> > >
> > > 3 is the problem, that just doesn't make sense and is probably the
> > > reason why you see very little selection of the min freq.
> >
> > It is about mapping the entire [0,1] interval to the available frequency range.
>
> Yeah, but I don't see why that makes sense..
>
> > It will overprovision things (the smaller x, the more), but then it may
> > help the race-to-idle a bit in theory.
>
> So, since we also have the cpuidle information, could we not make a
> better guess at race-to-idle?
>
> > > Suppose a machine with the following frequencies:
> > >
> > > 500, 750, 1000
> > >
> > > And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) =
> > > 700 make any sense? Per your point 1, it should be asking for
> > > 0.4 * 1000 = 400.
> > >
> > > Because, per 1, at 500 it runs exactly half as fast as at 1000, and we
> > > only need 0.4 times as much. Therefore 500 is more than sufficient.
> >
> > OK, but then I don't see why this reasoning only applies to the lower
> > bound of the frequency range. Is there any reason why x = 1 should be
> > the only point mapping to max_freq?
>
> Well, everything that goes over the second to last freq would end up at
> the last (max) freq.
>
> Take again the 500,750,1000 example, everything that's >750 would end up
> at 1000 (for relation_l, >875 for _c).
>
> But given the platform's cpuidle information, maybe coupled with an avg
> idle est, we can compute the benefit of race-to-idle and over provision
> based on that, right?
>
Shouldn't this kind of consideration be a scheduler thing? I'm not
really getting why we want to put more "intelligence" in a new governor.
Also, if I understand Ingo's point correctly, I think we want to make
this kind of policy decisions inside the scheduler.
On 03/03/16 17:56, Peter Zijlstra wrote:
> On Thu, Mar 03, 2016 at 04:55:44PM +0000, Juri Lelli wrote:
> > On 03/03/16 17:37, Peter Zijlstra wrote:
> > > But given the platform's cpuidle information, maybe coupled with an avg
> > > idle est, we can compute the benefit of race-to-idle and over provision
> > > based on that, right?
> > >
> >
> > Shouldn't this kind of consideration be a scheduler thing? I'm not
> > really getting why we want to put more "intelligence" in a new governor.
> > Also, if I understand Ingo's point correctly, I think we want to make
> > this kind of policy decisions inside the scheduler.
>
> Well sure, put it in kernel/sched/cpufreq.c or wherever. My point was
> more that we don't have to guess/hardcode race-to-idle assumptions but
> can actually calculate some of that.
>
Right, thanks for clarifying!
On 03/03/16 16:28, Peter Zijlstra wrote:
> On Thu, Mar 03, 2016 at 04:38:17PM +0100, Peter Zijlstra wrote:
>> On Thu, Mar 03, 2016 at 03:01:15PM +0100, Vincent Guittot wrote:
>>>> In case a more formal derivation of this formula is needed, it is
>>>> based on the following 3 assumptions:
>>>>
>>>> (1) Performance is a linear function of frequency.
>>>> (2) Required performance is a linear function of the utilization ratio
>>>> x = util/max as provided by the scheduler (0 <= x <= 1).
>>>
>>> Just to mention that the utilization that you are using, varies with
>>> the frequency which add another variable in your equation
>>
>> Right, x86 hasn't implemented arch_scale_freq_capacity(), so the
>> utilization values we use are all over the map. If we lower freq, the
>> util will go up, which would result in us bumping the freq again, etc..
>
> Something like the completely untested below should maybe work.
>
> Rafael?
>
[...]
> +void arch_scale_freq_tick(void)
> +{
> + u64 aperf, mperf;
> + u64 acnt, mcnt;
> +
> + if (!static_cpu_has(X86_FEATURE_APERFMPERF))
> + return;
> +
> + aperf = rdmsrl(MSR_IA32_APERF);
> + mperf = rdmsrl(MSR_IA32_APERF);
> +
> + acnt = aperf - this_cpu_read(arch_prev_aperf);
> + mcnt = mperf - this_cpu_read(arch_prev_mperf);
> +
> + this_cpu_write(arch_prev_aperf, aperf);
> + this_cpu_write(arch_prev_mperf, mperf);
> +
> + this_cpu_write(arch_cpu_freq, div64_u64(acnt * SCHED_CAPACITY_SCALE, mcnt));
Wasn't there the problem that this ratio goes to zero if the cpu is idle
in the old power estimation approach on x86?
[...]
On Thu, Mar 03, 2016 at 05:28:55PM +0000, Dietmar Eggemann wrote:
> > +void arch_scale_freq_tick(void)
> > +{
> > + u64 aperf, mperf;
> > + u64 acnt, mcnt;
> > +
> > + if (!static_cpu_has(X86_FEATURE_APERFMPERF))
> > + return;
> > +
> > + aperf = rdmsrl(MSR_IA32_APERF);
> > + mperf = rdmsrl(MSR_IA32_APERF);
> > +
> > + acnt = aperf - this_cpu_read(arch_prev_aperf);
> > + mcnt = mperf - this_cpu_read(arch_prev_mperf);
> > +
> > + this_cpu_write(arch_prev_aperf, aperf);
> > + this_cpu_write(arch_prev_mperf, mperf);
> > +
> > + this_cpu_write(arch_cpu_freq, div64_u64(acnt * SCHED_CAPACITY_SCALE, mcnt));
>
> Wasn't there the problem that this ratio goes to zero if the cpu is idle
> in the old power estimation approach on x86?
Yeah, there was something funky.
SDM says they only count in C0 (ie. !idle), so it _should_ work.
On Thu, Mar 3, 2016 at 5:28 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Mar 03, 2016 at 04:38:17PM +0100, Peter Zijlstra wrote:
>> On Thu, Mar 03, 2016 at 03:01:15PM +0100, Vincent Guittot wrote:
>> > > In case a more formal derivation of this formula is needed, it is
>> > > based on the following 3 assumptions:
>> > >
>> > > (1) Performance is a linear function of frequency.
>> > > (2) Required performance is a linear function of the utilization ratio
>> > > x = util/max as provided by the scheduler (0 <= x <= 1).
>> >
>> > Just to mention that the utilization that you are using, varies with
>> > the frequency which add another variable in your equation
>>
>> Right, x86 hasn't implemented arch_scale_freq_capacity(), so the
>> utilization values we use are all over the map. If we lower freq, the
>> util will go up, which would result in us bumping the freq again, etc..
>
> Something like the completely untested below should maybe work.
>
> Rafael?
It looks reasonable (modulo the MPERF reading typo you've noticed),
but can we get back to that later?
I'll first try to address Ingo's feedback (which I hope I
understood correctly) and some other comments people had and resend
the series.
> ---
> arch/x86/include/asm/topology.h | 19 +++++++++++++++++++
> arch/x86/kernel/smpboot.c | 24 ++++++++++++++++++++++++
> kernel/sched/core.c | 1 +
> kernel/sched/sched.h | 7 +++++++
> 4 files changed, 51 insertions(+)
>
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index 7f991bd5031b..af7b7259db94 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -146,4 +146,23 @@ struct pci_bus;
> int x86_pci_root_bus_node(int bus);
> void x86_pci_root_bus_resources(int bus, struct list_head *resources);
>
> +#ifdef CONFIG_SMP
> +
> +#define arch_scale_freq_tick arch_scale_freq_tick
> +#define arch_scale_freq_capacity arch_scale_freq_capacity
> +
> +DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
> +
> +static inline unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> +{
> + if (static_cpu_has(X86_FEATURE_APERFMPERF))
> + return per_cpu(arch_cpu_freq, cpu);
> + else
> + return SCHED_CAPACITY_SCALE;
> +}
> +
> +extern void arch_scale_freq_tick(void);
> +
> +#endif
> +
> #endif /* _ASM_X86_TOPOLOGY_H */
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 3bf1e0b5f827..7d459577ee44 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1647,3 +1647,27 @@ void native_play_dead(void)
> }
>
> #endif
> +
> +static DEFINE_PER_CPU(u64, arch_prev_aperf);
> +static DEFINE_PER_CPU(u64, arch_prev_mperf);
> +DEFINE_PER_CPU(unsigned long, arch_cpu_freq);
> +
> +void arch_scale_freq_tick(void)
> +{
> + u64 aperf, mperf;
> + u64 acnt, mcnt;
> +
> + if (!static_cpu_has(X86_FEATURE_APERFMPERF))
> + return;
> +
> + aperf = rdmsrl(MSR_IA32_APERF);
> + mperf = rdmsrl(MSR_IA32_APERF);
> +
> + acnt = aperf - this_cpu_read(arch_prev_aperf);
> + mcnt = mperf - this_cpu_read(arch_prev_mperf);
> +
> + this_cpu_write(arch_prev_aperf, aperf);
> + this_cpu_write(arch_prev_mperf, mperf);
> +
> + this_cpu_write(arch_cpu_freq, div64_u64(acnt * SCHED_CAPACITY_SCALE, mcnt));
> +}
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 96e323b26ea9..35dbf909afb2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2901,6 +2901,7 @@ void scheduler_tick(void)
> struct rq *rq = cpu_rq(cpu);
> struct task_struct *curr = rq->curr;
>
> + arch_scale_freq_tick();
> sched_clock_tick();
>
> raw_spin_lock(&rq->lock);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index baa32075f98e..c3825c920e3f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1408,6 +1408,13 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> }
> #endif
>
> +#ifndef arch_scale_freq_tick
> +static __always_inline
> +void arch_scale_freq_tick(void)
> +{
> +}
> +#endif
> +
> #ifndef arch_scale_cpu_capacity
> static __always_inline
> unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
On 03/03/16 18:26, Peter Zijlstra wrote:
> On Thu, Mar 03, 2016 at 05:28:55PM +0000, Dietmar Eggemann wrote:
>>> +void arch_scale_freq_tick(void)
>>> +{
>>> + u64 aperf, mperf;
>>> + u64 acnt, mcnt;
>>> +
>>> + if (!static_cpu_has(X86_FEATURE_APERFMPERF))
>>> + return;
>>> +
>>> + aperf = rdmsrl(MSR_IA32_APERF);
>>> + mperf = rdmsrl(MSR_IA32_APERF);
>>> +
>>> + acnt = aperf - this_cpu_read(arch_prev_aperf);
>>> + mcnt = mperf - this_cpu_read(arch_prev_mperf);
>>> +
>>> + this_cpu_write(arch_prev_aperf, aperf);
>>> + this_cpu_write(arch_prev_mperf, mperf);
>>> +
>>> + this_cpu_write(arch_cpu_freq, div64_u64(acnt * SCHED_CAPACITY_SCALE, mcnt));
>>
>> Wasn't there the problem that this ratio goes to zero if the cpu is idle
>> in the old power estimation approach on x86?
>
> Yeah, there was something funky.
>
> SDM says they only count in C0 (ie. !idle), so it _should_ work.
I see, back then the problem was 0 capacity in idle, but this is about
frequency.
On Thu, Mar 3, 2016 at 6:53 AM, Viresh Kumar <[email protected]> wrote:
> On 02-03-16, 03:08, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <[email protected]>
>>
>> In addition to fields representing governor tunables, struct dbs_data
>> contains some fields needed for the management of objects of that
>> type. As it turns out, that part of struct dbs_data may be shared
>> with (future) governors that won't use the common code used by
>> "ondemand" and "conservative", so move it to a separate struct type
>> and modify the code using struct dbs_data to follow.
>>
>> Signed-off-by: Rafael J. Wysocki <[email protected]>
>> ---
>> drivers/cpufreq/cpufreq_conservative.c | 15 +++--
>> drivers/cpufreq/cpufreq_governor.c | 90 ++++++++++++++++++++-------------
>> drivers/cpufreq/cpufreq_governor.h | 36 +++++++------
>> drivers/cpufreq/cpufreq_ondemand.c | 19 ++++--
>> 4 files changed, 97 insertions(+), 63 deletions(-)
>>
>> Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
>> ===================================================================
>> --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
>> +++ linux-pm/drivers/cpufreq/cpufreq_governor.h
>> @@ -41,6 +41,13 @@
>> /* Ondemand Sampling types */
>> enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};
>>
>> +struct gov_tunables {
>> + struct kobject kobj;
>> + struct list_head policy_list;
>> + struct mutex update_lock;
>> + int usage_count;
>> +};
>
> Everything else looks fine, but I don't think that you have named it
> properly. Everything else present in struct dbs_data is a tunable,
> but not this one, so gov_tunables doesn't suit at all here.
So this is totally a bike-shed argument, which makes it seriously
irritating.
Does it really matter so much what this structure is called?
Essentially, it is something to build your tunables structure around,
and you can treat it as the counterpart of a C++ abstract class. So the
name *does* make sense in that context.
That said, what about gov_attr_set?
Thanks,
Rafael
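To illustrate the "abstract class" idea, a governor-specific tunables
structure could be built around it roughly like this (hypothetical example
using the proposed gov_attr_set name; the rate_limit_us field is made up):

struct my_gov_tunables {
	struct gov_attr_set attr_set;	/* shared kobject, policy list, lock, refcount */
	unsigned int rate_limit_us;	/* governor-specific tunable */
};

static inline struct my_gov_tunables *to_my_tunables(struct gov_attr_set *attr_set)
{
	return container_of(attr_set, struct my_gov_tunables, attr_set);
}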
On Thu, Mar 3, 2016 at 12:18 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Mar 02, 2016 at 03:12:33AM +0100, Rafael J. Wysocki wrote:
>> +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
>> + unsigned int target_freq, unsigned int relation)
>> +{
>> + unsigned int freq;
>> +
>> + if (target_freq == policy->cur)
>> + return;
>
> But what if relation is different from last time? ;-)
Doh
Never mind, I'll drop this check (that said, this mistake is present
elsewhere too, IIRC).
On 03/03/2016 05:07 AM, Vincent Guittot wrote:
> I mainly want to prevent any useless and periodic frequency switch
> because of a utilization that changes with the current frequency (if
> frequency invariance is not used) and that can make the formula
> select another frequency than the current one. That's what I can see
> when testing it.
>
> Sorry for the late reply, I was trying to do some tests on my board but
> was facing a crash issue (not linked with your patchset). So I have
> done some tests and I can see such unstable behavior. I have generated
> a load of 33% at max frequency (3 ms runs every 9 ms) and I can see the
> frequency toggling without any good reason. That said, I can see a
> similar thing with ondemand.
FWIW I ran some performance numbers on my chromebook 2. Initially I
forgot to bring in the frequency invariance support but that yielded an
opportunity to see the impact of it.
The tests below consist of a periodic workload. The OH (overhead)
numbers show how close the workload got to running as slow as fmin (100%
= as slow as powersave gov, 0% = as fast as perf gov). The OR (overrun)
number is the count of instances where the busy work exceeded the period.
First a comparison of schedutil with and without frequency invariance.
Run and period are in milliseconds.
scu (no inv) scu (w/inv)
run period busy % OR OH OR OH
1 100 1.00% 0 79.72% 0 95.86%
10 1000 1.00% 0 24.52% 0 71.61%
1 10 10.00% 0 21.25% 0 41.78%
10 100 10.00% 0 26.06% 0 47.96%
100 1000 10.00% 0 6.36% 0 26.03%
6 33 18.18% 0 15.67% 0 31.61%
66 333 19.82% 0 8.94% 0 29.46%
4 10 40.00% 0 6.26% 0 12.93%
40 100 40.00% 0 6.93% 2 14.08%
400 1000 40.00% 0 1.65% 0 11.58%
5 9 55.56% 0 3.70% 0 7.70%
50 90 55.56% 1 4.19% 6 8.06%
500 900 55.56% 0 1.35% 5 6.94%
9 12 75.00% 0 1.60% 56 3.59%
90 120 75.00% 0 1.88% 21 3.94%
900 1200 75.00% 0 0.73% 4 4.41%
Frequency invariance causes schedutil overhead to increase noticeably. I
haven't dug into traces or anything yet. Perhaps this is due to the
algorithm overshooting and then overcorrecting, etc.; I do not yet know.
Here is a comparison, with frequency invariance, of ondemand and
interactive with schedfreq and schedutil. The first two columns (run and
period) are omitted so the table will fit.
ondemand interactive schedfreq schedutil
busy % OR OH OR OH OR OH OR OH
1.00% 0 68.96% 0 100.04% 0 78.49% 0 95.86%
1.00% 0 25.04% 0 22.59% 0 72.56% 0 71.61%
10.00% 0 21.75% 0 63.08% 0 52.40% 0 41.78%
10.00% 0 12.17% 0 14.41% 0 17.33% 0 47.96%
10.00% 0 2.57% 0 2.17% 0 0.29% 0 26.03%
18.18% 0 12.39% 0 9.39% 0 17.34% 0 31.61%
19.82% 0 3.74% 0 3.42% 0 12.26% 0 29.46%
40.00% 2 6.26% 1 12.23% 0 6.15% 0 12.93%
40.00% 0 0.47% 0 0.05% 0 2.68% 2 14.08%
40.00% 0 0.60% 0 0.50% 0 1.22% 0 11.58%
55.56% 2 4.25% 5 5.97% 0 2.51% 0 7.70%
55.56% 0 1.89% 0 0.04% 0 1.71% 6 8.06%
55.56% 0 0.50% 0 0.47% 0 1.82% 5 6.94%
75.00% 2 1.65% 1 0.46% 0 0.26% 56 3.59%
75.00% 0 1.68% 0 0.05% 0 0.49% 21 3.94%
75.00% 0 0.28% 0 0.23% 0 0.62% 4 4.41%
Aside from the 2nd and 3rd tests schedutil is showing decreased
performance across the board. The fifth test is particularly bad.
The catch is that I do not have power numbers to go with this data, as
I'm not currently equipped to gather them. So more analysis is
definitely needed to capture the full story.
thanks,
Steve
On Thu, Mar 3, 2016 at 9:06 PM, Steve Muckle <[email protected]> wrote:
> On 03/03/2016 05:07 AM, Vincent Guittot wrote:
>> I mainly want to prevent any useless and periodic frequency switch
>> because of a utilization that changes with the current frequency (if
>> frequency invariance is not used) and that can make the formula
>> select another frequency than the current one. That's what I can see
>> when testing it.
>>
>> Sorry for the late reply, I was trying to do some tests on my board but
>> was facing a crash issue (not linked with your patchset). So I have
>> done some tests and I can see such unstable behavior. I have generated
>> a load of 33% at max frequency (3 ms runs every 9 ms) and I can see the
>> frequency toggling without any good reason. That said, I can see a
>> similar thing with ondemand.
>
> FWIW I ran some performance numbers on my chromebook 2. Initially I
> forgot to bring in the frequency invariance support but that yielded an
> opportunity to see the impact of it.
>
> The tests below consist of a periodic workload. The OH (overhead)
> numbers show how close the workload got to running as slow as fmin (100%
> = as slow as powersave gov, 0% = as fast as perf gov). The OR (overrun)
> number is the count of instances where the busy work exceeded the period.
>
> First a comparison of schedutil with and without frequency invariance.
> Run and period are in milliseconds.
>
> scu (no inv) scu (w/inv)
> run period busy % OR OH OR OH
> 1 100 1.00% 0 79.72% 0 95.86%
> 10 1000 1.00% 0 24.52% 0 71.61%
> 1 10 10.00% 0 21.25% 0 41.78%
> 10 100 10.00% 0 26.06% 0 47.96%
> 100 1000 10.00% 0 6.36% 0 26.03%
> 6 33 18.18% 0 15.67% 0 31.61%
> 66 333 19.82% 0 8.94% 0 29.46%
> 4 10 40.00% 0 6.26% 0 12.93%
> 40 100 40.00% 0 6.93% 2 14.08%
> 400 1000 40.00% 0 1.65% 0 11.58%
> 5 9 55.56% 0 3.70% 0 7.70%
> 50 90 55.56% 1 4.19% 6 8.06%
> 500 900 55.56% 0 1.35% 5 6.94%
> 9 12 75.00% 0 1.60% 56 3.59%
> 90 120 75.00% 0 1.88% 21 3.94%
> 900 1200 75.00% 0 0.73% 4 4.41%
>
> Frequency invariance causes schedutil overhead to increase noticeably. I
> haven't dug into traces or anything yet. Perhaps this is due to the
> algorithm overshooting and then overcorrecting, etc.; I do not yet know.
So as I said, the formula I used didn't take invariance into account,
so that's quite as expected.
> Here is a comparison, with frequency invariance, of ondemand and
> interactive with schedfreq and schedutil. The first two columns (run and
> period) are omitted so the table will fit.
>
> ondemand interactive schedfreq schedutil
> busy % OR OH OR OH OR OH OR OH
> 1.00% 0 68.96% 0 100.04% 0 78.49% 0 95.86%
> 1.00% 0 25.04% 0 22.59% 0 72.56% 0 71.61%
> 10.00% 0 21.75% 0 63.08% 0 52.40% 0 41.78%
> 10.00% 0 12.17% 0 14.41% 0 17.33% 0 47.96%
> 10.00% 0 2.57% 0 2.17% 0 0.29% 0 26.03%
> 18.18% 0 12.39% 0 9.39% 0 17.34% 0 31.61%
> 19.82% 0 3.74% 0 3.42% 0 12.26% 0 29.46%
> 40.00% 2 6.26% 1 12.23% 0 6.15% 0 12.93%
> 40.00% 0 0.47% 0 0.05% 0 2.68% 2 14.08%
> 40.00% 0 0.60% 0 0.50% 0 1.22% 0 11.58%
> 55.56% 2 4.25% 5 5.97% 0 2.51% 0 7.70%
> 55.56% 0 1.89% 0 0.04% 0 1.71% 6 8.06%
> 55.56% 0 0.50% 0 0.47% 0 1.82% 5 6.94%
> 75.00% 2 1.65% 1 0.46% 0 0.26% 56 3.59%
> 75.00% 0 1.68% 0 0.05% 0 0.49% 21 3.94%
> 75.00% 0 0.28% 0 0.23% 0 0.62% 4 4.41%
>
> Aside from the 2nd and 3rd tests schedutil is showing decreased
> performance across the board. The fifth test is particularly bad.
I guess you mean performance in terms of the overhead?
Thanks,
Rafael
On Thu, Mar 3, 2016 at 12:16 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Mar 02, 2016 at 03:12:33AM +0100, Rafael J. Wysocki wrote:
>> The most important change from the previous version is that the
>> ->fast_switch() callback takes an additional "relation" argument
>> and now the governor can use it to choose a selection method.
>
>> +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
>> + unsigned int target_freq,
>> + unsigned int relation)
>
> Would it make sense to replace the {target_freq, relation} pair with
> something like the CPPC {min_freq, max_freq} pair?
Yes, it would in general, but since I use __cpufreq_driver_target() in
the "slow driver" case, that would need to be reworked too for
consistency. So I'd prefer to do that later.
> Then you could use the closest frequency to max provided it is larger
> than min.
>
> This communicates more actual information in the same number of
> parameters and would thereby allow for a more flexible (better)
> frequency selection.
Agreed.
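For illustration, with a {min_freq, max_freq} pair the driver-side selection
could look roughly like the helper below (hypothetical code, not from any
posted patch): pick the highest available frequency not exceeding max_freq,
provided it is at least min_freq.

static unsigned int pick_freq_in_range(struct cpufreq_frequency_table *table,
				       unsigned int min_freq,
				       unsigned int max_freq)
{
	struct cpufreq_frequency_table *pos;
	unsigned int best = 0;

	cpufreq_for_each_valid_entry(pos, table) {
		if (pos->frequency > max_freq)
			continue;
		if (pos->frequency > best)
			best = pos->frequency;
	}

	/* 0 means nothing in the table fits the requested range. */
	return best >= min_freq ? best : 0;
}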
On Thu, Mar 03, 2016 at 09:56:40PM +0100, Rafael J. Wysocki wrote:
> On Thu, Mar 3, 2016 at 12:16 PM, Peter Zijlstra <[email protected]> wrote:
> > On Wed, Mar 02, 2016 at 03:12:33AM +0100, Rafael J. Wysocki wrote:
> >> The most important change from the previous version is that the
> >> ->fast_switch() callback takes an additional "relation" argument
> >> and now the governor can use it to choose a selection method.
> >
> >> +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
> >> + unsigned int target_freq,
> >> + unsigned int relation)
> >
> > Would it make sense to replace the {target_freq, relation} pair with
> > something like the CPPC {min_freq, max_freq} pair?
>
> Yes, it would in general, but since I use __cpufreq_driver_target() in
> the "slow driver" case, that would need to be reworked too for
> consistency. So I'd prefer to do that later.
OK, fair enough.
On 03/03/2016 12:20 PM, Rafael J. Wysocki wrote:
>> Here is a comparison, with frequency invariance, of ondemand and
>> interactive with schedfreq and schedutil. The first two columns (run and
>> period) are omitted so the table will fit.
>>
>> ondemand interactive schedfreq schedutil
>> busy % OR OH OR OH OR OH OR OH
>> 1.00% 0 68.96% 0 100.04% 0 78.49% 0 95.86%
>> 1.00% 0 25.04% 0 22.59% 0 72.56% 0 71.61%
>> 10.00% 0 21.75% 0 63.08% 0 52.40% 0 41.78%
>> 10.00% 0 12.17% 0 14.41% 0 17.33% 0 47.96%
>> 10.00% 0 2.57% 0 2.17% 0 0.29% 0 26.03%
>> 18.18% 0 12.39% 0 9.39% 0 17.34% 0 31.61%
>> 19.82% 0 3.74% 0 3.42% 0 12.26% 0 29.46%
>> 40.00% 2 6.26% 1 12.23% 0 6.15% 0 12.93%
>> 40.00% 0 0.47% 0 0.05% 0 2.68% 2 14.08%
>> 40.00% 0 0.60% 0 0.50% 0 1.22% 0 11.58%
>> 55.56% 2 4.25% 5 5.97% 0 2.51% 0 7.70%
>> 55.56% 0 1.89% 0 0.04% 0 1.71% 6 8.06%
>> 55.56% 0 0.50% 0 0.47% 0 1.82% 5 6.94%
>> 75.00% 2 1.65% 1 0.46% 0 0.26% 56 3.59%
>> 75.00% 0 1.68% 0 0.05% 0 0.49% 21 3.94%
>> 75.00% 0 0.28% 0 0.23% 0 0.62% 4 4.41%
>>
>> Aside from the 2nd and 3rd tests schedutil is showing decreased
>> performance across the board. The fifth test is particularly bad.
>
> I guess you mean performance in terms of the overhead?
Correct. This overhead metric describes how fast the workload completes,
with 0% equaling the perf governor and 100% equaling the powersave
governor. So it's a reflection of general performance using the
governor. It's called "overhead" I imagine (the metric predates my
involvement) as it is something introduced/caused by the policy of the
governor.
thanks,
Steve
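Reading that description literally (the thread does not spell out the exact
formula, so this is only an assumption), the overhead would amount to a
linear interpolation between the two reference governors:

    overhead = (t_gov - t_perf) / (t_powersave - t_perf) * 100%

where t_gov, t_perf and t_powersave are the completion times of the workload
under the governor being tested, the performance governor and the powersave
governor, respectively.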
On Thu, Mar 3, 2016 at 5:47 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Mar 03, 2016 at 05:37:35PM +0100, Peter Zijlstra wrote:
>> On Thu, Mar 03, 2016 at 05:24:32PM +0100, Rafael J. Wysocki wrote:
>> > >> f = a * x + b
>
>> > If not, then I think it's reasonable to map the middle of the
>> > available frequency range to x = 0.5 and then we have b = 0 and a =
>> > (max_freq + min_freq) / 2.
That actually should be a = max_freq + min_freq, because I want
(max_freq + min_freq) / 2 = a / 2.
>> So I really think that approach falls apart on the low util bits, you
>> effectively always run above min speed, even if min is already vastly
>> overprovisioned.
>
> Ah nevermind, I cannot read. Yes that is worth trying I suppose. But the
> b=0,a=1 thing seems more natural still.
It is somewhat imbalanced, though. If all of the values of x are
equally probable (or equally frequent), the probability of running
above the middle frequency is lower than the probability of running
below it.
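To put numbers on that, using the earlier 500/750/1000 example (purely
illustrative): with f = x * max_freq, the middle frequency corresponds to

    x_mid = (min_freq + max_freq) / (2 * max_freq) = (500 + 1000) / 2000 = 0.75

so uniformly distributed values of x exceed the middle frequency only 25% of
the time, whereas with f = min_freq + x * (max_freq - min_freq) they do so
50% of the time.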
On Thu, Mar 3, 2016 at 7:00 AM, Viresh Kumar <[email protected]> wrote:
> On 02-03-16, 03:12, Rafael J. Wysocki wrote:
>> Index: linux-pm/drivers/cpufreq/cpufreq.c
>> ===================================================================
>> --- linux-pm.orig/drivers/cpufreq/cpufreq.c
>> +++ linux-pm/drivers/cpufreq/cpufreq.c
>> @@ -1772,6 +1772,39 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie
>> * GOVERNORS *
>> *********************************************************************/
>>
>> +/**
>> + * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch.
>> + * @policy: cpufreq policy to switch the frequency for.
>> + * @target_freq: New frequency to set (may be approximate).
>> + * @relation: Relation to use for frequency selection.
>> + *
>> + * Carry out a fast frequency switch from interrupt context.
>> + *
>> + * This function must not be called if policy->fast_switch_possible is unset.
>> + *
>> + * Governors calling this function must guarantee that it will never be invoked
>> + * twice in parallel for the same policy and that it will never be called in
>> + * parallel with either ->target() or ->target_index() for the same policy.
>> + *
>> + * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch()
>> + * callback, the hardware configuration must be preserved.
>> + */
>> +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
>> + unsigned int target_freq, unsigned int relation)
>> +{
>> + unsigned int freq;
>> +
>> + if (target_freq == policy->cur)
>
> Maybe an unlikely() here ?
>
>> + return;
>> +
>> + freq = cpufreq_driver->fast_switch(policy, target_freq, relation);
>> + if (freq != CPUFREQ_ENTRY_INVALID) {
>> + policy->cur = freq;
>
> Hmm.. What will happen to the code relying on the cpufreq-notifiers
> now ?
It will have a problem.
For that code it's like the CPU changing the frequency and not telling
it (which is not unusual for that matter).
Thanks,
Rafael
From: Rafael J. Wysocki <[email protected]>
A subsequent change set will introduce a new cpufreq governor using
CPU utilization information from the scheduler, so introduce
cpufreq_update_util() (again) to allow that information to be passed to
the new governor and make cpufreq_trigger_update() call it internally.
To that end, add a new ->update_util callback pointer to struct
freq_update_hook to be set by entities that want to use the util
and max arguments and make cpufreq_update_util() use that callback
if available or the ->func callback that only takes the time argument
otherwise.
In addition to that, arrange helpers to set/clear the utilization
update hooks in such a way that the full ->update_util callbacks
can only be set by code inside the kernel/sched/ directory.
Update the current users of cpufreq_set_freq_update_hook() to use
the new helpers.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
New patch. Maybe slightly over the top, but at least it should be clear
who uses the util and max arguments and who doesn't use them after it.
---
drivers/cpufreq/cpufreq_governor.c | 76 +++++++++++++--------------
drivers/cpufreq/intel_pstate.c | 8 +-
include/linux/sched.h | 10 +--
kernel/sched/cpufreq.c | 101 +++++++++++++++++++++++++++++--------
kernel/sched/fair.c | 8 ++
kernel/sched/sched.h | 16 +++++
6 files changed, 150 insertions(+), 69 deletions(-)
Index: linux-pm/include/linux/sched.h
===================================================================
--- linux-pm.orig/include/linux/sched.h
+++ linux-pm/include/linux/sched.h
@@ -2363,15 +2363,15 @@ static inline bool sched_can_stop_tick(v
#endif
#ifdef CONFIG_CPU_FREQ
-void cpufreq_trigger_update(u64 time);
-
struct freq_update_hook {
void (*func)(struct freq_update_hook *hook, u64 time);
+ void (*update_util)(struct freq_update_hook *hook, u64 time,
+ unsigned long util, unsigned long max);
};
-void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook);
-#else
-static inline void cpufreq_trigger_update(u64 time) {}
+void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook,
+ void (*func)(struct freq_update_hook *hook, u64 time));
+void cpufreq_clear_freq_update_hook(int cpu);
#endif
#ifdef CONFIG_SCHED_AUTOGROUP
Index: linux-pm/kernel/sched/cpufreq.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq.c
+++ linux-pm/kernel/sched/cpufreq.c
@@ -9,12 +9,12 @@
* published by the Free Software Foundation.
*/
-#include <linux/sched.h>
+#include "sched.h"
static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook);
/**
- * cpufreq_set_freq_update_hook - Populate the CPU's freq_update_hook pointer.
+ * set_freq_update_hook - Populate the CPU's freq_update_hook pointer.
* @cpu: The CPU to set the pointer for.
* @hook: New pointer value.
*
@@ -27,23 +27,96 @@ static DEFINE_PER_CPU(struct freq_update
* accessed via the old update_util_data pointer or invoke synchronize_sched()
* right after this function to avoid use-after-free.
*/
-void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook)
+static void set_freq_update_hook(int cpu, struct freq_update_hook *hook)
{
- if (WARN_ON(hook && !hook->func))
+ rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook);
+}
+
+/**
+ * cpufreq_set_freq_update_hook - Set the CPU's frequency update callback.
+ * @cpu: The CPU to set the callback for.
+ * @hook: New freq_update_hook pointer value.
+ * @func: Callback function to use with the new hook.
+ */
+void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook,
+ void (*func)(struct freq_update_hook *hook, u64 time))
+{
+ if (WARN_ON(!hook || !func))
return;
- rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook);
+ hook->func = func;
+ set_freq_update_hook(cpu, hook);
}
EXPORT_SYMBOL_GPL(cpufreq_set_freq_update_hook);
/**
+ * cpufreq_set_update_util_hook - Set the CPU's utilization update callback.
+ * @cpu: The CPU to set the callback for.
+ * @hook: New freq_update_hook pointer value.
+ * @update_util: Callback function to use with the new hook.
+ */
+void cpufreq_set_update_util_hook(int cpu, struct freq_update_hook *hook,
+ void (*update_util)(struct freq_update_hook *hook, u64 time,
+ unsigned long util, unsigned long max))
+{
+ if (WARN_ON(!hook || !update_util))
+ return;
+
+ hook->update_util = update_util;
+ set_freq_update_hook(cpu, hook);
+}
+EXPORT_SYMBOL_GPL(cpufreq_set_update_util_hook);
+
+/**
+ * cpufreq_clear_freq_update_hook - Clear the CPU's freq_update_hook pointer.
+ * @cpu: The CPU to clear the pointer for.
+ */
+void cpufreq_clear_freq_update_hook(int cpu)
+{
+ set_freq_update_hook(cpu, NULL);
+}
+EXPORT_SYMBOL_GPL(cpufreq_clear_freq_update_hook);
+
+/**
+ * cpufreq_update_util - Take a note about CPU utilization changes.
+ * @time: Current time.
+ * @util: CPU utilization.
+ * @max: CPU capacity.
+ *
+ * This function is called on every invocation of update_load_avg() on the CPU
+ * whose utilization is being updated.
+ *
+ * It can only be called from RCU-sched read-side critical sections.
+ */
+void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+{
+ struct freq_update_hook *hook;
+
+#ifdef CONFIG_LOCKDEP
+ WARN_ON(debug_locks && !rcu_read_lock_sched_held());
+#endif
+
+ hook = rcu_dereference(*this_cpu_ptr(&cpufreq_freq_update_hook));
+ /*
+ * If this isn't inside of an RCU-sched read-side critical section, hook
+ * may become NULL after the check below.
+ */
+ if (hook) {
+ if (hook->update_util)
+ hook->update_util(hook, time, util, max);
+ else
+ hook->func(hook, time);
+ }
+}
+
+/**
* cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
* @time: Current time.
*
* The way cpufreq is currently arranged requires it to evaluate the CPU
* performance state (frequency/voltage) on a regular basis. To facilitate
- * that, this function is called by update_load_avg() in CFS when executed for
- * the current CPU's runqueue.
+ * that, cpufreq_update_util() is called by update_load_avg() in CFS when
+ * executed for the current CPU's runqueue.
*
* However, this isn't sufficient to prevent the CPU from being stuck in a
* completely inadequate performance level for too long, because the calls
@@ -57,17 +130,5 @@ EXPORT_SYMBOL_GPL(cpufreq_set_freq_updat
*/
void cpufreq_trigger_update(u64 time)
{
- struct freq_update_hook *hook;
-
-#ifdef CONFIG_LOCKDEP
- WARN_ON(debug_locks && !rcu_read_lock_sched_held());
-#endif
-
- hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook));
- /*
- * If this isn't inside of an RCU-sched read-side critical section, hook
- * may become NULL after the check below.
- */
- if (hook)
- hook->func(hook, time);
+ cpufreq_update_util(time, ULONG_MAX, 0);
}
Index: linux-pm/kernel/sched/fair.c
===================================================================
--- linux-pm.orig/kernel/sched/fair.c
+++ linux-pm/kernel/sched/fair.c
@@ -2839,6 +2839,8 @@ static inline void update_load_avg(struc
update_tg_load_avg(cfs_rq, 0);
if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
+ unsigned long max = rq->cpu_capacity_orig;
+
/*
* There are a few boundary cases this might miss but it should
* get called often enough that that should (hopefully) not be
@@ -2847,9 +2849,11 @@ static inline void update_load_avg(struc
* the next tick/schedule should update.
*
* It will not get called when we go idle, because the idle
- * thread is a different class (!fair).
+ * thread is a different class (!fair), nor will the utilization
+ * number include things like RT tasks.
*/
- cpufreq_trigger_update(rq_clock(rq));
+ cpufreq_update_util(rq_clock(rq),
+ min(cfs_rq->avg.util_avg, max), max);
}
}
Index: linux-pm/kernel/sched/sched.h
===================================================================
--- linux-pm.orig/kernel/sched/sched.h
+++ linux-pm/kernel/sched/sched.h
@@ -1739,3 +1739,19 @@ static inline u64 irq_time_read(int cpu)
}
#endif /* CONFIG_64BIT */
#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
+#ifdef CONFIG_CPU_FREQ
+void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
+void cpufreq_trigger_update(u64 time);
+void cpufreq_set_update_util_hook(int cpu, struct freq_update_hook *hook,
+ void (*update_util)(struct freq_update_hook *hook, u64 time,
+ unsigned long util, unsigned long max));
+static inline void cpufreq_clear_update_util_hook(int cpu)
+{
+ cpufreq_clear_freq_update_hook(cpu);
+}
+#else
+static inline void cpufreq_update_util(u64 time, unsigned long util,
+ unsigned long max) {}
+static inline void cpufreq_trigger_update(u64 time) {}
+#endif /* CONFIG_CPU_FREQ */
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -1088,8 +1088,8 @@ static int intel_pstate_init_cpu(unsigne
intel_pstate_busy_pid_reset(cpu);
intel_pstate_sample(cpu, 0);
- cpu->update_hook.func = intel_pstate_freq_update;
- cpufreq_set_freq_update_hook(cpunum, &cpu->update_hook);
+ cpufreq_set_freq_update_hook(cpunum, &cpu->update_hook,
+ intel_pstate_freq_update);
pr_debug("intel_pstate: controlling: cpu %d\n", cpunum);
@@ -1173,7 +1173,7 @@ static void intel_pstate_stop_cpu(struct
pr_debug("intel_pstate: CPU %d exiting\n", cpu_num);
- cpufreq_set_freq_update_hook(cpu_num, NULL);
+ cpufreq_clear_freq_update_hook(cpu_num);
synchronize_sched();
if (hwp_active)
@@ -1441,7 +1441,7 @@ out:
get_online_cpus();
for_each_online_cpu(cpu) {
if (all_cpu_data[cpu]) {
- cpufreq_set_freq_update_hook(cpu, NULL);
+ cpufreq_clear_freq_update_hook(cpu);
synchronize_sched();
kfree(all_cpu_data[cpu]);
}
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -211,43 +211,6 @@ unsigned int dbs_update(struct cpufreq_p
}
EXPORT_SYMBOL_GPL(dbs_update);
-static void gov_set_freq_update_hooks(struct policy_dbs_info *policy_dbs,
- unsigned int delay_us)
-{
- struct cpufreq_policy *policy = policy_dbs->policy;
- int cpu;
-
- gov_update_sample_delay(policy_dbs, delay_us);
- policy_dbs->last_sample_time = 0;
-
- for_each_cpu(cpu, policy->cpus) {
- struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu);
-
- cpufreq_set_freq_update_hook(cpu, &cdbs->update_hook);
- }
-}
-
-static inline void gov_clear_freq_update_hooks(struct cpufreq_policy *policy)
-{
- int i;
-
- for_each_cpu(i, policy->cpus)
- cpufreq_set_freq_update_hook(i, NULL);
-
- synchronize_sched();
-}
-
-static void gov_cancel_work(struct cpufreq_policy *policy)
-{
- struct policy_dbs_info *policy_dbs = policy->governor_data;
-
- gov_clear_freq_update_hooks(policy_dbs->policy);
- irq_work_sync(&policy_dbs->irq_work);
- cancel_work_sync(&policy_dbs->work);
- atomic_set(&policy_dbs->work_count, 0);
- policy_dbs->work_in_progress = false;
-}
-
static void dbs_work_handler(struct work_struct *work)
{
struct policy_dbs_info *policy_dbs;
@@ -334,6 +297,44 @@ static void dbs_freq_update_handler(stru
irq_work_queue(&policy_dbs->irq_work);
}
+static void gov_set_freq_update_hooks(struct policy_dbs_info *policy_dbs,
+ unsigned int delay_us)
+{
+ struct cpufreq_policy *policy = policy_dbs->policy;
+ int cpu;
+
+ gov_update_sample_delay(policy_dbs, delay_us);
+ policy_dbs->last_sample_time = 0;
+
+ for_each_cpu(cpu, policy->cpus) {
+ struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu);
+
+ cpufreq_set_freq_update_hook(cpu, &cdbs->update_hook,
+ dbs_freq_update_handler);
+ }
+}
+
+static inline void gov_clear_freq_update_hooks(struct cpufreq_policy *policy)
+{
+ int i;
+
+ for_each_cpu(i, policy->cpus)
+ cpufreq_clear_freq_update_hook(i);
+
+ synchronize_sched();
+}
+
+static void gov_cancel_work(struct cpufreq_policy *policy)
+{
+ struct policy_dbs_info *policy_dbs = policy->governor_data;
+
+ gov_clear_freq_update_hooks(policy_dbs->policy);
+ irq_work_sync(&policy_dbs->irq_work);
+ cancel_work_sync(&policy_dbs->work);
+ atomic_set(&policy_dbs->work_count, 0);
+ policy_dbs->work_in_progress = false;
+}
+
static struct policy_dbs_info *alloc_policy_dbs_info(struct cpufreq_policy *policy,
struct dbs_governor *gov)
{
@@ -356,7 +357,6 @@ static struct policy_dbs_info *alloc_pol
struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);
j_cdbs->policy_dbs = policy_dbs;
- j_cdbs->update_hook.func = dbs_freq_update_handler;
}
return policy_dbs;
}
From: Rafael J. Wysocki <[email protected]>
Commit fe7034338ba0 (cpufreq: Add mechanism for registering
utilization update callbacks) added cpufreq_update_util() to be
called by the scheduler (from the CFS part) on utilization updates.
The goal was to allow CFS to pass utilization information to cpufreq
and to trigger it to evaluate the frequency/voltage configuration
(P-state) of every CPU on a regular basis.
However, the last two arguments of that function are never used by
the current code, so CFS might simply call cpufreq_trigger_update()
instead of it (like the RT and DL sched classes).
For this reason, drop the last two arguments of cpufreq_update_util(),
rename it to cpufreq_trigger_update() and modify CFS to call it.
Moreover, since the utilization is not involved in that now, rename
data types, functions and variables related to cpufreq_trigger_update()
to reflect that (eg. struct update_util_data becomes struct
freq_update_hook and so on).
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
New patch.
Not strictly necessary, but I like the new names better. :-)
---
drivers/cpufreq/cpufreq.c | 52 +++++++++++++++++++++----------------
drivers/cpufreq/cpufreq_governor.c | 25 ++++++++---------
drivers/cpufreq/cpufreq_governor.h | 2 -
drivers/cpufreq/intel_pstate.c | 15 ++++------
include/linux/cpufreq.h | 32 ++--------------------
kernel/sched/deadline.c | 2 -
kernel/sched/fair.c | 13 +--------
kernel/sched/rt.c | 2 -
8 files changed, 58 insertions(+), 85 deletions(-)
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -65,57 +65,65 @@ static struct cpufreq_driver *cpufreq_dr
static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data);
static DEFINE_RWLOCK(cpufreq_driver_lock);
-static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
+static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook);
/**
- * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer.
+ * cpufreq_set_freq_update_hook - Populate the CPU's freq_update_hook pointer.
* @cpu: The CPU to set the pointer for.
- * @data: New pointer value.
+ * @hook: New pointer value.
*
- * Set and publish the update_util_data pointer for the given CPU. That pointer
- * points to a struct update_util_data object containing a callback function
- * to call from cpufreq_update_util(). That function will be called from an RCU
- * read-side critical section, so it must not sleep.
+ * Set and publish the freq_update_hook pointer for the given CPU. That pointer
+ * points to a struct freq_update_hook object containing a callback function
+ * to call from cpufreq_trigger_update(). That function will be called from
+ * an RCU read-side critical section, so it must not sleep.
*
* Callers must use RCU-sched callbacks to free any memory that might be
* accessed via the old update_util_data pointer or invoke synchronize_sched()
* right after this function to avoid use-after-free.
*/
-void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
+void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook)
{
- if (WARN_ON(data && !data->func))
+ if (WARN_ON(hook && !hook->func))
return;
- rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
+ rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook);
}
-EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
+EXPORT_SYMBOL_GPL(cpufreq_set_freq_update_hook);
/**
- * cpufreq_update_util - Take a note about CPU utilization changes.
+ * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
* @time: Current time.
- * @util: Current utilization.
- * @max: Utilization ceiling.
*
- * This function is called by the scheduler on every invocation of
- * update_load_avg() on the CPU whose utilization is being updated.
+ * The way cpufreq is currently arranged requires it to evaluate the CPU
+ * performance state (frequency/voltage) on a regular basis. To facilitate
+ * that, this function is called by update_load_avg() in CFS when executed for
+ * the current CPU's runqueue.
*
- * It can only be called from RCU-sched read-side critical sections.
+ * However, this isn't sufficient to prevent the CPU from being stuck in a
+ * completely inadequate performance level for too long, because the calls
+ * from CFS will not be made if RT or deadline tasks are active all the time
+ * (or there are RT and DL tasks only).
+ *
+ * As a workaround for that issue, this function is called by the RT and DL
+ * sched classes to trigger extra cpufreq updates to prevent it from stalling,
+ * but that really is a band-aid. Going forward it should be replaced with
+ * solutions targeted more specifically at RT and DL tasks.
*/
-void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+void cpufreq_trigger_update(u64 time)
{
- struct update_util_data *data;
+ struct freq_update_hook *hook;
#ifdef CONFIG_LOCKDEP
WARN_ON(debug_locks && !rcu_read_lock_sched_held());
#endif
- data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
+ hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook));
/*
* If this isn't inside of an RCU-sched read-side critical section, data
* may become NULL after the check below.
*/
- if (data)
- data->func(data, time, util, max);
+ if (hook)
+ hook->func(hook, time);
}
/* Flag to suspend/resume CPUFreq governors */
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -146,35 +146,13 @@ static inline bool policy_is_shared(stru
extern struct kobject *cpufreq_global_kobject;
#ifdef CONFIG_CPU_FREQ
-void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
+void cpufreq_trigger_update(u64 time);
-/**
- * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
- * @time: Current time.
- *
- * The way cpufreq is currently arranged requires it to evaluate the CPU
- * performance state (frequency/voltage) on a regular basis to prevent it from
- * being stuck in a completely inadequate performance level for too long.
- * That is not guaranteed to happen if the updates are only triggered from CFS,
- * though, because they may not be coming in if RT or deadline tasks are active
- * all the time (or there are RT and DL tasks only).
- *
- * As a workaround for that issue, this function is called by the RT and DL
- * sched classes to trigger extra cpufreq updates to prevent it from stalling,
- * but that really is a band-aid. Going forward it should be replaced with
- * solutions targeted more specifically at RT and DL tasks.
- */
-static inline void cpufreq_trigger_update(u64 time)
-{
- cpufreq_update_util(time, ULONG_MAX, 0);
-}
-
-struct update_util_data {
- void (*func)(struct update_util_data *data,
- u64 time, unsigned long util, unsigned long max);
+struct freq_update_hook {
+ void (*func)(struct freq_update_hook *hook, u64 time);
};
-void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);
+void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook);
unsigned int cpufreq_get(unsigned int cpu);
unsigned int cpufreq_quick_get(unsigned int cpu);
@@ -187,8 +165,6 @@ int cpufreq_update_policy(unsigned int c
bool have_governor_per_policy(void);
struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
#else
-static inline void cpufreq_update_util(u64 time, unsigned long util,
- unsigned long max) {}
static inline void cpufreq_trigger_update(u64 time) {}
static inline unsigned int cpufreq_get(unsigned int cpu)
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -62,10 +62,10 @@ ssize_t store_sampling_rate(struct dbs_d
mutex_lock(&policy_dbs->timer_mutex);
/*
* On 32-bit architectures this may race with the
- * sample_delay_ns read in dbs_update_util_handler(), but that
+ * sample_delay_ns read in dbs_freq_update_handler(), but that
* really doesn't matter. If the read returns a value that's
* too big, the sample will be skipped, but the next invocation
- * of dbs_update_util_handler() (when the update has been
+ * of dbs_freq_update_handler() (when the update has been
* completed) will take a sample.
*
* If this runs in parallel with dbs_work_handler(), we may end
@@ -257,7 +257,7 @@ unsigned int dbs_update(struct cpufreq_p
}
EXPORT_SYMBOL_GPL(dbs_update);
-static void gov_set_update_util(struct policy_dbs_info *policy_dbs,
+static void gov_set_freq_update_hooks(struct policy_dbs_info *policy_dbs,
unsigned int delay_us)
{
struct cpufreq_policy *policy = policy_dbs->policy;
@@ -269,16 +269,16 @@ static void gov_set_update_util(struct p
for_each_cpu(cpu, policy->cpus) {
struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu);
- cpufreq_set_update_util_data(cpu, &cdbs->update_util);
+ cpufreq_set_freq_update_hook(cpu, &cdbs->update_hook);
}
}
-static inline void gov_clear_update_util(struct cpufreq_policy *policy)
+static inline void gov_clear_freq_update_hooks(struct cpufreq_policy *policy)
{
int i;
for_each_cpu(i, policy->cpus)
- cpufreq_set_update_util_data(i, NULL);
+ cpufreq_set_freq_update_hook(i, NULL);
synchronize_sched();
}
@@ -287,7 +287,7 @@ static void gov_cancel_work(struct cpufr
{
struct policy_dbs_info *policy_dbs = policy->governor_data;
- gov_clear_update_util(policy_dbs->policy);
+ gov_clear_freq_update_hooks(policy_dbs->policy);
irq_work_sync(&policy_dbs->irq_work);
cancel_work_sync(&policy_dbs->work);
atomic_set(&policy_dbs->work_count, 0);
@@ -331,10 +331,9 @@ static void dbs_irq_work(struct irq_work
schedule_work(&policy_dbs->work);
}
-static void dbs_update_util_handler(struct update_util_data *data, u64 time,
- unsigned long util, unsigned long max)
+static void dbs_freq_update_handler(struct freq_update_hook *hook, u64 time)
{
- struct cpu_dbs_info *cdbs = container_of(data, struct cpu_dbs_info, update_util);
+ struct cpu_dbs_info *cdbs = container_of(hook, struct cpu_dbs_info, update_hook);
struct policy_dbs_info *policy_dbs = cdbs->policy_dbs;
u64 delta_ns, lst;
@@ -403,7 +402,7 @@ static struct policy_dbs_info *alloc_pol
struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);
j_cdbs->policy_dbs = policy_dbs;
- j_cdbs->update_util.func = dbs_update_util_handler;
+ j_cdbs->update_hook.func = dbs_freq_update_handler;
}
return policy_dbs;
}
@@ -419,7 +418,7 @@ static void free_policy_dbs_info(struct
struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);
j_cdbs->policy_dbs = NULL;
- j_cdbs->update_util.func = NULL;
+ j_cdbs->update_hook.func = NULL;
}
gov->free(policy_dbs);
}
@@ -586,7 +585,7 @@ static int cpufreq_governor_start(struct
gov->start(policy);
- gov_set_update_util(policy_dbs, sampling_rate);
+ gov_set_freq_update_hooks(policy_dbs, sampling_rate);
return 0;
}
Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -144,7 +144,7 @@ struct cpu_dbs_info {
* wake-up from idle.
*/
unsigned int prev_load;
- struct update_util_data update_util;
+ struct freq_update_hook update_hook;
struct policy_dbs_info *policy_dbs;
};
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -103,7 +103,7 @@ struct _pid {
struct cpudata {
int cpu;
- struct update_util_data update_util;
+ struct freq_update_hook update_hook;
struct pstate_data pstate;
struct vid_data vid;
@@ -1019,10 +1019,9 @@ static inline void intel_pstate_adjust_b
sample->freq);
}
-static void intel_pstate_update_util(struct update_util_data *data, u64 time,
- unsigned long util, unsigned long max)
+static void intel_pstate_freq_update(struct freq_update_hook *hook, u64 time)
{
- struct cpudata *cpu = container_of(data, struct cpudata, update_util);
+ struct cpudata *cpu = container_of(hook, struct cpudata, update_hook);
u64 delta_ns = time - cpu->sample.time;
if ((s64)delta_ns >= pid_params.sample_rate_ns) {
@@ -1088,8 +1087,8 @@ static int intel_pstate_init_cpu(unsigne
intel_pstate_busy_pid_reset(cpu);
intel_pstate_sample(cpu, 0);
- cpu->update_util.func = intel_pstate_update_util;
- cpufreq_set_update_util_data(cpunum, &cpu->update_util);
+ cpu->update_hook.func = intel_pstate_freq_update;
+ cpufreq_set_freq_update_hook(cpunum, &cpu->update_hook);
pr_debug("intel_pstate: controlling: cpu %d\n", cpunum);
@@ -1173,7 +1172,7 @@ static void intel_pstate_stop_cpu(struct
pr_debug("intel_pstate: CPU %d exiting\n", cpu_num);
- cpufreq_set_update_util_data(cpu_num, NULL);
+ cpufreq_set_freq_update_hook(cpu_num, NULL);
synchronize_sched();
if (hwp_active)
@@ -1441,7 +1440,7 @@ out:
get_online_cpus();
for_each_online_cpu(cpu) {
if (all_cpu_data[cpu]) {
- cpufreq_set_update_util_data(cpu, NULL);
+ cpufreq_set_freq_update_hook(cpu, NULL);
synchronize_sched();
kfree(all_cpu_data[cpu]);
}
Index: linux-pm/kernel/sched/fair.c
===================================================================
--- linux-pm.orig/kernel/sched/fair.c
+++ linux-pm/kernel/sched/fair.c
@@ -2839,8 +2839,6 @@ static inline void update_load_avg(struc
update_tg_load_avg(cfs_rq, 0);
if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
- unsigned long max = rq->cpu_capacity_orig;
-
/*
* There are a few boundary cases this might miss but it should
* get called often enough that that should (hopefully) not be
@@ -2849,16 +2847,9 @@ static inline void update_load_avg(struc
* the next tick/schedule should update.
*
* It will not get called when we go idle, because the idle
- * thread is a different class (!fair), nor will the utilization
- * number include things like RT tasks.
- *
- * As is, the util number is not freq-invariant (we'd have to
- * implement arch_scale_freq_capacity() for that).
- *
- * See cpu_util().
+ * thread is a different class (!fair).
*/
- cpufreq_update_util(rq_clock(rq),
- min(cfs_rq->avg.util_avg, max), max);
+ cpufreq_trigger_update(rq_clock(rq));
}
}
Index: linux-pm/kernel/sched/deadline.c
===================================================================
--- linux-pm.orig/kernel/sched/deadline.c
+++ linux-pm/kernel/sched/deadline.c
@@ -726,7 +726,7 @@ static void update_curr_dl(struct rq *rq
if (!dl_task(curr) || !on_dl_rq(dl_se))
return;
- /* Kick cpufreq (see the comment in linux/cpufreq.h). */
+ /* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
if (cpu_of(rq) == smp_processor_id())
cpufreq_trigger_update(rq_clock(rq));
Index: linux-pm/kernel/sched/rt.c
===================================================================
--- linux-pm.orig/kernel/sched/rt.c
+++ linux-pm/kernel/sched/rt.c
@@ -945,7 +945,7 @@ static void update_curr_rt(struct rq *rq
if (curr->sched_class != &rt_sched_class)
return;
- /* Kick cpufreq (see the comment in linux/cpufreq.h). */
+ /* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
if (cpu_of(rq) == smp_processor_id())
cpufreq_trigger_update(rq_clock(rq));
From: Rafael J. Wysocki <[email protected]>
Create cpufreq.c under kernel/sched/ and move the cpufreq code
related to the scheduler to that file. Also move the headers
related to that code from cpufreq.h to sched.h.
No functional changes.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
New patch.
---
drivers/cpufreq/cpufreq.c | 61 ------------------------------
drivers/cpufreq/cpufreq_governor.c | 1
drivers/cpufreq/intel_pstate.c | 1
include/linux/cpufreq.h | 10 -----
include/linux/sched.h | 12 ++++++
kernel/sched/Makefile | 1
kernel/sched/cpufreq.c | 73 +++++++++++++++++++++++++++++++++++++
7 files changed, 88 insertions(+), 71 deletions(-)
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -65,67 +65,6 @@ static struct cpufreq_driver *cpufreq_dr
static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data);
static DEFINE_RWLOCK(cpufreq_driver_lock);
-static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook);
-
-/**
- * cpufreq_set_freq_update_hook - Populate the CPU's freq_update_hook pointer.
- * @cpu: The CPU to set the pointer for.
- * @hook: New pointer value.
- *
- * Set and publish the freq_update_hook pointer for the given CPU. That pointer
- * points to a struct freq_update_hook object containing a callback function
- * to call from cpufreq_trigger_update(). That function will be called from
- * an RCU read-side critical section, so it must not sleep.
- *
- * Callers must use RCU-sched callbacks to free any memory that might be
- * accessed via the old update_util_data pointer or invoke synchronize_sched()
- * right after this function to avoid use-after-free.
- */
-void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook)
-{
- if (WARN_ON(hook && !hook->func))
- return;
-
- rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook);
-}
-EXPORT_SYMBOL_GPL(cpufreq_set_freq_update_hook);
-
-/**
- * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
- * @time: Current time.
- *
- * The way cpufreq is currently arranged requires it to evaluate the CPU
- * performance state (frequency/voltage) on a regular basis. To facilitate
- * that, this function is called by update_load_avg() in CFS when executed for
- * the current CPU's runqueue.
- *
- * However, this isn't sufficient to prevent the CPU from being stuck in a
- * completely inadequate performance level for too long, because the calls
- * from CFS will not be made if RT or deadline tasks are active all the time
- * (or there are RT and DL tasks only).
- *
- * As a workaround for that issue, this function is called by the RT and DL
- * sched classes to trigger extra cpufreq updates to prevent it from stalling,
- * but that really is a band-aid. Going forward it should be replaced with
- * solutions targeted more specifically at RT and DL tasks.
- */
-void cpufreq_trigger_update(u64 time)
-{
- struct freq_update_hook *hook;
-
-#ifdef CONFIG_LOCKDEP
- WARN_ON(debug_locks && !rcu_read_lock_sched_held());
-#endif
-
- hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook));
- /*
- * If this isn't inside of an RCU-sched read-side critical section, data
- * may become NULL after the check below.
- */
- if (hook)
- hook->func(hook, time);
-}
-
/* Flag to suspend/resume CPUFreq governors */
static bool cpufreq_suspended;
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -18,6 +18,7 @@
#include <linux/export.h>
#include <linux/kernel_stat.h>
+#include <linux/sched.h>
#include <linux/slab.h>
#include "cpufreq_governor.h"
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -21,6 +21,7 @@
#include <linux/list.h>
#include <linux/cpu.h>
#include <linux/cpufreq.h>
+#include <linux/sched.h>
#include <linux/sysfs.h>
#include <linux/types.h>
#include <linux/fs.h>
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -146,14 +146,6 @@ static inline bool policy_is_shared(stru
extern struct kobject *cpufreq_global_kobject;
#ifdef CONFIG_CPU_FREQ
-void cpufreq_trigger_update(u64 time);
-
-struct freq_update_hook {
- void (*func)(struct freq_update_hook *hook, u64 time);
-};
-
-void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook);
-
unsigned int cpufreq_get(unsigned int cpu);
unsigned int cpufreq_quick_get(unsigned int cpu);
unsigned int cpufreq_quick_get_max(unsigned int cpu);
@@ -165,8 +157,6 @@ int cpufreq_update_policy(unsigned int c
bool have_governor_per_policy(void);
struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
#else
-static inline void cpufreq_trigger_update(u64 time) {}
-
static inline unsigned int cpufreq_get(unsigned int cpu)
{
return 0;
Index: linux-pm/include/linux/sched.h
===================================================================
--- linux-pm.orig/include/linux/sched.h
+++ linux-pm/include/linux/sched.h
@@ -2362,6 +2362,18 @@ extern u64 scheduler_tick_max_deferment(
static inline bool sched_can_stop_tick(void) { return false; }
#endif
+#ifdef CONFIG_CPU_FREQ
+void cpufreq_trigger_update(u64 time);
+
+struct freq_update_hook {
+ void (*func)(struct freq_update_hook *hook, u64 time);
+};
+
+void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook);
+#else
+static inline void cpufreq_trigger_update(u64 time) {}
+#endif
+
#ifdef CONFIG_SCHED_AUTOGROUP
extern void sched_autogroup_create_attach(struct task_struct *p);
extern void sched_autogroup_detach(struct task_struct *p);
Index: linux-pm/kernel/sched/Makefile
===================================================================
--- linux-pm.orig/kernel/sched/Makefile
+++ linux-pm/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_gr
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ) += cpufreq.o
Index: linux-pm/kernel/sched/cpufreq.c
===================================================================
--- /dev/null
+++ linux-pm/kernel/sched/cpufreq.c
@@ -0,0 +1,73 @@
+/*
+ * Scheduler code and data structures related to cpufreq.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/sched.h>
+
+static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook);
+
+/**
+ * cpufreq_set_freq_update_hook - Populate the CPU's freq_update_hook pointer.
+ * @cpu: The CPU to set the pointer for.
+ * @hook: New pointer value.
+ *
+ * Set and publish the freq_update_hook pointer for the given CPU. That pointer
+ * points to a struct freq_update_hook object containing a callback function
+ * to call from cpufreq_trigger_update(). That function will be called from
+ * an RCU read-side critical section, so it must not sleep.
+ *
+ * Callers must use RCU-sched callbacks to free any memory that might be
+ * accessed via the old update_util_data pointer or invoke synchronize_sched()
+ * right after this function to avoid use-after-free.
+ */
+void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook)
+{
+ if (WARN_ON(hook && !hook->func))
+ return;
+
+ rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook);
+}
+EXPORT_SYMBOL_GPL(cpufreq_set_freq_update_hook);
+
+/**
+ * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
+ * @time: Current time.
+ *
+ * The way cpufreq is currently arranged requires it to evaluate the CPU
+ * performance state (frequency/voltage) on a regular basis. To facilitate
+ * that, this function is called by update_load_avg() in CFS when executed for
+ * the current CPU's runqueue.
+ *
+ * However, this isn't sufficient to prevent the CPU from being stuck in a
+ * completely inadequate performance level for too long, because the calls
+ * from CFS will not be made if RT or deadline tasks are active all the time
+ * (or there are RT and DL tasks only).
+ *
+ * As a workaround for that issue, this function is called by the RT and DL
+ * sched classes to trigger extra cpufreq updates to prevent it from stalling,
+ * but that really is a band-aid. Going forward it should be replaced with
+ * solutions targeted more specifically at RT and DL tasks.
+ */
+void cpufreq_trigger_update(u64 time)
+{
+ struct freq_update_hook *hook;
+
+#ifdef CONFIG_LOCKDEP
+ WARN_ON(debug_locks && !rcu_read_lock_sched_held());
+#endif
+
+ hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook));
+ /*
+ * If this isn't inside of an RCU-sched read-side critical section, hook
+ * may become NULL after the check below.
+ */
+ if (hook)
+ hook->func(hook, time);
+}
From: Rafael J. Wysocki <[email protected]>
Add a new cpufreq scaling governor, called "schedutil", that uses
scheduler-provided CPU utilization information as input for making
its decisions.
Doing that is possible after commit fe7034338ba0 (cpufreq: Add
mechanism for registering utilization update callbacks), which
introduced cpufreq_update_util(), called by the scheduler on
utilization changes (from CFS) and on RT/DL task status updates.
In particular, CPU frequency scaling decisions may be based on
the utilization data passed to cpufreq_update_util() by CFS.
The new governor is relatively simple.
The frequency selection formula used by it is
next_freq = util * max_freq / max
where util and max are the utilization and CPU capacity coming from CFS.
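For illustration, here is a simplified restatement of that formula as
used by the single-CPU handler added below; example_next_freq() is just
a name made up for this sketch (not part of the patch) and the clamping
to policy->min/max happens separately when the value is committed:

/*
 * Purely illustrative sketch of the selection formula used by
 * sugov_update_single() below.  Example: util = 512, max = 1024,
 * max_freq = 2000000 kHz -> next_freq = 512 * 2000000 / 1024
 * = 1000000 kHz, i.e. half of the maximum frequency.
 */
static unsigned int example_next_freq(unsigned long util, unsigned long max,
				      unsigned int max_freq)
{
	return util > max ? max_freq : util * max_freq / max;
}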
All of the computations are carried out in the utilization update
handlers provided by the new governor. One of those handlers is
used for cpufreq policies shared between multiple CPUs and the other
one is for policies with one CPU only (and therefore it doesn't need
to use any extra synchronization means).
The governor supports fast frequency switching if that is supported
by the cpufreq driver in use and possible for the given policy.
In the fast switching case, all operations of the governor take
place in its utilization update handlers. If fast switching cannot
be used, the frequency switch operations are carried out with the
help of a work item which only calls __cpufreq_driver_target()
(under a mutex) to trigger a frequency update (to a value already
computed beforehand in one of the utilization update handlers).
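Roughly, the commit path described above boils down to the following
(sugov_commit_sketch() is not part of the patch, just a name for this
sketch of sugov_update_commit() below; the real code also clamps
next_freq to the policy limits and bails out early when the frequency
does not actually change):

static void sugov_commit_sketch(struct sugov_policy *sg_policy,
				unsigned int next_freq)
{
	if (sg_policy->policy->fast_switch_possible) {
		/* Fast path: set the frequency here, in scheduler context. */
		cpufreq_driver_fast_switch(sg_policy->policy, next_freq,
					   CPUFREQ_RELATION_L);
	} else {
		/* Slow path: defer to process context via irq_work -> work. */
		sg_policy->work_in_progress = true;
		irq_work_queue(&sg_policy->irq_work);
	}
}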
Currently, the governor treats all of the RT and DL tasks as
"unknown utilization" and sets the frequency to the allowed
maximum when updated from the RT or DL sched classes. That
heavy-handed approach should be replaced with something more
subtle and specifically targeted at RT and DL tasks.
The governor shares some sysfs attributes management code with
the "ondemand" and "conservative" governors and uses some common
definitions from cpufreq.h, but apart from that it is stand-alone.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
Changes from the previous version:
- New frequency selection formula and modifications related to that.
- The file is now located in kernel/sched/.
Initially, I had hoped that it would be possible to split the code
into a library part that might go into kernel/sched/ and the governor
interface plus sysfs-related code, but that split would have been
artificial and I wanted the governor to be one module as a whole. So
that didn't work out.
Also, the way it is configured and built is somewhat bizarre: the
Kconfig options are in the cpufreq Kconfig, while the code they are
related to is located in kernel/sched/.
Overall, I'd be happier if the governor could stay in drivers/cpufreq/.
---
drivers/cpufreq/Kconfig | 26 +
drivers/cpufreq/cpufreq_governor.h | 1
include/linux/cpufreq.h | 3
kernel/sched/Makefile | 1
kernel/sched/cpufreq_schedutil.c | 487 +++++++++++++++++++++++++++++++++++++
5 files changed, 517 insertions(+), 1 deletion(-)
Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -107,6 +107,16 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
Be aware that not all cpufreq drivers support the conservative
governor. If unsure have a look at the help section of the
driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+ bool "schedutil"
+ select CPU_FREQ_GOV_SCHEDUTIL
+ select CPU_FREQ_GOV_PERFORMANCE
+ help
+ Use the 'schedutil' CPUFreq governor by default. If unsure,
+ have a look at the help section of that governor. The fallback
+ governor will be 'performance'.
+
endchoice
config CPU_FREQ_GOV_PERFORMANCE
@@ -188,6 +198,22 @@ config CPU_FREQ_GOV_CONSERVATIVE
If in doubt, say N.
+config CPU_FREQ_GOV_SCHEDUTIL
+ tristate "'schedutil' cpufreq policy governor"
+ depends on CPU_FREQ
+ select CPU_FREQ_GOV_ATTR_SET
+ select IRQ_WORK
+ help
+ The frequency selection formula used by this governor is analogous
+ to the one used by 'ondemand', but instead of computing CPU load
+ as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
+ utilization data provided by the scheduler as input.
+
+ To compile this driver as a module, choose M here: the
+ module will be called cpufreq_schedutil.
+
+ If in doubt, say N.
+
comment "CPU frequency scaling drivers"
config CPUFREQ_DT
Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- /dev/null
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -0,0 +1,487 @@
+/*
+ * CPUFreq governor based on scheduler-provided CPU utilization data.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+
+#include "sched.h"
+
+struct sugov_tunables {
+ struct gov_attr_set attr_set;
+ unsigned int rate_limit_us;
+};
+
+struct sugov_policy {
+ struct cpufreq_policy *policy;
+
+ struct sugov_tunables *tunables;
+ struct list_head tunables_hook;
+
+ raw_spinlock_t update_lock; /* For shared policies */
+ u64 last_freq_update_time;
+ s64 freq_update_delay_ns;
+ unsigned int next_freq;
+
+ /* The next fields are only needed if fast switch cannot be used. */
+ struct irq_work irq_work;
+ struct work_struct work;
+ struct mutex work_lock;
+ bool work_in_progress;
+
+ bool need_freq_update;
+};
+
+struct sugov_cpu {
+ struct freq_update_hook update_hook;
+ struct sugov_policy *sg_policy;
+
+ /* The fields below are only needed when sharing a policy. */
+ unsigned long util;
+ unsigned long max;
+ u64 last_update;
+};
+
+static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
+
+/************************ Governor internals ***********************/
+
+static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
+{
+ u64 delta_ns;
+
+ if (sg_policy->work_in_progress)
+ return false;
+
+ if (unlikely(sg_policy->need_freq_update)) {
+ sg_policy->need_freq_update = false;
+ return true;
+ }
+
+ delta_ns = time - sg_policy->last_freq_update_time;
+ return (s64)delta_ns >= sg_policy->freq_update_delay_ns;
+}
+
+static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
+ unsigned int next_freq)
+{
+ struct cpufreq_policy *policy = sg_policy->policy;
+
+ if (next_freq > policy->max)
+ next_freq = policy->max;
+ else if (next_freq < policy->min)
+ next_freq = policy->min;
+
+ sg_policy->last_freq_update_time = time;
+ if (sg_policy->next_freq == next_freq)
+ return;
+
+ sg_policy->next_freq = next_freq;
+ if (policy->fast_switch_possible) {
+ cpufreq_driver_fast_switch(policy, next_freq, CPUFREQ_RELATION_L);
+ } else {
+ sg_policy->work_in_progress = true;
+ irq_work_queue(&sg_policy->irq_work);
+ }
+}
+
+static void sugov_update_single(struct freq_update_hook *hook, u64 time,
+ unsigned long util, unsigned long max)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ unsigned int max_f, next_f;
+
+ if (!sugov_should_update_freq(sg_policy, time))
+ return;
+
+ max_f = sg_policy->policy->cpuinfo.max_freq;
+ next_f = util > max ? max_f : util * max_f / max;
+ sugov_update_commit(sg_policy, time, next_f);
+}
+
+static unsigned int sugov_next_freq(struct sugov_policy *sg_policy,
+ unsigned long util, unsigned long max)
+{
+ struct cpufreq_policy *policy = sg_policy->policy;
+ unsigned int max_f = policy->cpuinfo.max_freq;
+ u64 last_freq_update_time = sg_policy->last_freq_update_time;
+ unsigned int j;
+
+ if (util > max)
+ return max_f;
+
+ for_each_cpu(j, policy->cpus) {
+ struct sugov_cpu *j_sg_cpu;
+ unsigned long j_util, j_max;
+ u64 delta_ns;
+
+ if (j == smp_processor_id())
+ continue;
+
+ j_sg_cpu = &per_cpu(sugov_cpu, j);
+ /*
+ * If the CPU utilization was last updated before the previous
+ * frequency update and the time elapsed between the last update
+ * of the CPU utilization and the last frequency update is long
+ * enough, don't take the CPU into account as it probably is
+ * idle now.
+ */
+ delta_ns = last_freq_update_time - j_sg_cpu->last_update;
+ if ((s64)delta_ns > NSEC_PER_SEC / HZ)
+ continue;
+
+ j_util = j_sg_cpu->util;
+ j_max = j_sg_cpu->max;
+ if (j_util > j_max)
+ return max_f;
+
+ if (j_util * max > j_max * util) {
+ util = j_util;
+ max = j_max;
+ }
+ }
+
+ return util * max_f / max;
+}
+
+static void sugov_update_shared(struct freq_update_hook *hook, u64 time,
+ unsigned long util, unsigned long max)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ unsigned int next_f;
+
+ raw_spin_lock(&sg_policy->update_lock);
+
+ sg_cpu->util = util;
+ sg_cpu->max = max;
+ sg_cpu->last_update = time;
+
+ if (sugov_should_update_freq(sg_policy, time)) {
+ next_f = sugov_next_freq(sg_policy, util, max);
+ sugov_update_commit(sg_policy, time, next_f);
+ }
+
+ raw_spin_unlock(&sg_policy->update_lock);
+}
+
+static void sugov_work(struct work_struct *work)
+{
+ struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
+
+ mutex_lock(&sg_policy->work_lock);
+ __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
+ CPUFREQ_RELATION_L);
+ mutex_unlock(&sg_policy->work_lock);
+
+ sg_policy->work_in_progress = false;
+}
+
+static void sugov_irq_work(struct irq_work *irq_work)
+{
+ struct sugov_policy *sg_policy;
+
+ sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
+ schedule_work(&sg_policy->work);
+}
+
+/************************** sysfs interface ************************/
+
+static struct sugov_tunables *global_tunables;
+static DEFINE_MUTEX(global_tunables_lock);
+
+static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set)
+{
+ return container_of(attr_set, struct sugov_tunables, attr_set);
+}
+
+static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+
+ return sprintf(buf, "%u\n", tunables->rate_limit_us);
+}
+
+static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf,
+ size_t count)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+ struct sugov_policy *sg_policy;
+ unsigned int rate_limit_us;
+ int ret;
+
+ ret = sscanf(buf, "%u", &rate_limit_us);
+ if (ret != 1)
+ return -EINVAL;
+
+ tunables->rate_limit_us = rate_limit_us;
+
+ list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
+ sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
+
+ return count;
+}
+
+static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
+
+static struct attribute *sugov_attributes[] = {
+ &rate_limit_us.attr,
+ NULL
+};
+
+static struct kobj_type sugov_tunables_ktype = {
+ .default_attrs = sugov_attributes,
+ .sysfs_ops = &governor_sysfs_ops,
+};
+
+/********************** cpufreq governor interface *********************/
+
+static struct cpufreq_governor schedutil_gov;
+
+static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy;
+
+ sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
+ if (!sg_policy)
+ return NULL;
+
+ sg_policy->policy = policy;
+ init_irq_work(&sg_policy->irq_work, sugov_irq_work);
+ INIT_WORK(&sg_policy->work, sugov_work);
+ mutex_init(&sg_policy->work_lock);
+ raw_spin_lock_init(&sg_policy->update_lock);
+ return sg_policy;
+}
+
+static void sugov_policy_free(struct sugov_policy *sg_policy)
+{
+ mutex_destroy(&sg_policy->work_lock);
+ kfree(sg_policy);
+}
+
+static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy)
+{
+ struct sugov_tunables *tunables;
+
+ tunables = kzalloc(sizeof(*tunables), GFP_KERNEL);
+ if (tunables)
+ gov_attr_set_init(&tunables->attr_set, &sg_policy->tunables_hook);
+
+ return tunables;
+}
+
+static void sugov_tunables_free(struct sugov_tunables *tunables)
+{
+ if (!have_governor_per_policy())
+ global_tunables = NULL;
+
+ kfree(tunables);
+}
+
+static int sugov_init(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy;
+ struct sugov_tunables *tunables;
+ unsigned int lat;
+ int ret = 0;
+
+ /* State should be equivalent to EXIT */
+ if (policy->governor_data)
+ return -EBUSY;
+
+ sg_policy = sugov_policy_alloc(policy);
+ if (!sg_policy)
+ return -ENOMEM;
+
+ mutex_lock(&global_tunables_lock);
+
+ if (global_tunables) {
+ if (WARN_ON(have_governor_per_policy())) {
+ ret = -EINVAL;
+ goto free_sg_policy;
+ }
+ policy->governor_data = sg_policy;
+ sg_policy->tunables = global_tunables;
+
+ gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook);
+ goto out;
+ }
+
+ tunables = sugov_tunables_alloc(sg_policy);
+ if (!tunables) {
+ ret = -ENOMEM;
+ goto free_sg_policy;
+ }
+
+ tunables->rate_limit_us = LATENCY_MULTIPLIER;
+ lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
+ if (lat)
+ tunables->rate_limit_us *= lat;
+
+ if (!have_governor_per_policy())
+ global_tunables = tunables;
+
+ policy->governor_data = sg_policy;
+ sg_policy->tunables = tunables;
+
+ ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
+ get_governor_parent_kobj(policy), "%s",
+ schedutil_gov.name);
+ if (!ret)
+ goto out;
+
+ /* Failure, so roll back. */
+ policy->governor_data = NULL;
+ sugov_tunables_free(tunables);
+
+ free_sg_policy:
+ pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
+ sugov_policy_free(sg_policy);
+
+ out:
+ mutex_unlock(&global_tunables_lock);
+ return ret;
+}
+
+static int sugov_exit(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ struct sugov_tunables *tunables = sg_policy->tunables;
+ unsigned int count;
+
+ mutex_lock(&global_tunables_lock);
+
+ count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook);
+ policy->governor_data = NULL;
+ if (!count)
+ sugov_tunables_free(tunables);
+
+ mutex_unlock(&global_tunables_lock);
+
+ sugov_policy_free(sg_policy);
+ return 0;
+}
+
+static int sugov_start(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ unsigned int cpu;
+
+ sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+ sg_policy->last_freq_update_time = 0;
+ sg_policy->next_freq = UINT_MAX;
+ sg_policy->work_in_progress = false;
+ sg_policy->need_freq_update = false;
+
+ for_each_cpu(cpu, policy->cpus) {
+ struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
+
+ sg_cpu->sg_policy = sg_policy;
+ if (policy_is_shared(policy)) {
+ sg_cpu->util = ULONG_MAX;
+ sg_cpu->max = 0;
+ sg_cpu->last_update = 0;
+ cpufreq_set_update_util_hook(cpu, &sg_cpu->update_hook,
+ sugov_update_shared);
+ } else {
+ cpufreq_set_update_util_hook(cpu, &sg_cpu->update_hook,
+ sugov_update_single);
+ }
+ }
+ return 0;
+}
+
+static int sugov_stop(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ unsigned int cpu;
+
+ for_each_cpu(cpu, policy->cpus)
+ cpufreq_clear_update_util_hook(cpu);
+
+ synchronize_sched();
+
+ irq_work_sync(&sg_policy->irq_work);
+ cancel_work_sync(&sg_policy->work);
+ return 0;
+}
+
+static int sugov_limits(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+
+ if (!policy->fast_switch_possible) {
+ mutex_lock(&sg_policy->work_lock);
+
+ if (policy->max < policy->cur)
+ __cpufreq_driver_target(policy, policy->max,
+ CPUFREQ_RELATION_H);
+ else if (policy->min > policy->cur)
+ __cpufreq_driver_target(policy, policy->min,
+ CPUFREQ_RELATION_L);
+
+ mutex_unlock(&sg_policy->work_lock);
+ }
+
+ sg_policy->need_freq_update = true;
+ return 0;
+}
+
+int sugov_governor(struct cpufreq_policy *policy, unsigned int event)
+{
+ if (event == CPUFREQ_GOV_POLICY_INIT) {
+ return sugov_init(policy);
+ } else if (policy->governor_data) {
+ switch (event) {
+ case CPUFREQ_GOV_POLICY_EXIT:
+ return sugov_exit(policy);
+ case CPUFREQ_GOV_START:
+ return sugov_start(policy);
+ case CPUFREQ_GOV_STOP:
+ return sugov_stop(policy);
+ case CPUFREQ_GOV_LIMITS:
+ return sugov_limits(policy);
+ }
+ }
+ return -EINVAL;
+}
+
+static struct cpufreq_governor schedutil_gov = {
+ .name = "schedutil",
+ .governor = sugov_governor,
+ .owner = THIS_MODULE,
+};
+
+static int __init sugov_module_init(void)
+{
+ return cpufreq_register_governor(&schedutil_gov);
+}
+
+static void __exit sugov_module_exit(void)
+{
+ cpufreq_unregister_governor(&schedutil_gov);
+}
+
+MODULE_AUTHOR("Rafael J. Wysocki <[email protected]>");
+MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
+MODULE_LICENSE("GPL");
+
+#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+struct cpufreq_governor *cpufreq_default_governor(void)
+{
+ return &schedutil_gov;
+}
+
+fs_initcall(sugov_module_init);
+#else
+module_init(sugov_module_init);
+#endif
+module_exit(sugov_module_exit);
Index: linux-pm/kernel/sched/Makefile
===================================================================
--- linux-pm.orig/kernel/sched/Makefile
+++ linux-pm/kernel/sched/Makefile
@@ -20,3 +20,4 @@ obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -34,7 +34,6 @@
* this governor will not work. All times here are in us (micro seconds).
*/
#define MIN_SAMPLING_RATE_RATIO (2)
-#define LATENCY_MULTIPLIER (1000)
#define MIN_LATENCY_MULTIPLIER (20)
#define TRANSITION_LATENCY_LIMIT (10 * 1000 * 1000)
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -468,6 +468,9 @@ void cpufreq_unregister_governor(struct
struct cpufreq_governor *cpufreq_default_governor(void);
struct cpufreq_governor *cpufreq_fallback_governor(void);
+/* Coefficient for computing default sampling rate/rate limit in governors */
+#define LATENCY_MULTIPLIER (1000)
+
/* Governor attribute set */
struct gov_attr_set {
struct kobject kobj;
On Wednesday, March 02, 2016 02:56:28 AM Rafael J. Wysocki wrote:
> Hi,
>
> My previous intro message still applies somewhat, so here's a link:
>
> http://marc.info/?l=linux-pm&m=145609673008122&w=2
>
> The executive summary of the motivation is that I wanted to do two things:
> use the utilization data from the scheduler (it's passed to the governor
> as aguments of update callbacks anyway) and make it possible to set
> CPU frequency without involving process context (fast frequency switching).
>
> Both have been prototyped in the previous RFCs:
>
> https://patchwork.kernel.org/patch/8426691/
> https://patchwork.kernel.org/patch/8426741/
>
[cut]
>
> Comments welcome.
There were quite a few comments to address, so here's a new version.
First off, my interpretation of what Ingo said earlier today (or yesterday
depending on your time zone) is that he wants all of the code dealing with
the util and max values to be located in kernel/sched/. I can understand
the motivation here, although schedutil shares some amount of code with
the other governors, so the dependency on cpufreq will still be there, even
if the code goes to kernel/sched/. Nevertheless, I decided to make that
change just to see how it would look, if nothing else.
To that end, I revived a patch I had before the first schedutil one to
remove util/max from the cpufreq hooks [7/10], moved the scheduler-related
code from drivers/cpufreq/cpufreq.c to kernel/sched/cpufreq.c (new file)
on top of that [8/10] and reintroduced cpufreq_update_util() in a slightly
different form [9/10]. I did it this way in case it turns out to be
necessary to apply [7/10] and [8/10] for the time being and defer the rest
to the next cycle.
Apart from that, I changed the frequency selection formula in the new
governor to next_freq = util * max_freq / max and it seems to work. That
allowed the code to be simplified somewhat as I don't need the extra
relation field in struct sugov_policy now (RELATION_L is used everywhere).
Finally, I tried to address the bikeshed comment from Viresh about the
"wrong" names of data types etc related to governor sysfs attributes
handling. Hopefully, the new ones are better.
There are small tweaks all over on top of that.
Thanks,
Rafael
From: Rafael J. Wysocki <[email protected]>
In addition to fields representing governor tunables, struct dbs_data
contains some fields needed for the management of objects of that
type. As it turns out, that part of struct dbs_data may be shared
with (future) governors that won't use the common code used by
"ondemand" and "conservative", so move it to a separate struct type
and modify the code using struct dbs_data to follow.
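The intended usage of the new type looks like this (just a sketch based
on the helpers introduced further below; error handling and the
surrounding locking of the callers are omitted):

	/* First policy using a given set of tunables: */
	gov_attr_set_init(&dbs_data->attr_set, &policy_dbs->list);

	/* Another policy attaching to already existing tunables: */
	gov_attr_set_get(&dbs_data->attr_set, &policy_dbs->list);

	/* On governor exit; free dbs_data only when the last user is gone: */
	if (!gov_attr_set_put(&dbs_data->attr_set, &policy_dbs->list))
		kfree(dbs_data);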
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
Changes from the previous version:
- The new data type is called gov_attr_set now (instead of gov_tunables)
and some variable names etc have been changed to follow.
---
drivers/cpufreq/cpufreq_conservative.c | 25 +++++----
drivers/cpufreq/cpufreq_governor.c | 90 ++++++++++++++++++++-------------
drivers/cpufreq/cpufreq_governor.h | 35 +++++++-----
drivers/cpufreq/cpufreq_ondemand.c | 29 ++++++----
4 files changed, 107 insertions(+), 72 deletions(-)
Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -41,6 +41,13 @@
/* Ondemand Sampling types */
enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};
+struct gov_attr_set {
+ struct kobject kobj;
+ struct list_head policy_list;
+ struct mutex update_lock;
+ int usage_count;
+};
+
/*
* Abbreviations:
* dbs: used as a shortform for demand based switching It helps to keep variable
@@ -52,7 +59,7 @@ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};
/* Governor demand based switching data (per-policy or global). */
struct dbs_data {
- int usage_count;
+ struct gov_attr_set attr_set;
void *tuners;
unsigned int min_sampling_rate;
unsigned int ignore_nice_load;
@@ -60,37 +67,35 @@ struct dbs_data {
unsigned int sampling_down_factor;
unsigned int up_threshold;
unsigned int io_is_busy;
-
- struct kobject kobj;
- struct list_head policy_dbs_list;
- /*
- * Protect concurrent updates to governor tunables from sysfs,
- * policy_dbs_list and usage_count.
- */
- struct mutex mutex;
};
+static inline struct dbs_data *to_dbs_data(struct gov_attr_set *attr_set)
+{
+ return container_of(attr_set, struct dbs_data, attr_set);
+}
+
/* Governor's specific attributes */
-struct dbs_data;
struct governor_attr {
struct attribute attr;
- ssize_t (*show)(struct dbs_data *dbs_data, char *buf);
- ssize_t (*store)(struct dbs_data *dbs_data, const char *buf,
+ ssize_t (*show)(struct gov_attr_set *attr_set, char *buf);
+ ssize_t (*store)(struct gov_attr_set *attr_set, const char *buf,
size_t count);
};
#define gov_show_one(_gov, file_name) \
static ssize_t show_##file_name \
-(struct dbs_data *dbs_data, char *buf) \
+(struct gov_attr_set *attr_set, char *buf) \
{ \
+ struct dbs_data *dbs_data = to_dbs_data(attr_set); \
struct _gov##_dbs_tuners *tuners = dbs_data->tuners; \
return sprintf(buf, "%u\n", tuners->file_name); \
}
#define gov_show_one_common(file_name) \
static ssize_t show_##file_name \
-(struct dbs_data *dbs_data, char *buf) \
+(struct gov_attr_set *attr_set, char *buf) \
{ \
+ struct dbs_data *dbs_data = to_dbs_data(attr_set); \
return sprintf(buf, "%u\n", dbs_data->file_name); \
}
@@ -184,7 +189,7 @@ void od_register_powersave_bias_handler(
(struct cpufreq_policy *, unsigned int, unsigned int),
unsigned int powersave_bias);
void od_unregister_powersave_bias_handler(void);
-ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf,
+ssize_t store_sampling_rate(struct gov_attr_set *attr_set, const char *buf,
size_t count);
void gov_update_cpu_data(struct dbs_data *dbs_data);
#endif /* _CPUFREQ_GOVERNOR_H */
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -43,9 +43,10 @@ static DEFINE_MUTEX(gov_dbs_data_mutex);
* This must be called with dbs_data->mutex held, otherwise traversing
* policy_dbs_list isn't safe.
*/
-ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf,
+ssize_t store_sampling_rate(struct gov_attr_set *attr_set, const char *buf,
size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct policy_dbs_info *policy_dbs;
unsigned int rate;
int ret;
@@ -59,7 +60,7 @@ ssize_t store_sampling_rate(struct dbs_d
* We are operating under dbs_data->mutex and so the list and its
* entries can't be freed concurrently.
*/
- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) {
+ list_for_each_entry(policy_dbs, &attr_set->policy_list, list) {
mutex_lock(&policy_dbs->timer_mutex);
/*
* On 32-bit architectures this may race with the
@@ -96,7 +97,7 @@ void gov_update_cpu_data(struct dbs_data
{
struct policy_dbs_info *policy_dbs;
- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) {
+ list_for_each_entry(policy_dbs, &dbs_data->attr_set.policy_list, list) {
unsigned int j;
for_each_cpu(j, policy_dbs->policy->cpus) {
@@ -111,9 +112,9 @@ void gov_update_cpu_data(struct dbs_data
}
EXPORT_SYMBOL_GPL(gov_update_cpu_data);
-static inline struct dbs_data *to_dbs_data(struct kobject *kobj)
+static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj)
{
- return container_of(kobj, struct dbs_data, kobj);
+ return container_of(kobj, struct gov_attr_set, kobj);
}
static inline struct governor_attr *to_gov_attr(struct attribute *attr)
@@ -124,25 +125,24 @@ static inline struct governor_attr *to_g
static ssize_t governor_show(struct kobject *kobj, struct attribute *attr,
char *buf)
{
- struct dbs_data *dbs_data = to_dbs_data(kobj);
struct governor_attr *gattr = to_gov_attr(attr);
- return gattr->show(dbs_data, buf);
+ return gattr->show(to_gov_attr_set(kobj), buf);
}
static ssize_t governor_store(struct kobject *kobj, struct attribute *attr,
const char *buf, size_t count)
{
- struct dbs_data *dbs_data = to_dbs_data(kobj);
+ struct gov_attr_set *attr_set = to_gov_attr_set(kobj);
struct governor_attr *gattr = to_gov_attr(attr);
int ret = -EBUSY;
- mutex_lock(&dbs_data->mutex);
+ mutex_lock(&attr_set->update_lock);
- if (dbs_data->usage_count)
- ret = gattr->store(dbs_data, buf, count);
+ if (attr_set->usage_count)
+ ret = gattr->store(attr_set, buf, count);
- mutex_unlock(&dbs_data->mutex);
+ mutex_unlock(&attr_set->update_lock);
return ret;
}
@@ -424,6 +424,41 @@ static void free_policy_dbs_info(struct
gov->free(policy_dbs);
}
+static void gov_attr_set_init(struct gov_attr_set *attr_set,
+ struct list_head *list_node)
+{
+ INIT_LIST_HEAD(&attr_set->policy_list);
+ mutex_init(&attr_set->update_lock);
+ attr_set->usage_count = 1;
+ list_add(list_node, &attr_set->policy_list);
+}
+
+static void gov_attr_set_get(struct gov_attr_set *attr_set,
+ struct list_head *list_node)
+{
+ mutex_lock(&attr_set->update_lock);
+ attr_set->usage_count++;
+ list_add(list_node, &attr_set->policy_list);
+ mutex_unlock(&attr_set->update_lock);
+}
+
+static unsigned int gov_attr_set_put(struct gov_attr_set *attr_set,
+ struct list_head *list_node)
+{
+ unsigned int count;
+
+ mutex_lock(&attr_set->update_lock);
+ list_del(list_node);
+ count = --attr_set->usage_count;
+ mutex_unlock(&attr_set->update_lock);
+ if (count)
+ return count;
+
+ kobject_put(&attr_set->kobj);
+ mutex_destroy(&attr_set->update_lock);
+ return 0;
+}
+
static int cpufreq_governor_init(struct cpufreq_policy *policy)
{
struct dbs_governor *gov = dbs_governor_of(policy);
@@ -452,10 +487,7 @@ static int cpufreq_governor_init(struct
policy_dbs->dbs_data = dbs_data;
policy->governor_data = policy_dbs;
- mutex_lock(&dbs_data->mutex);
- dbs_data->usage_count++;
- list_add(&policy_dbs->list, &dbs_data->policy_dbs_list);
- mutex_unlock(&dbs_data->mutex);
+ gov_attr_set_get(&dbs_data->attr_set, &policy_dbs->list);
goto out;
}
@@ -465,8 +497,7 @@ static int cpufreq_governor_init(struct
goto free_policy_dbs_info;
}
- INIT_LIST_HEAD(&dbs_data->policy_dbs_list);
- mutex_init(&dbs_data->mutex);
+ gov_attr_set_init(&dbs_data->attr_set, &policy_dbs->list);
ret = gov->init(dbs_data, !policy->governor->initialized);
if (ret)
@@ -486,14 +517,11 @@ static int cpufreq_governor_init(struct
if (!have_governor_per_policy())
gov->gdbs_data = dbs_data;
- policy->governor_data = policy_dbs;
-
policy_dbs->dbs_data = dbs_data;
- dbs_data->usage_count = 1;
- list_add(&policy_dbs->list, &dbs_data->policy_dbs_list);
+ policy->governor_data = policy_dbs;
gov->kobj_type.sysfs_ops = &governor_sysfs_ops;
- ret = kobject_init_and_add(&dbs_data->kobj, &gov->kobj_type,
+ ret = kobject_init_and_add(&dbs_data->attr_set.kobj, &gov->kobj_type,
get_governor_parent_kobj(policy),
"%s", gov->gov.name);
if (!ret)
@@ -522,29 +550,21 @@ static int cpufreq_governor_exit(struct
struct dbs_governor *gov = dbs_governor_of(policy);
struct policy_dbs_info *policy_dbs = policy->governor_data;
struct dbs_data *dbs_data = policy_dbs->dbs_data;
- int count;
+ unsigned int count;
/* Protect gov->gdbs_data against concurrent updates. */
mutex_lock(&gov_dbs_data_mutex);
- mutex_lock(&dbs_data->mutex);
- list_del(&policy_dbs->list);
- count = --dbs_data->usage_count;
- mutex_unlock(&dbs_data->mutex);
+ count = gov_attr_set_put(&dbs_data->attr_set, &policy_dbs->list);
- if (!count) {
- kobject_put(&dbs_data->kobj);
-
- policy->governor_data = NULL;
+ policy->governor_data = NULL;
+ if (!count) {
if (!have_governor_per_policy())
gov->gdbs_data = NULL;
gov->exit(dbs_data, policy->governor->initialized == 1);
- mutex_destroy(&dbs_data->mutex);
kfree(dbs_data);
- } else {
- policy->governor_data = NULL;
}
free_policy_dbs_info(policy_dbs, gov);
Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
+++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
@@ -207,9 +207,10 @@ static unsigned int od_dbs_timer(struct
/************************** sysfs interface ************************/
static struct dbs_governor od_dbs_gov;
-static ssize_t store_io_is_busy(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_io_is_busy(struct gov_attr_set *attr_set, const char *buf,
+ size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;
@@ -224,9 +225,10 @@ static ssize_t store_io_is_busy(struct d
return count;
}
-static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_up_threshold(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
@@ -240,9 +242,10 @@ static ssize_t store_up_threshold(struct
return count;
}
-static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data,
- const char *buf, size_t count)
+static ssize_t store_sampling_down_factor(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct policy_dbs_info *policy_dbs;
unsigned int input;
int ret;
@@ -254,7 +257,7 @@ static ssize_t store_sampling_down_facto
dbs_data->sampling_down_factor = input;
/* Reset down sampling multiplier in case it was active */
- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) {
+ list_for_each_entry(policy_dbs, &attr_set->policy_list, list) {
/*
* Doing this without locking might lead to using different
* rate_mult values in od_update() and od_dbs_timer().
@@ -267,9 +270,10 @@ static ssize_t store_sampling_down_facto
return count;
}
-static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data,
- const char *buf, size_t count)
+static ssize_t store_ignore_nice_load(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;
@@ -291,9 +295,10 @@ static ssize_t store_ignore_nice_load(st
return count;
}
-static ssize_t store_powersave_bias(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_powersave_bias(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct od_dbs_tuners *od_tuners = dbs_data->tuners;
struct policy_dbs_info *policy_dbs;
unsigned int input;
@@ -308,7 +313,7 @@ static ssize_t store_powersave_bias(stru
od_tuners->powersave_bias = input;
- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list)
+ list_for_each_entry(policy_dbs, &attr_set->policy_list, list)
ondemand_powersave_bias_init(policy_dbs->policy);
return count;
Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c
+++ linux-pm/drivers/cpufreq/cpufreq_conservative.c
@@ -129,9 +129,10 @@ static struct notifier_block cs_cpufreq_
/************************** sysfs interface ************************/
static struct dbs_governor cs_dbs_gov;
-static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data,
- const char *buf, size_t count)
+static ssize_t store_sampling_down_factor(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
@@ -143,9 +144,10 @@ static ssize_t store_sampling_down_facto
return count;
}
-static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_up_threshold(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input;
int ret;
@@ -158,9 +160,10 @@ static ssize_t store_up_threshold(struct
return count;
}
-static ssize_t store_down_threshold(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_down_threshold(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input;
int ret;
@@ -175,9 +178,10 @@ static ssize_t store_down_threshold(stru
return count;
}
-static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data,
- const char *buf, size_t count)
+static ssize_t store_ignore_nice_load(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;
@@ -199,9 +203,10 @@ static ssize_t store_ignore_nice_load(st
return count;
}
-static ssize_t store_freq_step(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_freq_step(struct gov_attr_set *attr_set, const char *buf,
+ size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input;
int ret;
From: Rafael J. Wysocki <[email protected]>
Setting a new CPU frequency and reading the current request value
in the ACPI cpufreq driver involves each at least two switch
instructions (there's more if the policy is shared). One of
them is present in drv_read/write() that prepares a command
structure and the other happens in subsequent do_drv_read/write()
when that structure is interpreted. However, all of those switches
may be avoided by using function pointers.
To that end, add two function pointers to struct acpi_cpufreq_data
to represent read and write operations on the frequency register
and set them up during policy initialization to point to the pair
of routines suitable for the given processor (Intel/AMD MSR access
or I/O port access). Then, use those pointers in do_drv_read/write()
and modify drv_read/write() to prepare the command structure for
them without any checks.
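In other words, the accessors are picked once at initialization time and
the hot path simply calls through the pointers (a sketch of the pattern;
the full patch below also handles the AMD MSR and I/O port cases):

	/* At policy initialization time (Intel MSR case shown): */
	if (check_est_cpu(cpu)) {
		data->cpu_freq_read = cpu_freq_read_intel;
		data->cpu_freq_write = cpu_freq_write_intel;
	}

	/* On the frequency-setting path, no switch () is needed any more: */
	data->cpu_freq_write(&perf->control_register,
			     perf->states[next_perf_state].control);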
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
No changes.
---
drivers/cpufreq/acpi-cpufreq.c | 208 ++++++++++++++++++-----------------------
1 file changed, 95 insertions(+), 113 deletions(-)
Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
+++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
@@ -70,6 +70,8 @@ struct acpi_cpufreq_data {
unsigned int cpu_feature;
unsigned int acpi_perf_cpu;
cpumask_var_t freqdomain_cpus;
+ void (*cpu_freq_write)(struct acpi_pct_register *reg, u32 val);
+ u32 (*cpu_freq_read)(struct acpi_pct_register *reg);
};
/* acpi_perf_data is a pointer to percpu data. */
@@ -243,125 +245,119 @@ static unsigned extract_freq(u32 val, st
}
}
-struct msr_addr {
- u32 reg;
-};
+u32 cpu_freq_read_intel(struct acpi_pct_register *not_used)
+{
+ u32 val, dummy;
-struct io_addr {
- u16 port;
- u8 bit_width;
-};
+ rdmsr(MSR_IA32_PERF_CTL, val, dummy);
+ return val;
+}
+
+void cpu_freq_write_intel(struct acpi_pct_register *not_used, u32 val)
+{
+ u32 lo, hi;
+
+ rdmsr(MSR_IA32_PERF_CTL, lo, hi);
+ lo = (lo & ~INTEL_MSR_RANGE) | (val & INTEL_MSR_RANGE);
+ wrmsr(MSR_IA32_PERF_CTL, lo, hi);
+}
+
+u32 cpu_freq_read_amd(struct acpi_pct_register *not_used)
+{
+ u32 val, dummy;
+
+ rdmsr(MSR_AMD_PERF_CTL, val, dummy);
+ return val;
+}
+
+void cpu_freq_write_amd(struct acpi_pct_register *not_used, u32 val)
+{
+ wrmsr(MSR_AMD_PERF_CTL, val, 0);
+}
+
+u32 cpu_freq_read_io(struct acpi_pct_register *reg)
+{
+ u32 val;
+
+ acpi_os_read_port(reg->address, &val, reg->bit_width);
+ return val;
+}
+
+void cpu_freq_write_io(struct acpi_pct_register *reg, u32 val)
+{
+ acpi_os_write_port(reg->address, val, reg->bit_width);
+}
struct drv_cmd {
- unsigned int type;
- const struct cpumask *mask;
- union {
- struct msr_addr msr;
- struct io_addr io;
- } addr;
+ struct acpi_pct_register *reg;
u32 val;
+ union {
+ void (*write)(struct acpi_pct_register *reg, u32 val);
+ u32 (*read)(struct acpi_pct_register *reg);
+ } func;
};
/* Called via smp_call_function_single(), on the target CPU */
static void do_drv_read(void *_cmd)
{
struct drv_cmd *cmd = _cmd;
- u32 h;
- switch (cmd->type) {
- case SYSTEM_INTEL_MSR_CAPABLE:
- case SYSTEM_AMD_MSR_CAPABLE:
- rdmsr(cmd->addr.msr.reg, cmd->val, h);
- break;
- case SYSTEM_IO_CAPABLE:
- acpi_os_read_port((acpi_io_address)cmd->addr.io.port,
- &cmd->val,
- (u32)cmd->addr.io.bit_width);
- break;
- default:
- break;
- }
+ cmd->val = cmd->func.read(cmd->reg);
}
-/* Called via smp_call_function_many(), on the target CPUs */
-static void do_drv_write(void *_cmd)
+static u32 drv_read(struct acpi_cpufreq_data *data, const struct cpumask *mask)
{
- struct drv_cmd *cmd = _cmd;
- u32 lo, hi;
+ struct acpi_processor_performance *perf = to_perf_data(data);
+ struct drv_cmd cmd = {
+ .reg = &perf->control_register,
+ .func.read = data->cpu_freq_read,
+ };
+ int err;
- switch (cmd->type) {
- case SYSTEM_INTEL_MSR_CAPABLE:
- rdmsr(cmd->addr.msr.reg, lo, hi);
- lo = (lo & ~INTEL_MSR_RANGE) | (cmd->val & INTEL_MSR_RANGE);
- wrmsr(cmd->addr.msr.reg, lo, hi);
- break;
- case SYSTEM_AMD_MSR_CAPABLE:
- wrmsr(cmd->addr.msr.reg, cmd->val, 0);
- break;
- case SYSTEM_IO_CAPABLE:
- acpi_os_write_port((acpi_io_address)cmd->addr.io.port,
- cmd->val,
- (u32)cmd->addr.io.bit_width);
- break;
- default:
- break;
- }
+ err = smp_call_function_any(mask, do_drv_read, &cmd, 1);
+ WARN_ON_ONCE(err); /* smp_call_function_any() was buggy? */
+ return cmd.val;
}
-static void drv_read(struct drv_cmd *cmd)
+/* Called via smp_call_function_many(), on the target CPUs */
+static void do_drv_write(void *_cmd)
{
- int err;
- cmd->val = 0;
+ struct drv_cmd *cmd = _cmd;
- err = smp_call_function_any(cmd->mask, do_drv_read, cmd, 1);
- WARN_ON_ONCE(err); /* smp_call_function_any() was buggy? */
+ cmd->func.write(cmd->reg, cmd->val);
}
-static void drv_write(struct drv_cmd *cmd)
+static void drv_write(struct acpi_cpufreq_data *data,
+ const struct cpumask *mask, u32 val)
{
+ struct acpi_processor_performance *perf = to_perf_data(data);
+ struct drv_cmd cmd = {
+ .reg = &perf->control_register,
+ .val = val,
+ .func.write = data->cpu_freq_write,
+ };
int this_cpu;
this_cpu = get_cpu();
- if (cpumask_test_cpu(this_cpu, cmd->mask))
- do_drv_write(cmd);
- smp_call_function_many(cmd->mask, do_drv_write, cmd, 1);
+ if (cpumask_test_cpu(this_cpu, mask))
+ do_drv_write(&cmd);
+
+ smp_call_function_many(mask, do_drv_write, &cmd, 1);
put_cpu();
}
-static u32
-get_cur_val(const struct cpumask *mask, struct acpi_cpufreq_data *data)
+static u32 get_cur_val(const struct cpumask *mask, struct acpi_cpufreq_data *data)
{
- struct acpi_processor_performance *perf;
- struct drv_cmd cmd;
+ u32 val;
if (unlikely(cpumask_empty(mask)))
return 0;
- switch (data->cpu_feature) {
- case SYSTEM_INTEL_MSR_CAPABLE:
- cmd.type = SYSTEM_INTEL_MSR_CAPABLE;
- cmd.addr.msr.reg = MSR_IA32_PERF_CTL;
- break;
- case SYSTEM_AMD_MSR_CAPABLE:
- cmd.type = SYSTEM_AMD_MSR_CAPABLE;
- cmd.addr.msr.reg = MSR_AMD_PERF_CTL;
- break;
- case SYSTEM_IO_CAPABLE:
- cmd.type = SYSTEM_IO_CAPABLE;
- perf = to_perf_data(data);
- cmd.addr.io.port = perf->control_register.address;
- cmd.addr.io.bit_width = perf->control_register.bit_width;
- break;
- default:
- return 0;
- }
-
- cmd.mask = mask;
- drv_read(&cmd);
+ val = drv_read(data, mask);
- pr_debug("get_cur_val = %u\n", cmd.val);
+ pr_debug("get_cur_val = %u\n", val);
- return cmd.val;
+ return val;
}
static unsigned int get_cur_freq_on_cpu(unsigned int cpu)
@@ -416,7 +412,7 @@ static int acpi_cpufreq_target(struct cp
{
struct acpi_cpufreq_data *data = policy->driver_data;
struct acpi_processor_performance *perf;
- struct drv_cmd cmd;
+ const struct cpumask *mask;
unsigned int next_perf_state = 0; /* Index into perf table */
int result = 0;
@@ -438,37 +434,17 @@ static int acpi_cpufreq_target(struct cp
}
}
- switch (data->cpu_feature) {
- case SYSTEM_INTEL_MSR_CAPABLE:
- cmd.type = SYSTEM_INTEL_MSR_CAPABLE;
- cmd.addr.msr.reg = MSR_IA32_PERF_CTL;
- cmd.val = (u32) perf->states[next_perf_state].control;
- break;
- case SYSTEM_AMD_MSR_CAPABLE:
- cmd.type = SYSTEM_AMD_MSR_CAPABLE;
- cmd.addr.msr.reg = MSR_AMD_PERF_CTL;
- cmd.val = (u32) perf->states[next_perf_state].control;
- break;
- case SYSTEM_IO_CAPABLE:
- cmd.type = SYSTEM_IO_CAPABLE;
- cmd.addr.io.port = perf->control_register.address;
- cmd.addr.io.bit_width = perf->control_register.bit_width;
- cmd.val = (u32) perf->states[next_perf_state].control;
- break;
- default:
- return -ENODEV;
- }
-
- /* cpufreq holds the hotplug lock, so we are safe from here on */
- if (policy->shared_type != CPUFREQ_SHARED_TYPE_ANY)
- cmd.mask = policy->cpus;
- else
- cmd.mask = cpumask_of(policy->cpu);
+ /*
+ * The core won't allow CPUs to go away until the governor has been
+ * stopped, so we can rely on the stability of policy->cpus.
+ */
+ mask = policy->shared_type == CPUFREQ_SHARED_TYPE_ANY ?
+ cpumask_of(policy->cpu) : policy->cpus;
- drv_write(&cmd);
+ drv_write(data, mask, perf->states[next_perf_state].control);
if (acpi_pstate_strict) {
- if (!check_freqs(cmd.mask, data->freq_table[index].frequency,
+ if (!check_freqs(mask, data->freq_table[index].frequency,
data)) {
pr_debug("acpi_cpufreq_target failed (%d)\n",
policy->cpu);
@@ -738,15 +714,21 @@ static int acpi_cpufreq_cpu_init(struct
}
pr_debug("SYSTEM IO addr space\n");
data->cpu_feature = SYSTEM_IO_CAPABLE;
+ data->cpu_freq_read = cpu_freq_read_io;
+ data->cpu_freq_write = cpu_freq_write_io;
break;
case ACPI_ADR_SPACE_FIXED_HARDWARE:
pr_debug("HARDWARE addr space\n");
if (check_est_cpu(cpu)) {
data->cpu_feature = SYSTEM_INTEL_MSR_CAPABLE;
+ data->cpu_freq_read = cpu_freq_read_intel;
+ data->cpu_freq_write = cpu_freq_write_intel;
break;
}
if (check_amd_hwpstate_cpu(cpu)) {
data->cpu_feature = SYSTEM_AMD_MSR_CAPABLE;
+ data->cpu_freq_read = cpu_freq_read_amd;
+ data->cpu_freq_write = cpu_freq_write_amd;
break;
}
result = -ENODEV;
From: Rafael J. Wysocki <[email protected]>
Use the observation that cpufreq_update_util() is only called
by the scheduler with rq->lock held, so the callers of
cpufreq_set_update_util_data() can use synchronize_sched()
instead of synchronize_rcu() to wait for cpufreq_update_util()
to complete. Moreover, if they are updated to do that,
rcu_read_(un)lock() calls in cpufreq_update_util() might be
replaced with rcu_read_(un)lock_sched(), respectively, but
those aren't really necessary, because the scheduler calls
that function from RCU-sched read-side critical sections
already.
In addition to that, if cpufreq_set_update_util_data() checks
the func field in the struct update_util_data before setting
the per-CPU pointer to it, the data->func check may be dropped
from cpufreq_update_util() as well.
Make the above changes to reduce the overhead from
cpufreq_update_util() in the scheduler paths invoking it
and to make the cleanup after removing its callbacks somewhat
less heavy-weight.
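The caller-side pattern this relies on is as follows (a sketch; "data"
here stands for whatever structure embeds the update_util_data, as in
the governor and intel_pstate hunks below):

	/* Remove the per-CPU hook ... */
	cpufreq_set_update_util_data(cpu, NULL);
	/*
	 * ... and wait for any cpufreq_update_util() already running on
	 * that CPU (it is always called with rq->lock held, that is in an
	 * RCU-sched read-side critical section) to complete before the
	 * data it may be using is freed.
	 */
	synchronize_sched();
	kfree(data);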
Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---
Changes from the previous version:
- Use rcu_dereference_sched() in cpufreq_update_util().
---
drivers/cpufreq/cpufreq.c | 25 +++++++++++++++++--------
drivers/cpufreq/cpufreq_governor.c | 2 +-
drivers/cpufreq/intel_pstate.c | 4 ++--
3 files changed, 20 insertions(+), 11 deletions(-)
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -77,12 +77,15 @@ static DEFINE_PER_CPU(struct update_util
* to call from cpufreq_update_util(). That function will be called from an RCU
* read-side critical section, so it must not sleep.
*
- * Callers must use RCU callbacks to free any memory that might be accessed
- * via the old update_util_data pointer or invoke synchronize_rcu() right after
- * this function to avoid use-after-free.
+ * Callers must use RCU-sched callbacks to free any memory that might be
+ * accessed via the old update_util_data pointer or invoke synchronize_sched()
+ * right after this function to avoid use-after-free.
*/
void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
{
+ if (WARN_ON(data && !data->func))
+ return;
+
rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
}
EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
@@ -95,18 +98,24 @@ EXPORT_SYMBOL_GPL(cpufreq_set_update_uti
*
* This function is called by the scheduler on every invocation of
* update_load_avg() on the CPU whose utilization is being updated.
+ *
+ * It can only be called from RCU-sched read-side critical sections.
*/
void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
{
struct update_util_data *data;
- rcu_read_lock();
+#ifdef CONFIG_LOCKDEP
+ WARN_ON(debug_locks && !rcu_read_lock_sched_held());
+#endif
- data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
- if (data && data->func)
+ data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
+ /*
+ * If this isn't inside of an RCU-sched read-side critical section, data
+ * may become NULL after the check below.
+ */
+ if (data)
data->func(data, time, util, max);
-
- rcu_read_unlock();
}
/* Flag to suspend/resume CPUFreq governors */
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -280,7 +280,7 @@ static inline void gov_clear_update_util
for_each_cpu(i, policy->cpus)
cpufreq_set_update_util_data(i, NULL);
- synchronize_rcu();
+ synchronize_sched();
}
static void gov_cancel_work(struct cpufreq_policy *policy)
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -1174,7 +1174,7 @@ static void intel_pstate_stop_cpu(struct
pr_debug("intel_pstate: CPU %d exiting\n", cpu_num);
cpufreq_set_update_util_data(cpu_num, NULL);
- synchronize_rcu();
+ synchronize_sched();
if (hwp_active)
return;
@@ -1442,7 +1442,7 @@ out:
for_each_online_cpu(cpu) {
if (all_cpu_data[cpu]) {
cpufreq_set_update_util_data(cpu, NULL);
- synchronize_rcu();
+ synchronize_sched();
kfree(all_cpu_data[cpu]);
}
}
From: Rafael J. Wysocki <[email protected]>
Modify the ACPI cpufreq driver to provide a method for switching
CPU frequencies from interrupt context and update the cpufreq core
to support that method if available.
Introduce a new cpufreq driver callback, ->fast_switch, to be
invoked for frequency switching from interrupt context via a
new helper function, cpufreq_driver_fast_switch(). Add a new
policy flag, fast_switch_possible, to be set if fast frequency
switching can be used for the given policy.
Implement the ->fast_switch callback in the ACPI cpufreq driver
and make it set fast_switch_possible during policy initialization
as appropriate.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
Changes from the previous version:
- Drop a bogus check from cpufreq_driver_fast_switch().
---
drivers/cpufreq/acpi-cpufreq.c | 53 +++++++++++++++++++++++++++++++++++++++++
drivers/cpufreq/cpufreq.c | 30 +++++++++++++++++++++++
include/linux/cpufreq.h | 6 ++++
3 files changed, 89 insertions(+)
Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
+++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
@@ -458,6 +458,55 @@ static int acpi_cpufreq_target(struct cp
return result;
}
+unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq,
+ unsigned int relation)
+{
+ struct acpi_cpufreq_data *data = policy->driver_data;
+ struct acpi_processor_performance *perf;
+ struct cpufreq_frequency_table *entry, *found;
+ unsigned int next_perf_state, next_freq, freq;
+
+ /*
+ * Find the closest frequency above target_freq or equal to it.
+ *
+ * The table is sorted in the reverse order with respect to the
+ * frequency and all of the entries are valid (see the initialization).
+ */
+ entry = data->freq_table;
+ do {
+ entry++;
+ freq = entry->frequency;
+ } while (freq >= target_freq && freq != CPUFREQ_TABLE_END);
+ found = entry - 1;
+ /*
+ * Use the one found or the previous one, depending on the relation.
+ * CPUFREQ_RELATION_H is not taken into account here, but it is not
+ * expected to be passed to this function anyway.
+ */
+ next_freq = found->frequency;
+ if (freq == CPUFREQ_TABLE_END || relation != CPUFREQ_RELATION_C ||
+ target_freq - freq >= next_freq - target_freq) {
+ next_perf_state = found->driver_data;
+ } else {
+ next_freq = freq;
+ next_perf_state = entry->driver_data;
+ }
+
+ perf = to_perf_data(data);
+ if (perf->state == next_perf_state) {
+ if (unlikely(data->resume))
+ data->resume = 0;
+ else
+ return next_freq;
+ }
+
+ data->cpu_freq_write(&perf->control_register,
+ perf->states[next_perf_state].control);
+ perf->state = next_perf_state;
+ return next_freq;
+}
+
static unsigned long
acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu)
{
@@ -740,6 +789,9 @@ static int acpi_cpufreq_cpu_init(struct
goto err_unreg;
}
+ policy->fast_switch_possible = !acpi_pstate_strict &&
+ !(policy_is_shared(policy) && policy->shared_type != CPUFREQ_SHARED_TYPE_ANY);
+
data->freq_table = kzalloc(sizeof(*data->freq_table) *
(perf->state_count+1), GFP_KERNEL);
if (!data->freq_table) {
@@ -874,6 +926,7 @@ static struct freq_attr *acpi_cpufreq_at
static struct cpufreq_driver acpi_cpufreq_driver = {
.verify = cpufreq_generic_frequency_table_verify,
.target_index = acpi_cpufreq_target,
+ .fast_switch = acpi_cpufreq_fast_switch,
.bios_limit = acpi_processor_get_bios_limit,
.init = acpi_cpufreq_cpu_init,
.exit = acpi_cpufreq_cpu_exit,
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -1719,6 +1719,36 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie
* GOVERNORS *
*********************************************************************/
+/**
+ * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch.
+ * @policy: cpufreq policy to switch the frequency for.
+ * @target_freq: New frequency to set (may be approximate).
+ * @relation: Relation to use for frequency selection.
+ *
+ * Carry out a fast frequency switch from interrupt context.
+ *
+ * This function must not be called if policy->fast_switch_possible is unset.
+ *
+ * Governors calling this function must guarantee that it will never be invoked
+ * twice in parallel for the same policy and that it will never be called in
+ * parallel with either ->target() or ->target_index() for the same policy.
+ *
+ * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch()
+ * callback, the hardware configuration must be preserved.
+ */
+void cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq, unsigned int relation)
+{
+ unsigned int freq;
+
+ freq = cpufreq_driver->fast_switch(policy, target_freq, relation);
+ if (freq != CPUFREQ_ENTRY_INVALID) {
+ policy->cur = freq;
+ trace_cpu_frequency(freq, smp_processor_id());
+ }
+}
+EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch);
+
/* Must set freqs->new to intermediate frequency */
static int __target_intermediate(struct cpufreq_policy *policy,
struct cpufreq_freqs *freqs, int index)
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -81,6 +81,7 @@ struct cpufreq_policy {
struct cpufreq_governor *governor; /* see below */
void *governor_data;
char last_governor[CPUFREQ_NAME_LEN]; /* last governor used */
+ bool fast_switch_possible;
struct work_struct update; /* if update_policy() needs to be
* called, but you're in IRQ context */
@@ -236,6 +237,9 @@ struct cpufreq_driver {
unsigned int relation); /* Deprecated */
int (*target_index)(struct cpufreq_policy *policy,
unsigned int index);
+ unsigned int (*fast_switch)(struct cpufreq_policy *policy,
+ unsigned int target_freq,
+ unsigned int relation);
/*
* Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION
* unset.
@@ -450,6 +454,8 @@ struct cpufreq_governor {
};
/* Pass a target to the cpufreq driver */
+void cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq, unsigned int relation);
int cpufreq_driver_target(struct cpufreq_policy *policy,
unsigned int target_freq,
unsigned int relation);
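For illustration only (not part of the patch), a sketch of how a governor might use the new callback from its scheduler-side update path; the helper name and the fallback are assumptions:

#include <linux/cpufreq.h>

/* Called from scheduler/interrupt context for this policy only. */
static void my_gov_set_freq(struct cpufreq_policy *policy, unsigned int next_freq)
{
	if (policy->fast_switch_possible) {
		/*
		 * The governor guarantees no concurrent invocations for this
		 * policy and no races with ->target()/->target_index().
		 */
		cpufreq_driver_fast_switch(policy, next_freq, CPUFREQ_RELATION_C);
	} else {
		/*
		 * Otherwise defer to process context (e.g. irq_work plus a
		 * work item) and call __cpufreq_driver_target() from there.
		 */
	}
}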
From: Rafael J. Wysocki <[email protected]>
Move definitions and function headers related to struct gov_attr_set
to include/linux/cpufreq.h so they can be used by (future) governors
located outside of drivers/cpufreq/.
No functional changes.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
New patch. Needed to move cpufreq_schedutil.c to kernel/sched/.
---
drivers/cpufreq/cpufreq_governor.h | 21 ---------------------
include/linux/cpufreq.h | 23 +++++++++++++++++++++++
2 files changed, 23 insertions(+), 21 deletions(-)
Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -41,19 +41,6 @@
/* Ondemand Sampling types */
enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};
-struct gov_attr_set {
- struct kobject kobj;
- struct list_head policy_list;
- struct mutex update_lock;
- int usage_count;
-};
-
-extern const struct sysfs_ops governor_sysfs_ops;
-
-void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node);
-void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node);
-unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node);
-
/*
* Abbreviations:
* dbs: used as a shortform for demand based switching It helps to keep variable
@@ -80,14 +67,6 @@ static inline struct dbs_data *to_dbs_da
return container_of(attr_set, struct dbs_data, attr_set);
}
-/* Governor's specific attributes */
-struct governor_attr {
- struct attribute attr;
- ssize_t (*show)(struct gov_attr_set *attr_set, char *buf);
- ssize_t (*store)(struct gov_attr_set *attr_set, const char *buf,
- size_t count);
-};
-
#define gov_show_one(_gov, file_name) \
static ssize_t show_##file_name \
(struct gov_attr_set *attr_set, char *buf) \
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -462,6 +462,29 @@ void cpufreq_unregister_governor(struct
struct cpufreq_governor *cpufreq_default_governor(void);
struct cpufreq_governor *cpufreq_fallback_governor(void);
+/* Governor attribute set */
+struct gov_attr_set {
+ struct kobject kobj;
+ struct list_head policy_list;
+ struct mutex update_lock;
+ int usage_count;
+};
+
+/* sysfs ops for cpufreq governors */
+extern const struct sysfs_ops governor_sysfs_ops;
+
+void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node);
+void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node);
+unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node);
+
+/* Governor sysfs attribute */
+struct governor_attr {
+ struct attribute attr;
+ ssize_t (*show)(struct gov_attr_set *attr_set, char *buf);
+ ssize_t (*store)(struct gov_attr_set *attr_set, const char *buf,
+ size_t count);
+};
+
/*********************************************************************
* FREQUENCY TABLE HELPERS *
*********************************************************************/
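For illustration only, here is a minimal sketch of a governor tunable built on the definitions moved above; the my_tunables type and the rate_limit_us field are hypothetical:

#include <linux/cpufreq.h>
#include <linux/kernel.h>
#include <linux/sysfs.h>

struct my_tunables {
	struct gov_attr_set attr_set;
	unsigned int rate_limit_us;
};

static inline struct my_tunables *to_my_tunables(struct gov_attr_set *attr_set)
{
	return container_of(attr_set, struct my_tunables, attr_set);
}

static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
{
	return sprintf(buf, "%u\n", to_my_tunables(attr_set)->rate_limit_us);
}

static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set,
				   const char *buf, size_t count)
{
	unsigned int val;

	if (kstrtouint(buf, 10, &val))
		return -EINVAL;

	to_my_tunables(attr_set)->rate_limit_us = val;
	return count;
}

static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);

A kobj_type whose sysfs_ops is governor_sysfs_ops and whose default attributes include the one above would then be registered for the attr_set's kobject (not shown).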
From: Rafael J. Wysocki <[email protected]>
Move abstract code related to struct gov_attr_set to a separate (new)
file so it can be shared with (future) governors that won't share
more code with "ondemand" and "conservative".
No intentional functional changes.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
Changes from the previous version:
- Different name of the new file.
- Different name of the new Kconfig symbol.
---
drivers/cpufreq/Kconfig | 4 +
drivers/cpufreq/Makefile | 1
drivers/cpufreq/cpufreq_governor.c | 82 ---------------------------
drivers/cpufreq/cpufreq_governor.h | 6 ++
drivers/cpufreq/cpufreq_governor_attr_set.c | 84 ++++++++++++++++++++++++++++
5 files changed, 95 insertions(+), 82 deletions(-)
Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -18,7 +18,11 @@ config CPU_FREQ
if CPU_FREQ
+config CPU_FREQ_GOV_ATTR_SET
+ bool
+
config CPU_FREQ_GOV_COMMON
+ select CPU_FREQ_GOV_ATTR_SET
select IRQ_WORK
bool
Index: linux-pm/drivers/cpufreq/Makefile
===================================================================
--- linux-pm.orig/drivers/cpufreq/Makefile
+++ linux-pm/drivers/cpufreq/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE) +=
obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND) += cpufreq_ondemand.o
obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE) += cpufreq_conservative.o
obj-$(CONFIG_CPU_FREQ_GOV_COMMON) += cpufreq_governor.o
+obj-$(CONFIG_CPU_FREQ_GOV_ATTR_SET) += cpufreq_governor_attr_set.o
obj-$(CONFIG_CPUFREQ_DT) += cpufreq-dt.o
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -112,53 +112,6 @@ void gov_update_cpu_data(struct dbs_data
}
EXPORT_SYMBOL_GPL(gov_update_cpu_data);
-static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj)
-{
- return container_of(kobj, struct gov_attr_set, kobj);
-}
-
-static inline struct governor_attr *to_gov_attr(struct attribute *attr)
-{
- return container_of(attr, struct governor_attr, attr);
-}
-
-static ssize_t governor_show(struct kobject *kobj, struct attribute *attr,
- char *buf)
-{
- struct governor_attr *gattr = to_gov_attr(attr);
-
- return gattr->show(to_gov_attr_set(kobj), buf);
-}
-
-static ssize_t governor_store(struct kobject *kobj, struct attribute *attr,
- const char *buf, size_t count)
-{
- struct gov_attr_set *attr_set = to_gov_attr_set(kobj);
- struct governor_attr *gattr = to_gov_attr(attr);
- int ret = -EBUSY;
-
- mutex_lock(&attr_set->update_lock);
-
- if (attr_set->usage_count)
- ret = gattr->store(attr_set, buf, count);
-
- mutex_unlock(&attr_set->update_lock);
-
- return ret;
-}
-
-/*
- * Sysfs Ops for accessing governor attributes.
- *
- * All show/store invocations for governor specific sysfs attributes, will first
- * call the below show/store callbacks and the attribute specific callback will
- * be called from within it.
- */
-static const struct sysfs_ops governor_sysfs_ops = {
- .show = governor_show,
- .store = governor_store,
-};
-
unsigned int dbs_update(struct cpufreq_policy *policy)
{
struct policy_dbs_info *policy_dbs = policy->governor_data;
@@ -424,41 +377,6 @@ static void free_policy_dbs_info(struct
gov->free(policy_dbs);
}
-static void gov_attr_set_init(struct gov_attr_set *attr_set,
- struct list_head *list_node)
-{
- INIT_LIST_HEAD(&attr_set->policy_list);
- mutex_init(&attr_set->update_lock);
- attr_set->usage_count = 1;
- list_add(list_node, &attr_set->policy_list);
-}
-
-static void gov_attr_set_get(struct gov_attr_set *attr_set,
- struct list_head *list_node)
-{
- mutex_lock(&attr_set->update_lock);
- attr_set->usage_count++;
- list_add(list_node, &attr_set->policy_list);
- mutex_unlock(&attr_set->update_lock);
-}
-
-static unsigned int gov_attr_set_put(struct gov_attr_set *attr_set,
- struct list_head *list_node)
-{
- unsigned int count;
-
- mutex_lock(&attr_set->update_lock);
- list_del(list_node);
- count = --attr_set->usage_count;
- mutex_unlock(&attr_set->update_lock);
- if (count)
- return count;
-
- kobject_put(&attr_set->kobj);
- mutex_destroy(&attr_set->update_lock);
- return 0;
-}
-
static int cpufreq_governor_init(struct cpufreq_policy *policy)
{
struct dbs_governor *gov = dbs_governor_of(policy);
Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -48,6 +48,12 @@ struct gov_attr_set {
int usage_count;
};
+extern const struct sysfs_ops governor_sysfs_ops;
+
+void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node);
+void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node);
+unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node);
+
/*
* Abbreviations:
* dbs: used as a shortform for demand based switching It helps to keep variable
Index: linux-pm/drivers/cpufreq/cpufreq_governor_attr_set.c
===================================================================
--- /dev/null
+++ linux-pm/drivers/cpufreq/cpufreq_governor_attr_set.c
@@ -0,0 +1,84 @@
+/*
+ * Abstract code for CPUFreq governor tunable sysfs attributes.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "cpufreq_governor.h"
+
+static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj)
+{
+ return container_of(kobj, struct gov_attr_set, kobj);
+}
+
+static inline struct governor_attr *to_gov_attr(struct attribute *attr)
+{
+ return container_of(attr, struct governor_attr, attr);
+}
+
+static ssize_t governor_show(struct kobject *kobj, struct attribute *attr,
+ char *buf)
+{
+ struct governor_attr *gattr = to_gov_attr(attr);
+
+ return gattr->show(to_gov_attr_set(kobj), buf);
+}
+
+static ssize_t governor_store(struct kobject *kobj, struct attribute *attr,
+ const char *buf, size_t count)
+{
+ struct gov_attr_set *attr_set = to_gov_attr_set(kobj);
+ struct governor_attr *gattr = to_gov_attr(attr);
+ int ret;
+
+ mutex_lock(&attr_set->update_lock);
+ ret = attr_set->usage_count ? gattr->store(attr_set, buf, count) : -EBUSY;
+ mutex_unlock(&attr_set->update_lock);
+ return ret;
+}
+
+const struct sysfs_ops governor_sysfs_ops = {
+ .show = governor_show,
+ .store = governor_store,
+};
+EXPORT_SYMBOL_GPL(governor_sysfs_ops);
+
+void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node)
+{
+ INIT_LIST_HEAD(&attr_set->policy_list);
+ mutex_init(&attr_set->update_lock);
+ attr_set->usage_count = 1;
+ list_add(list_node, &attr_set->policy_list);
+}
+EXPORT_SYMBOL_GPL(gov_attr_set_init);
+
+void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node)
+{
+ mutex_lock(&attr_set->update_lock);
+ attr_set->usage_count++;
+ list_add(list_node, &attr_set->policy_list);
+ mutex_unlock(&attr_set->update_lock);
+}
+EXPORT_SYMBOL_GPL(gov_attr_set_get);
+
+unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node)
+{
+ unsigned int count;
+
+ mutex_lock(&attr_set->update_lock);
+ list_del(list_node);
+ count = --attr_set->usage_count;
+ mutex_unlock(&attr_set->update_lock);
+ if (count)
+ return count;
+
+ kobject_put(&attr_set->kobj);
+ mutex_destroy(&attr_set->update_lock);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(gov_attr_set_put);
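For reference, a rough sketch (not a posted patch) of the lifecycle these helpers are meant to support, with one tunables object possibly shared by several policies; all names are illustrative:

#include <linux/cpufreq.h>
#include <linux/slab.h>

struct my_tunables {
	struct gov_attr_set attr_set;
	/* tunable fields ... */
};

struct my_policy_info {
	struct my_tunables *tunables;
	struct list_head tunables_hook;	/* node on attr_set->policy_list */
};

static int my_gov_init(struct my_policy_info *pi, struct my_tunables *shared)
{
	if (shared) {
		/* Another policy already owns the tunables: just join it. */
		gov_attr_set_get(&shared->attr_set, &pi->tunables_hook);
		pi->tunables = shared;
		return 0;
	}

	pi->tunables = kzalloc(sizeof(*pi->tunables), GFP_KERNEL);
	if (!pi->tunables)
		return -ENOMEM;

	gov_attr_set_init(&pi->tunables->attr_set, &pi->tunables_hook);
	/*
	 * The attr_set's kobject would now be added with a kobj_type using
	 * governor_sysfs_ops (kobject_init_and_add(), not shown).
	 */
	return 0;
}

static void my_gov_exit(struct my_policy_info *pi)
{
	/* The kobject is put by gov_attr_set_put() when the count hits zero. */
	if (!gov_attr_set_put(&pi->tunables->attr_set, &pi->tunables_hook))
		kfree(pi->tunables);	/* exact ownership is up to the governor */
	pi->tunables = NULL;
}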
On 03-03-16, 20:26, Rafael J. Wysocki wrote:
> So this is a totally bicycle shed discussion argument which makes it
> seriously irritating.
>
> Does it really matter so much how this structure is called?
> Essentially, it is something to build your tunables structure around
> and you can treat it as a counterpart of a C++ abstract class. So the
> name *does* make sense in that context.
:(
I thought you would apply this patch to linux-next *now*, as it was quite
independent, and that is why I made that comment.
> That said, what about gov_attr_set?
Maybe just gov_kobj or whatever you wish.
--
viresh
On 04-03-16, 04:01, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> In addition to fields representing governor tunables, struct dbs_data
> contains some fields needed for the management of objects of that
> type. As it turns out, that part of struct dbs_data may be shared
> with (future) governors that won't use the common code used by
> "ondemand" and "conservative", so move it to a separate struct type
> and modify the code using struct dbs_data to follow.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
>
> Changes from the previous version:
> - The new data type is called gov_attr_set now (instead of gov_tunables)
> and some variable names etc have been changed to follow.
Acked-by: Viresh Kumar <[email protected]>
--
viresh
On 04-03-16, 04:03, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> Move abstract code related to struct gov_attr_set to a separate (new)
> file so it can be shared with (future) governors that won't share
> more code with "ondemand" and "conservative".
>
> No intentional functional changes.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
>
> Changes from the previous version:
> - Different name of the new file.
> - Different name of the new Kconfig symbol.
Acked-by: Viresh Kumar <[email protected]>
--
viresh
On 04-03-16, 04:05, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> Move definitions and function headers related to struct gov_attr_set
> to include/linux/cpufreq.h so they can be used by (future) governors
> located outside of drivers/cpufreq/.
>
> No functional changes.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
>
> New patch. Needed to move cpufreq_schedutil.c to kernel/sched/.
Acked-by: Viresh Kumar <[email protected]>
--
viresh
Hi Rafael,
On 04/03/16 04:18, Rafael J. Wysocki wrote:
[...]
> +/**
> + * cpufreq_update_util - Take a note about CPU utilization changes.
> + * @time: Current time.
> + * @util: CPU utilization.
> + * @max: CPU capacity.
> + *
> + * This function is called on every invocation of update_load_avg() on the CPU
> + * whose utilization is being updated.
> + *
> + * It can only be called from RCU-sched read-side critical sections.
> + */
> +void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
> +{
> + struct freq_update_hook *hook;
> +
> +#ifdef CONFIG_LOCKDEP
> + WARN_ON(debug_locks && !rcu_read_lock_sched_held());
> +#endif
> +
> + hook = rcu_dereference(*this_cpu_ptr(&cpufreq_freq_update_hook));
Small fix. You forgot to change this to rcu_dereference_sched() (you
only fixed that in 01/10).
Best,
- Juri
Hi Rafael,
On 04/03/16 04:35, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> Add a new cpufreq scaling governor, called "schedutil", that uses
> scheduler-provided CPU utilization information as input for making
> its decisions.
>
> Doing that is possible after commit fe7034338ba0 (cpufreq: Add
> mechanism for registering utilization update callbacks) that
> introduced cpufreq_update_util() called by the scheduler on
> utilization changes (from CFS) and RT/DL task status updates.
> In particular, CPU frequency scaling decisions may be based on
> the utilization data passed to cpufreq_update_util() by CFS.
>
> The new governor is relatively simple.
>
> The frequency selection formula used by it is
>
> next_freq = util * max_freq / max
>
> where util and max are the utilization and CPU capacity coming from CFS.
>
The formula looks better to me now. However, the problem is that, if you
have freq. invariance, util will slowly saturate to the current
capacity. So we won't trigger OPP changes for a task that, for example,
starts light and then becomes big.
This is the same problem we faced with schedfreq. The current solution
there is to use a margin for calculating a threshold (80% of current
capacity ATM). Once util goes above that threshold we trigger an OPP
change. The current policy is pretty aggressive: we go to max_f and then
adapt to the "real" util during successive enqueues. This was also
thought to cope with the fact that PELT seems slow to react to abrupt
changes in task behaviour.
I'm not saying this is the definitive solution, but I fear something
along this line is needed when you add freq invariance in the mix.
Best,
- Juri
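For clarity, a rough sketch of the margin idea described above; the 80% figure comes from this discussion and the helper name is made up, none of it is from a posted patch:

static unsigned int pick_freq_with_margin(unsigned int cur_freq,
					  unsigned int max_freq,
					  unsigned long util,
					  unsigned long cur_capacity)
{
	/* Threshold at 80% of the capacity of the current OPP. */
	unsigned long threshold = cur_capacity * 80 / 100;

	if (util > threshold)
		return max_freq;	/* go to max_f, adapt on later enqueues */

	return cur_freq;		/* no OPP change triggered */
}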
On Fri, Mar 4, 2016 at 11:50 AM, Juri Lelli <[email protected]> wrote:
> Hi Rafael,
>
> On 04/03/16 04:18, Rafael J. Wysocki wrote:
>
> [...]
>
>> +/**
>> + * cpufreq_update_util - Take a note about CPU utilization changes.
>> + * @time: Current time.
>> + * @util: CPU utilization.
>> + * @max: CPU capacity.
>> + *
>> + * This function is called on every invocation of update_load_avg() on the CPU
>> + * whose utilization is being updated.
>> + *
>> + * It can only be called from RCU-sched read-side critical sections.
>> + */
>> +void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
>> +{
>> + struct freq_update_hook *hook;
>> +
>> +#ifdef CONFIG_LOCKDEP
>> + WARN_ON(debug_locks && !rcu_read_lock_sched_held());
>> +#endif
>> +
>> + hook = rcu_dereference(*this_cpu_ptr(&cpufreq_freq_update_hook));
>
> Small fix. You forgot to change this to rcu_dereference_sched() (you
> only fixed that in 01/10).
Yup, thanks!
I had to propagate the change throughout the queue and forgot about
the last step.
I'll send an updated patch shortly.
Thanks,
Rafael
On Fri, Mar 4, 2016 at 12:26 PM, Juri Lelli <[email protected]> wrote:
> Hi Rafael,
Hi,
> On 04/03/16 04:35, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <[email protected]>
>>
>> Add a new cpufreq scaling governor, called "schedutil", that uses
>> scheduler-provided CPU utilization information as input for making
>> its decisions.
>>
>> Doing that is possible after commit fe7034338ba0 (cpufreq: Add
>> mechanism for registering utilization update callbacks) that
>> introduced cpufreq_update_util() called by the scheduler on
>> utilization changes (from CFS) and RT/DL task status updates.
>> In particular, CPU frequency scaling decisions may be based on
>> the utilization data passed to cpufreq_update_util() by CFS.
>>
>> The new governor is relatively simple.
>>
>> The frequency selection formula used by it is
>>
>> next_freq = util * max_freq / max
>>
>> where util and max are the utilization and CPU capacity coming from CFS.
>>
>
> The formula looks better to me now. However, problem is that, if you
> have freq. invariance, util will slowly saturate to the current
> capacity. So, we won't trigger OPP changes for a task that for example
> starts light and then becomes big.
>
> This is the same problem we faced with schedfreq. The current solution
> there is to use a margin for calculating a threshold (80% of current
> capacity ATM). Once util goes above that threshold we trigger an OPP
> change. Current policy is pretty aggressive, we go to max_f and then
> adapt to the "real" util during successive enqueues. This was also
>> thought to cope with the fact that PELT seems slow to react to abrupt
> changes in tasks behaviour.
>
> I'm not saying this is the definitive solution, but I fear something
> along this line is needed when you add freq invariance in the mix.
I really would like to avoid adding factors that need to be determined
experimentally, because the result of that tends to depend on the
system where the experiment is carried out and tunables simply don't
work (99% or maybe even more users don't change the defaults anyway).
So I would really like to use a formula that's based on some science
and doesn't depend on additional input.
Now, since the equation generally is f = a * x + b (f - frequency, x =
util/max) and there are good arguments for b = 0, it all boils down to
what number to take as a. a = max_freq is a good candidate (that's
what I'm using right now), but it may turn out to be too small.
Another reasonable candidate is a = min_freq + max_freq, because then
x = 0.5 selects the frequency in the middle of the available range,
but that may turn out to be way too big if min_freq is high (like
higher than 50% of max_freq).
I need to think more about that and admittedly my understanding of the
frequency invariance consequences is limited ATM.
Thanks,
Rafael
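To make the comparison concrete, here is a small stand-alone example with made-up numbers (min_freq = 800 MHz, max_freq = 2 GHz, x = util/max = 0.5):

#include <stdio.h>

int main(void)
{
	unsigned int min_freq = 800000, max_freq = 2000000;	/* kHz */
	unsigned long util = 512, max = 1024;			/* x = 0.5 */

	/* a = max_freq: f = max_freq * x */
	unsigned int f1 = max_freq * util / max;		/* 1000000 kHz */

	/* a = min_freq + max_freq: f = (min_freq + max_freq) * x */
	unsigned int f2 = (min_freq + max_freq) * util / max;	/* 1400000 kHz */

	printf("a = max_freq:            %u kHz\n", f1);
	printf("a = min_freq + max_freq: %u kHz\n", f2);
	return 0;
}

With a = min_freq + max_freq, x = 0.5 indeed lands exactly in the middle of the available range, while a = max_freq gives a lower request for the same utilization.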
From: Rafael J. Wysocki <[email protected]>
A subsequent change set will introduce a new cpufreq governor using
CPU utilization information from the scheduler, so introduce
cpufreq_update_util() (again) to allow that information to be passed to
the new governor and make cpufreq_trigger_update() call it internally.
To that end, add a new ->update_util callback pointer to struct
freq_update_hook to be set by entities that want to use the util
and max arguments and make cpufreq_update_util() use that callback
if available or the ->func callback that only takes the time argument
otherwise.
In addition to that, arrange helpers to set/clear the utilization
update hooks in such a way that the full ->update_util callbacks
can only be set by code inside the kernel/sched/ directory.
Update the current users of cpufreq_set_freq_update_hook() to use
the new helpers.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
Changes from v2:
- Use rcu_dereference_sched() in cpufreq_update_util().
---
drivers/cpufreq/cpufreq_governor.c | 76 +++++++++++++--------------
drivers/cpufreq/intel_pstate.c | 8 +-
include/linux/sched.h | 10 +--
kernel/sched/cpufreq.c | 101 +++++++++++++++++++++++++++++--------
kernel/sched/fair.c | 8 ++
kernel/sched/sched.h | 16 +++++
6 files changed, 150 insertions(+), 69 deletions(-)
Index: linux-pm/include/linux/sched.h
===================================================================
--- linux-pm.orig/include/linux/sched.h
+++ linux-pm/include/linux/sched.h
@@ -2363,15 +2363,15 @@ static inline bool sched_can_stop_tick(v
#endif
#ifdef CONFIG_CPU_FREQ
-void cpufreq_trigger_update(u64 time);
-
struct freq_update_hook {
void (*func)(struct freq_update_hook *hook, u64 time);
+ void (*update_util)(struct freq_update_hook *hook, u64 time,
+ unsigned long util, unsigned long max);
};
-void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook);
-#else
-static inline void cpufreq_trigger_update(u64 time) {}
+void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook,
+ void (*func)(struct freq_update_hook *hook, u64 time));
+void cpufreq_clear_freq_update_hook(int cpu);
#endif
#ifdef CONFIG_SCHED_AUTOGROUP
Index: linux-pm/kernel/sched/cpufreq.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq.c
+++ linux-pm/kernel/sched/cpufreq.c
@@ -9,12 +9,12 @@
* published by the Free Software Foundation.
*/
-#include <linux/sched.h>
+#include "sched.h"
static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook);
/**
- * cpufreq_set_freq_update_hook - Populate the CPU's freq_update_hook pointer.
+ * set_freq_update_hook - Populate the CPU's freq_update_hook pointer.
* @cpu: The CPU to set the pointer for.
* @hook: New pointer value.
*
@@ -27,23 +27,96 @@ static DEFINE_PER_CPU(struct freq_update
* accessed via the old update_util_data pointer or invoke synchronize_sched()
* right after this function to avoid use-after-free.
*/
-void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook)
+static void set_freq_update_hook(int cpu, struct freq_update_hook *hook)
{
- if (WARN_ON(hook && !hook->func))
+ rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook);
+}
+
+/**
+ * cpufreq_set_freq_update_hook - Set the CPU's frequency update callback.
+ * @cpu: The CPU to set the callback for.
+ * @hook: New freq_update_hook pointer value.
+ * @func: Callback function to use with the new hook.
+ */
+void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook,
+ void (*func)(struct freq_update_hook *hook, u64 time))
+{
+ if (WARN_ON(!hook || !func))
return;
- rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook);
+ hook->func = func;
+ set_freq_update_hook(cpu, hook);
}
EXPORT_SYMBOL_GPL(cpufreq_set_freq_update_hook);
/**
+ * cpufreq_set_update_util_hook - Set the CPU's utilization update callback.
+ * @cpu: The CPU to set the callback for.
+ * @hook: New freq_update_hook pointer value.
+ * @update_util: Callback function to use with the new hook.
+ */
+void cpufreq_set_update_util_hook(int cpu, struct freq_update_hook *hook,
+ void (*update_util)(struct freq_update_hook *hook, u64 time,
+ unsigned long util, unsigned long max))
+{
+ if (WARN_ON(!hook || !update_util))
+ return;
+
+ hook->update_util = update_util;
+ set_freq_update_hook(cpu, hook);
+}
+EXPORT_SYMBOL_GPL(cpufreq_set_update_util_hook);
+
+/**
+ * cpufreq_clear_freq_update_hook - Clear the CPU's freq_update_hook pointer.
+ * @cpu: The CPU to clear the pointer for.
+ */
+void cpufreq_clear_freq_update_hook(int cpu)
+{
+ set_freq_update_hook(cpu, NULL);
+}
+EXPORT_SYMBOL_GPL(cpufreq_clear_freq_update_hook);
+
+/**
+ * cpufreq_update_util - Take a note about CPU utilization changes.
+ * @time: Current time.
+ * @util: CPU utilization.
+ * @max: CPU capacity.
+ *
+ * This function is called on every invocation of update_load_avg() on the CPU
+ * whose utilization is being updated.
+ *
+ * It can only be called from RCU-sched read-side critical sections.
+ */
+void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+{
+ struct freq_update_hook *hook;
+
+#ifdef CONFIG_LOCKDEP
+ WARN_ON(debug_locks && !rcu_read_lock_sched_held());
+#endif
+
+ hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook));
+ /*
+ * If this isn't inside of an RCU-sched read-side critical section, hook
+ * may become NULL after the check below.
+ */
+ if (hook) {
+ if (hook->update_util)
+ hook->update_util(hook, time, util, max);
+ else
+ hook->func(hook, time);
+ }
+}
+
+/**
* cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
* @time: Current time.
*
* The way cpufreq is currently arranged requires it to evaluate the CPU
* performance state (frequency/voltage) on a regular basis. To facilitate
- * that, this function is called by update_load_avg() in CFS when executed for
- * the current CPU's runqueue.
+ * that, cpufreq_update_util() is called by update_load_avg() in CFS when
+ * executed for the current CPU's runqueue.
*
* However, this isn't sufficient to prevent the CPU from being stuck in a
* completely inadequate performance level for too long, because the calls
@@ -57,17 +130,5 @@ EXPORT_SYMBOL_GPL(cpufreq_set_freq_updat
*/
void cpufreq_trigger_update(u64 time)
{
- struct freq_update_hook *hook;
-
-#ifdef CONFIG_LOCKDEP
- WARN_ON(debug_locks && !rcu_read_lock_sched_held());
-#endif
-
- hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook));
- /*
- * If this isn't inside of an RCU-sched read-side critical section, hook
- * may become NULL after the check below.
- */
- if (hook)
- hook->func(hook, time);
+ cpufreq_update_util(time, ULONG_MAX, 0);
}
Index: linux-pm/kernel/sched/fair.c
===================================================================
--- linux-pm.orig/kernel/sched/fair.c
+++ linux-pm/kernel/sched/fair.c
@@ -2839,6 +2839,8 @@ static inline void update_load_avg(struc
update_tg_load_avg(cfs_rq, 0);
if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
+ unsigned long max = rq->cpu_capacity_orig;
+
/*
* There are a few boundary cases this might miss but it should
* get called often enough that that should (hopefully) not be
@@ -2847,9 +2849,11 @@ static inline void update_load_avg(struc
* the next tick/schedule should update.
*
* It will not get called when we go idle, because the idle
- * thread is a different class (!fair).
+ * thread is a different class (!fair), nor will the utilization
+ * number include things like RT tasks.
*/
- cpufreq_trigger_update(rq_clock(rq));
+ cpufreq_update_util(rq_clock(rq),
+ min(cfs_rq->avg.util_avg, max), max);
}
}
Index: linux-pm/kernel/sched/sched.h
===================================================================
--- linux-pm.orig/kernel/sched/sched.h
+++ linux-pm/kernel/sched/sched.h
@@ -1739,3 +1739,19 @@ static inline u64 irq_time_read(int cpu)
}
#endif /* CONFIG_64BIT */
#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
+#ifdef CONFIG_CPU_FREQ
+void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
+void cpufreq_trigger_update(u64 time);
+void cpufreq_set_update_util_hook(int cpu, struct freq_update_hook *hook,
+ void (*update_util)(struct freq_update_hook *hook, u64 time,
+ unsigned long util, unsigned long max));
+static inline void cpufreq_clear_update_util_hook(int cpu)
+{
+ cpufreq_clear_freq_update_hook(cpu);
+}
+#else
+static inline void cpufreq_update_util(u64 time, unsigned long util,
+ unsigned long max) {}
+static inline void cpufreq_trigger_update(u64 time) {}
+#endif /* CONFIG_CPU_FREQ */
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -1088,8 +1088,8 @@ static int intel_pstate_init_cpu(unsigne
intel_pstate_busy_pid_reset(cpu);
intel_pstate_sample(cpu, 0);
- cpu->update_hook.func = intel_pstate_freq_update;
- cpufreq_set_freq_update_hook(cpunum, &cpu->update_hook);
+ cpufreq_set_freq_update_hook(cpunum, &cpu->update_hook,
+ intel_pstate_freq_update);
pr_debug("intel_pstate: controlling: cpu %d\n", cpunum);
@@ -1173,7 +1173,7 @@ static void intel_pstate_stop_cpu(struct
pr_debug("intel_pstate: CPU %d exiting\n", cpu_num);
- cpufreq_set_freq_update_hook(cpu_num, NULL);
+ cpufreq_clear_freq_update_hook(cpu_num);
synchronize_sched();
if (hwp_active)
@@ -1441,7 +1441,7 @@ out:
get_online_cpus();
for_each_online_cpu(cpu) {
if (all_cpu_data[cpu]) {
- cpufreq_set_freq_update_hook(cpu, NULL);
+ cpufreq_clear_freq_update_hook(cpu);
synchronize_sched();
kfree(all_cpu_data[cpu]);
}
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -211,43 +211,6 @@ unsigned int dbs_update(struct cpufreq_p
}
EXPORT_SYMBOL_GPL(dbs_update);
-static void gov_set_freq_update_hooks(struct policy_dbs_info *policy_dbs,
- unsigned int delay_us)
-{
- struct cpufreq_policy *policy = policy_dbs->policy;
- int cpu;
-
- gov_update_sample_delay(policy_dbs, delay_us);
- policy_dbs->last_sample_time = 0;
-
- for_each_cpu(cpu, policy->cpus) {
- struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu);
-
- cpufreq_set_freq_update_hook(cpu, &cdbs->update_hook);
- }
-}
-
-static inline void gov_clear_freq_update_hooks(struct cpufreq_policy *policy)
-{
- int i;
-
- for_each_cpu(i, policy->cpus)
- cpufreq_set_freq_update_hook(i, NULL);
-
- synchronize_sched();
-}
-
-static void gov_cancel_work(struct cpufreq_policy *policy)
-{
- struct policy_dbs_info *policy_dbs = policy->governor_data;
-
- gov_clear_freq_update_hooks(policy_dbs->policy);
- irq_work_sync(&policy_dbs->irq_work);
- cancel_work_sync(&policy_dbs->work);
- atomic_set(&policy_dbs->work_count, 0);
- policy_dbs->work_in_progress = false;
-}
-
static void dbs_work_handler(struct work_struct *work)
{
struct policy_dbs_info *policy_dbs;
@@ -334,6 +297,44 @@ static void dbs_freq_update_handler(stru
irq_work_queue(&policy_dbs->irq_work);
}
+static void gov_set_freq_update_hooks(struct policy_dbs_info *policy_dbs,
+ unsigned int delay_us)
+{
+ struct cpufreq_policy *policy = policy_dbs->policy;
+ int cpu;
+
+ gov_update_sample_delay(policy_dbs, delay_us);
+ policy_dbs->last_sample_time = 0;
+
+ for_each_cpu(cpu, policy->cpus) {
+ struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu);
+
+ cpufreq_set_freq_update_hook(cpu, &cdbs->update_hook,
+ dbs_freq_update_handler);
+ }
+}
+
+static inline void gov_clear_freq_update_hooks(struct cpufreq_policy *policy)
+{
+ int i;
+
+ for_each_cpu(i, policy->cpus)
+ cpufreq_clear_freq_update_hook(i);
+
+ synchronize_sched();
+}
+
+static void gov_cancel_work(struct cpufreq_policy *policy)
+{
+ struct policy_dbs_info *policy_dbs = policy->governor_data;
+
+ gov_clear_freq_update_hooks(policy_dbs->policy);
+ irq_work_sync(&policy_dbs->irq_work);
+ cancel_work_sync(&policy_dbs->work);
+ atomic_set(&policy_dbs->work_count, 0);
+ policy_dbs->work_in_progress = false;
+}
+
static struct policy_dbs_info *alloc_policy_dbs_info(struct cpufreq_policy *policy,
struct dbs_governor *gov)
{
@@ -356,7 +357,6 @@ static struct policy_dbs_info *alloc_pol
struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);
j_cdbs->policy_dbs = policy_dbs;
- j_cdbs->update_hook.func = dbs_freq_update_handler;
}
return policy_dbs;
}
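For illustration only (not part of the patch), a sketch of how a governor living under kernel/sched/ would use the full utilization hook added here; the my_gov_* names are assumptions:

#include "sched.h"	/* cpufreq_set_update_util_hook() and friends */

struct my_gov_cpu {
	struct freq_update_hook hook;
	/* per-CPU governor state ... */
};

static DEFINE_PER_CPU(struct my_gov_cpu, my_gov_cpu_data);

/* Runs from the scheduler, in an RCU-sched read-side section; must not sleep. */
static void my_gov_update_util(struct freq_update_hook *hook, u64 time,
			       unsigned long util, unsigned long max)
{
	struct my_gov_cpu *gc = container_of(hook, struct my_gov_cpu, hook);

	/* Use util/max to pick and request a frequency here. */
}

static void my_gov_start(int cpu)
{
	cpufreq_set_update_util_hook(cpu, &per_cpu(my_gov_cpu_data, cpu).hook,
				     my_gov_update_util);
}

static void my_gov_stop(int cpu)
{
	cpufreq_clear_update_util_hook(cpu);
	synchronize_sched();	/* wait for in-flight callbacks */
}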
On Fri, 2016-03-04 at 11:26 +0000, Juri Lelli wrote:
> Hi Rafael,
>
> On 04/03/16 04:35, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <[email protected]>
> >
> > Add a new cpufreq scaling governor, called "schedutil", that uses
> > scheduler-provided CPU utilization information as input for making
> > its decisions.
> >
> > Doing that is possible after commit fe7034338ba0 (cpufreq: Add
> > mechanism for registering utilization update callbacks) that
> > introduced cpufreq_update_util() called by the scheduler on
> > utilization changes (from CFS) and RT/DL task status updates.
> > In particular, CPU frequency scaling decisions may be based on
> > the utilization data passed to cpufreq_update_util() by CFS.
> >
> > The new governor is relatively simple.
> >
> > The frequency selection formula used by it is
> >
> > next_freq = util * max_freq / max
> >
> > where util and max are the utilization and CPU capacity coming from
> > CFS.
> >
>
> The formula looks better to me now. However, problem is that, if you
> have freq. invariance, util will slowly saturate to the current
> capacity. So, we won't trigger OPP changes for a task that for
> example
> starts light and then becomes big.
>
> This is the same problem we faced with schedfreq. The current
> solution
> there is to use a margin for calculating a threshold (80% of current
> capacity ATM). Once util goes above that threshold we trigger an OPP
> change.  Current policy is pretty aggressive, we go to max_f and then
> adapt to the "real" util during successive enqueues. This was also
> thought to cope with the fact that PELT seems slow to react to abrupt
> changes in tasks behaviour.
>
I also tried something like this in intel_pstate with scheduler util,
where you ramp up to turbo when a threshold percentage is exceeded and
then ramp down slowly in steps. This helped some workloads like tbench
perform better, but it resulted in lower performance/watt on the
specpower server workload. The problem is finding the right threshold
value.
Thanks,
Srinivas
> I'm not saying this is the definitive solution, but I fear something
> along this line is needed when you add freq invariance in the mix.
>
> Best,
>
> - Juri
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm"
> in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
On 03/04/2016 05:30 AM, Rafael J. Wysocki wrote:
> +void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
> +{
> + struct freq_update_hook *hook;
> +
> +#ifdef CONFIG_LOCKDEP
> + WARN_ON(debug_locks && !rcu_read_lock_sched_held());
> +#endif
> +
> + hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook));
> + /*
> + * If this isn't inside of an RCU-sched read-side critical section, hook
> + * may become NULL after the check below.
> + */
> + if (hook) {
> + if (hook->update_util)
> + hook->update_util(hook, time, util, max);
> + else
> + hook->func(hook, time);
> + }
Is it worth having two hook types?
On Fri, Mar 4, 2016 at 10:21 PM, Steve Muckle <[email protected]> wrote:
> On 03/04/2016 05:30 AM, Rafael J. Wysocki wrote:
>> +void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
>> +{
>> + struct freq_update_hook *hook;
>> +
>> +#ifdef CONFIG_LOCKDEP
>> + WARN_ON(debug_locks && !rcu_read_lock_sched_held());
>> +#endif
>> +
>> + hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook));
>> + /*
>> + * If this isn't inside of an RCU-sched read-side critical section, hook
>> + * may become NULL after the check below.
>> + */
>> + if (hook) {
>> + if (hook->update_util)
>> + hook->update_util(hook, time, util, max);
>> + else
>> + hook->func(hook, time);
>> + }
>
> Is it worth having two hook types?
Well, that's why I said "maybe over the top" in the changelog comments. :-)
If we want to isolate the "old" governors from util/max entirely, then yes.
If we don't care that much, then no.
I'm open to both possibilities.
Thanks,
Rafael
On Fri, Mar 4, 2016 at 10:27 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Fri, Mar 4, 2016 at 10:21 PM, Steve Muckle <[email protected]> wrote:
>> On 03/04/2016 05:30 AM, Rafael J. Wysocki wrote:
>>> +void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
>>> +{
>>> + struct freq_update_hook *hook;
>>> +
>>> +#ifdef CONFIG_LOCKDEP
>>> + WARN_ON(debug_locks && !rcu_read_lock_sched_held());
>>> +#endif
>>> +
>>> + hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook));
>>> + /*
>>> + * If this isn't inside of an RCU-sched read-side critical section, hook
>>> + * may become NULL after the check below.
>>> + */
>>> + if (hook) {
>>> + if (hook->update_util)
>>> + hook->update_util(hook, time, util, max);
>>> + else
>>> + hook->func(hook, time);
>>> + }
>>
>> Is it worth having two hook types?
>
> Well, that's why I said "maybe over the top" in the changelog comments. :-)
>
> If we want to isolate the "old" governors from util/max entirely, then yes.
>
> If we don't care that much, then no.
>
> I'm open to both possibilities.
But in the latter case I don't see a particular reason to put the new
governor under kernel/sched/ too and as I wrote in the changelog
comments to patch [10/10], I personally think that it would be cleaner
to keep it under drivers/cpufreq/.
Thanks,
Rafael
On 03/03/2016 07:07 PM, Rafael J. Wysocki wrote:
> +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
> + unsigned int target_freq, unsigned int relation)
> +{
> + unsigned int freq;
> +
> + freq = cpufreq_driver->fast_switch(policy, target_freq, relation);
> + if (freq != CPUFREQ_ENTRY_INVALID) {
> + policy->cur = freq;
> + trace_cpu_frequency(freq, smp_processor_id());
> + }
> +}
Even if there are platforms which may change the CPU frequency behind
cpufreq's back, breaking the transition notifiers, I'm worried about the
addition of an interface which itself breaks them. The platforms which
do change CPU frequency on their own have probably evolved to live with
or work around this behavior. As other platforms migrate to fast
frequency switching they might be surprised when things don't work as
advertised.
I'm not sure what the easiest way to deal with this is. I see the
transition notifiers are the srcu type, which I understand to be
blocking. Going through the tree and reworking everyone's callbacks and
changing the type to atomic is obviously not realistic.
How about modifying cpufreq_register_notifier to return an error if the
driver has a fast_switch callback installed and an attempt to register a
transition notifier is made?
In the future, perhaps an additional atomic transition callback type can
be added, which platform/driver owners can switch to if they wish to use
fast transitions with their platform.
thanks,
Steve
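A rough sketch of the check suggested above (untested, not a posted patch); it would live in cpufreq.c next to the existing cpufreq_driver pointer:

static bool fast_switch_blocks_notifier(unsigned int list)
{
	/* Transition notifiers cannot be honoured on the fast switch path. */
	return list == CPUFREQ_TRANSITION_NOTIFIER &&
	       cpufreq_driver && cpufreq_driver->fast_switch;
}

/*
 * In cpufreq_register_notifier(), near the top:
 *
 *	if (fast_switch_blocks_notifier(list))
 *		return -EBUSY;
 */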
On Fri, Mar 4, 2016 at 11:18 PM, Steve Muckle <[email protected]> wrote:
> On 03/03/2016 07:07 PM, Rafael J. Wysocki wrote:
>> +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
>> + unsigned int target_freq, unsigned int relation)
>> +{
>> + unsigned int freq;
>> +
>> + freq = cpufreq_driver->fast_switch(policy, target_freq, relation);
>> + if (freq != CPUFREQ_ENTRY_INVALID) {
>> + policy->cur = freq;
>> + trace_cpu_frequency(freq, smp_processor_id());
>> + }
>> +}
>
> Even if there are platforms which may change the CPU frequency behind
> cpufreq's back, breaking the transition notifiers, I'm worried about the
> addition of an interface which itself breaks them. The platforms which
> do change CPU frequency on their own have probably evolved to live with
> or work around this behavior. As other platforms migrate to fast
> frequency switching they might be surprised when things don't work as
> advertised.
Well, intel_pstate doesn't do notifies at all, so anything depending
on them is already broken when it is used. Let alone the hardware
P-states coordination mechanism (HWP) where the frequency is
controlled by the processor itself entirely.
That said I see your point.
> I'm not sure what the easiest way to deal with this is. I see the
> transition notifiers are the srcu type, which I understand to be
> blocking. Going through the tree and reworking everyone's callbacks and
> changing the type to atomic is obviously not realistic.
Right.
> How about modifying cpufreq_register_notifier to return an error if the
> driver has a fast_switch callback installed and an attempt to register a
> transition notifier is made?
That sounds like a good idea.
There also is the CPUFREQ_ASYNC_NOTIFICATION driver flag that in
principle might be used as a workaround, but I'm not sure how much
work that would require ATM.
> In the future, perhaps an additional atomic transition callback type can
> be added, which platform/driver owners can switch to if they wish to use
> fast transitions with their platform.
I guess you mean an atomic notification mechanism based on registering
callbacks? While technically viable that's somewhat risky, because we
are in a fast path and allowing anyone to add stuff to it would be
asking for trouble IMO.
Thanks,
Rafael
On Fri, Mar 4, 2016 at 11:32 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Fri, Mar 4, 2016 at 11:18 PM, Steve Muckle <[email protected]> wrote:
>> On 03/03/2016 07:07 PM, Rafael J. Wysocki wrote:
>>> +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
>>> + unsigned int target_freq, unsigned int relation)
>>> +{
>>> + unsigned int freq;
>>> +
>>> + freq = cpufreq_driver->fast_switch(policy, target_freq, relation);
>>> + if (freq != CPUFREQ_ENTRY_INVALID) {
>>> + policy->cur = freq;
>>> + trace_cpu_frequency(freq, smp_processor_id());
>>> + }
>>> +}
>>
>> Even if there are platforms which may change the CPU frequency behind
>> cpufreq's back, breaking the transition notifiers, I'm worried about the
>> addition of an interface which itself breaks them. The platforms which
>> do change CPU frequency on their own have probably evolved to live with
>> or work around this behavior. As other platforms migrate to fast
>> frequency switching they might be surprised when things don't work as
>> advertised.
>
> Well, intel_pstate doesn't do notifies at all, so anything depending
> on them is already broken when it is used. Let alone the hardware
> P-states coordination mechanism (HWP) where the frequency is
> controlled by the processor itself entirely.
>
> That said I see your point.
>
>> I'm not sure what the easiest way to deal with this is. I see the
>> transition notifiers are the srcu type, which I understand to be
>> blocking. Going through the tree and reworking everyone's callbacks and
>> changing the type to atomic is obviously not realistic.
>
> Right.
>
>> How about modifying cpufreq_register_notifier to return an error if the
>> driver has a fast_switch callback installed and an attempt to register a
>> transition notifier is made?
>
> That sounds like a good idea.
>
> There also is the CPUFREQ_ASYNC_NOTIFICATION driver flag that in
> principle might be used as a workaround, but I'm not sure how much
> work that would require ATM.
What I mean is that drivers using it are supposed to handle the
notifications by calling cpufreq_freq_transition_begin()/end() by
themselves, so theoretically there is a mechanism already in place for
that.
I guess what might be done would be to spawn a work item to carry out
a notify when the frequency changes.
Thanks,
Rafael
On Fri, Mar 4, 2016 at 11:32 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Fri, Mar 4, 2016 at 11:18 PM, Steve Muckle <[email protected]> wrote:
>> On 03/03/2016 07:07 PM, Rafael J. Wysocki wrote:
>>> +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
>>> + unsigned int target_freq, unsigned int relation)
>>> +{
>>> + unsigned int freq;
>>> +
>>> + freq = cpufreq_driver->fast_switch(policy, target_freq, relation);
>>> + if (freq != CPUFREQ_ENTRY_INVALID) {
>>> + policy->cur = freq;
>>> + trace_cpu_frequency(freq, smp_processor_id());
>>> + }
>>> +}
>>
>> Even if there are platforms which may change the CPU frequency behind
>> cpufreq's back, breaking the transition notifiers, I'm worried about the
>> addition of an interface which itself breaks them. The platforms which
>> do change CPU frequency on their own have probably evolved to live with
>> or work around this behavior. As other platforms migrate to fast
>> frequency switching they might be surprised when things don't work as
>> advertised.
>
> Well, intel_pstate doesn't do notifies at all, so anything depending
> on them is already broken when it is used. Let alone the hardware
> P-states coordination mechanism (HWP) where the frequency is
> controlled by the processor itself entirely.
>
> That said I see your point.
>
>> I'm not sure what the easiest way to deal with this is. I see the
>> transition notifiers are the srcu type, which I understand to be
>> blocking. Going through the tree and reworking everyone's callbacks and
>> changing the type to atomic is obviously not realistic.
>
> Right.
>
>> How about modifying cpufreq_register_notifier to return an error if the
>> driver has a fast_switch callback installed and an attempt to register a
>> transition notifier is made?
>
> That sounds like a good idea.
Transition notifiers may be registered before the driver is
registered, so that won't help in all cases.
Thanks,
Rafael
On Fri, Mar 4, 2016 at 11:40 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Fri, Mar 4, 2016 at 11:32 PM, Rafael J. Wysocki <[email protected]> wrote:
>> On Fri, Mar 4, 2016 at 11:18 PM, Steve Muckle <[email protected]> wrote:
>>> On 03/03/2016 07:07 PM, Rafael J. Wysocki wrote:
>>>> +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
>>>> + unsigned int target_freq, unsigned int relation)
>>>> +{
>>>> + unsigned int freq;
>>>> +
>>>> + freq = cpufreq_driver->fast_switch(policy, target_freq, relation);
>>>> + if (freq != CPUFREQ_ENTRY_INVALID) {
>>>> + policy->cur = freq;
>>>> + trace_cpu_frequency(freq, smp_processor_id());
>>>> + }
>>>> +}
>>>
>>> Even if there are platforms which may change the CPU frequency behind
>>> cpufreq's back, breaking the transition notifiers, I'm worried about the
>>> addition of an interface which itself breaks them. The platforms which
>>> do change CPU frequency on their own have probably evolved to live with
>>> or work around this behavior. As other platforms migrate to fast
>>> frequency switching they might be surprised when things don't work as
>>> advertised.
>>
>> Well, intel_pstate doesn't do notifies at all, so anything depending
>> on them is already broken when it is used. Let alone the hardware
>> P-states coordination mechanism (HWP) where the frequency is
>> controlled by the processor itself entirely.
>>
>> That said I see your point.
>>
>>> I'm not sure what the easiest way to deal with this is. I see the
>>> transition notifiers are the srcu type, which I understand to be
>>> blocking. Going through the tree and reworking everyone's callbacks and
>>> changing the type to atomic is obviously not realistic.
>>
>> Right.
>>
>>> How about modifying cpufreq_register_notifier to return an error if the
>>> driver has a fast_switch callback installed and an attempt to register a
>>> transition notifier is made?
>>
>> That sounds like a good idea.
>>
>> There also is the CPUFREQ_ASYNC_NOTIFICATION driver flag that in
>> principle might be used as a workaround, but I'm not sure how much
>> work that would require ATM.
>
> What I mean is that drivers using it are supposed to handle the
> notifications by calling cpufreq_freq_transition_begin()/end() by
> themselves, so theoretically there is a mechanism already in place for
> that.
>
> I guess what might be done would be to spawn a work item to carry out
> a notify when the frequency changes.
In fact, the mechanism may be relatively simple if I'm not mistaken.
In the "fast switch" case, the governor may spawn a work item that
will just execute cpufreq_get() on policy->cpu. That will notice that
policy->cur is different from the real current frequency and will
re-adjust.
Of course, cpufreq_driver_fast_switch() will need to be modified so it
doesn't update policy->cur then perhaps with a comment that the
governor using it will be responsible for that.
And the governor will need to avoid spawning that work item too often
(basically, if one has been spawned already and hasn't completed, no
need to spawn a new one, and maybe rate-limit it?), but all that looks
reasonably straightforward.
Thanks,
Rafael
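A very rough sketch of that mechanism (everything here, including the one-pending-work policy, is an assumption of this discussion rather than a posted patch):

#include <linux/atomic.h>
#include <linux/cpufreq.h>
#include <linux/workqueue.h>

struct my_gov_policy {
	struct cpufreq_policy *policy;
	struct work_struct notify_work;		/* set up with INIT_WORK() */
	atomic_t notify_pending;
};

static void my_gov_notify_workfn(struct work_struct *work)
{
	struct my_gov_policy *gp = container_of(work, struct my_gov_policy,
						notify_work);

	/*
	 * cpufreq_get() sees that policy->cur differs from the real current
	 * frequency and re-adjusts, issuing the usual notifications.
	 */
	cpufreq_get(gp->policy->cpu);
	atomic_set(&gp->notify_pending, 0);
}

/* Called from the fast switch path after the frequency has been changed. */
static void my_gov_kick_notify(struct my_gov_policy *gp)
{
	/* Don't pile up work items; one outstanding notification is enough. */
	if (atomic_cmpxchg(&gp->notify_pending, 0, 1) == 0)
		schedule_work(&gp->notify_work);
}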
On 03/04/2016 03:18 PM, Rafael J. Wysocki wrote:
> In fact, the mechanism may be relatively simple if I'm not mistaken.
>
> In the "fast switch" case, the governor may spawn a work item that
> will just execute cpufreq_get() on policy->cpu. That will notice that
> policy->cur is different from the real current frequency and will
> re-adjust.
>
> Of course, cpufreq_driver_fast_switch() will need to be modified so it
> doesn't update policy->cur then perhaps with a comment that the
> governor using it will be responsible for that.
>
> And the governor will need to avoid spawning that work item too often
> (basically, if one has been spawned already and hasn't completed, no
> need to spawn a new one, and maybe rate-limit it?), but all that looks
> reasonably straightforward.
It is another option though definitely a compromise. The semantics seem
different since you'd potentially have multiple freq changes before a
single notifier went through, so stuff might still break. The fast path
would also be more expensive given the workqueue activity that could
translate into additional task wakeups.
Honestly I wonder if it's better to just try the "no notifiers with fast
drivers" approach to start. The notifiers could always be added if
platform owners complain that they absolutely require them.
thanks,
Steve
On 03/04/2016 02:58 PM, Rafael J. Wysocki wrote:
>>> How about modifying cpufreq_register_notifier to return an error if the
>>> >> driver has a fast_switch callback installed and an attempt to register a
>>> >> transition notifier is made?
>> >
>> > That sounds like a good idea.
>
> Transition notifiers may be registered before the driver is
> registered, so that won't help in all cases.
Could that hole be closed by a similar check in
cpufreq_register_driver()? I.e. if the transition_notifier list is not
empty, fail to register the driver (if the driver has a fast_switch
routine)?
Or alternatively, the fast_switch routine is not installed.
thanks,
Steve
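A sketch of that complementary check (not a posted patch); how "transition notifiers are registered" would be detected is an assumption, e.g. a flag or counter maintained by cpufreq_register_notifier():

/* Assumed to be updated by cpufreq_register_notifier()/unregister. */
static bool transition_notifiers_registered;

/*
 * In cpufreq_register_driver(), before the driver is installed, either
 * refuse the driver:
 *
 *	if (driver_data->fast_switch && transition_notifiers_registered)
 *		return -EBUSY;
 *
 * or, alternatively, just don't install the fast path:
 *
 *	if (driver_data->fast_switch && transition_notifiers_registered)
 *		driver_data->fast_switch = NULL;
 */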
On Sat, Mar 5, 2016 at 12:56 AM, Steve Muckle <[email protected]> wrote:
> On 03/04/2016 03:18 PM, Rafael J. Wysocki wrote:
>> In fact, the mechanism may be relatively simple if I'm not mistaken.
>>
>> In the "fast switch" case, the governor may spawn a work item that
>> will just execute cpufreq_get() on policy->cpu. That will notice that
>> policy->cur is different from the real current frequency and will
>> re-adjust.
>>
>> Of course, cpufreq_driver_fast_switch() will need to be modified so it
>> doesn't update policy->cur then perhaps with a comment that the
>> governor using it will be responsible for that.
>>
>> And the governor will need to avoid spawning that work item too often
>> (basically, if one has been spawned already and hasn't completed, no
>> need to spawn a new one, and maybe rate-limit it?), but all that looks
>> reasonably straightforward.
>
> It is another option though definitely a compromise. The semantics seem
> different since you'd potentially have multiple freq changes before a
> single notifier went through, so stuff might still break.
Here I'm not worried. That's basically equivalent to someone doing a
"get" and seeing an unexpected frequency in the driver output, which is
covered already, and things need to cope with it or they are just
really broken.
> The fast path would also be more expensive given the workqueue activity that could
> translate into additional task wakeups.
That's a valid concern, so maybe there can be a driver flag to
indicate that this has to be done if ->fast_switch is in use? Or
something like fast_switch_notify_rate that will tell the governor how
often to notify things about transitions if ->fast_switch is in use
with either 0 or all ones meaning "never"? That might be a policy
property even, so the driver may set this depending on what platform
it is used on.
> Honestly I wonder if it's better to just try the "no notifiers with fast
> drivers" approach to start. The notifiers could always be added if
> platform owners complain that they absolutely require them.
Well, I'm not sure what happens if we start to fail notifier
registrations. It may not be a well tested error code path. :-)
Besides, there is the problem with registering notifiers before the
driver and I don't think we can fail driver registration if notifiers
have already been registered. We may not be able to register a "fast"
driver at all in that case.
But that whole thing is your worry, not mine. :-)
Had I been worrying about that, I would have added some bandaid for
that to the patches.
Thanks,
Rafael
* Rafael J. Wysocki <[email protected]> wrote:
> > Honestly I wonder if it's better to just try the "no notifiers with fast
> > drivers" approach to start. The notifiers could always be added if platform
> > owners complain that they absolutely require them.
>
> Well, I'm not sure what happens if we start to fail notifier registrations. It
> may not be a well tested error code path. :-)
Yeah, so as a general principle 'struct notifier_block' is a really bad interface
with poor and fragile semantics, and we are trying to get rid of them everywhere
from core kernel code. For example Thomas Gleixner et al are working on eliminating
them from the CPU hotplug code - which will get rid of most remaining notifier
uses from the scheduler as well.
So please add explicit cpufreq driver callback functions instead, which can be
filled in by a platform if needed. No notifiers!
Thanks,
Ingo
On Sat, Mar 05, 2016 at 12:18:54AM +0100, Rafael J. Wysocki wrote:
> >>> Even if there are platforms which may change the CPU frequency behind
> >>> cpufreq's back, breaking the transition notifiers, I'm worried about the
> >>> addition of an interface which itself breaks them. The platforms which
> >>> do change CPU frequency on their own have probably evolved to live with
> >>> or work around this behavior. As other platforms migrate to fast
> >>> frequency switching they might be surprised when things don't work as
> >>> advertised.
There are only 43 call sites of cpufreq_register_notifier in 37 files; that
should be fairly simple to audit.
> >>> I'm not sure what the easiest way to deal with this is. I see the
> >>> transition notifiers are the srcu type, which I understand to be
> >>> blocking. Going through the tree and reworking everyone's callbacks and
> >>> changing the type to atomic is obviously not realistic.
> >>
> >> Right.
Even if it was (and per the above it looks entirely feasible), that's
just not going to happen. We're not ever going to call random notifier
crap from this deep within the scheduler.
> >>> How about modifying cpufreq_register_notifier to return an error if the
> >>> driver has a fast_switch callback installed and an attempt to register a
> >>> transition notifier is made?
> >>
> >> That sounds like a good idea.
Agreed, fail the stuff hard.
Simply make cpufreq_register_notifier a __must_check function and add
error handling to all call sites.
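FWIW the header change itself would be tiny, something like (untested sketch):

-int cpufreq_register_notifier(struct notifier_block *nb, unsigned int list);
+int __must_check cpufreq_register_notifier(struct notifier_block *nb,
+					    unsigned int list);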
> > I guess what might be done would be to spawn a work item to carry out
> > a notify when the frequency changes.
>
> In fact, the mechanism may be relatively simple if I'm not mistaken.
>
> In the "fast switch" case, the governor may spawn a work item that
> will just execute cpufreq_get() on policy->cpu. That will notice that
> policy->cur is different from the real current frequency and will
> re-adjust.
>
> Of course, cpufreq_driver_fast_switch() will need to be modified so it
> doesn't update policy->cur then perhaps with a comment that the
> governor using it will be responsible for that.
No no no, that's just horrible. Why would you want to keep this
notification stuff alive? If your platform can change frequency 'fast'
you don't want notifiers.
What's the point of a notification that says: "At some point in the
random past my frequency has changed, and it likely has changed again
since then, do 'something'."
That's pointless. If you have dependent clock domains or whatever, you
simply _cannot_ be fast.
On Sat, Mar 5, 2016 at 5:49 PM, Peter Zijlstra <[email protected]> wrote:
> On Sat, Mar 05, 2016 at 12:18:54AM +0100, Rafael J. Wysocki wrote:
>
>> >>> Even if there are platforms which may change the CPU frequency behind
>> >>> cpufreq's back, breaking the transition notifiers, I'm worried about the
>> >>> addition of an interface which itself breaks them. The platforms which
>> >>> do change CPU frequency on their own have probably evolved to live with
>> >>> or work around this behavior. As other platforms migrate to fast
>> >>> frequency switching they might be surprised when things don't work as
>> >>> advertised.
>
> There's only 43 sites of cpufreq_register_notifier in 37 files, that
> should be fairly simple to audit.
>
>> >>> I'm not sure what the easiest way to deal with this is. I see the
>> >>> transition notifiers are the srcu type, which I understand to be
>> >>> blocking. Going through the tree and reworking everyone's callbacks and
>> >>> changing the type to atomic is obviously not realistic.
>> >>
>> >> Right.
>
> Even if it was (and per the above it looks entirely feasible), that's
> just not going to happen. We're not ever going to call random notifier
> crap from this deep within the scheduler.
>
>> >>> How about modifying cpufreq_register_notifier to return an error if the
>> >>> driver has a fast_switch callback installed and an attempt to register a
>> >>> transition notifier is made?
>> >>
>> >> That sounds like a good idea.
>
> Agreed, fail the stuff hard.
>
> Simply make cpufreq_register_notifier a __must_check function and add
> error handling to all call sites.
Quite frankly, I don't see a compelling reason to do anything about
the notifications at this point.
The ACPI driver is the only one that will support fast switching for
the time being and on practically all platforms that can use the ACPI
driver the transition notifications cannot be relied on anyway for a
few reasons. First, if intel_pstate or HWP is in use, they won't be
coming at all. Second, anything turbo will just change frequency at
will without notifying (like HWP). Finally, if they are coming,
whoever receives them is notified about the frequency that is
requested and not the real one, which is misleading, because (a) the
request may just make the CPU go into the turbo range and then see
above or (b) if the CPU is in a platform-coordinated package, its
request will only be granted if it's the winning one.
>> > I guess what might be done would be to spawn a work item to carry out
>> > a notify when the frequency changes.
>>
>> In fact, the mechanism may be relatively simple if I'm not mistaken.
>>
>> In the "fast switch" case, the governor may spawn a work item that
>> will just execute cpufreq_get() on policy->cpu. That will notice that
>> policy->cur is different from the real current frequency and will
>> re-adjust.
>>
>> Of course, cpufreq_driver_fast_switch() will need to be modified so it
>> doesn't update policy->cur then perhaps with a comment that the
>> governor using it will be responsible for that.
>
> No no no, that's just horrible. Why would you want to keep this
> notification stuff alive? If your platform can change frequency 'fast'
> you don't want notifiers.
I'm not totally sure about that.
>
> What's the point of a notification that says: "At some point in the
> random past my frequency has changed, and it likely has changed again
> since then, do 'something'."
>
> That's pointless. If you have dependent clock domains or whatever, you
> simply _cannot_ be fast.
>
What about thermal? They don't need to get very accurate information,
but they need to be updated on a regular basis. It would do if they
get averages instead of momentary values (and may be better even).
On Thursday, March 03, 2016 01:37:59 PM Steve Muckle wrote:
> On 03/03/2016 12:20 PM, Rafael J. Wysocki wrote:
> >> Here is a comparison, with frequency invariance, of ondemand and
> >> interactive with schedfreq and schedutil. The first two columns (run and
> >> period) are omitted so the table will fit.
> >>
> >>            ondemand      interactive    schedfreq      schedutil
> >> busy %    OR  OH        OR  OH         OR  OH         OR  OH
> >>  1.00%     0  68.96%     0  100.04%     0  78.49%      0  95.86%
> >>  1.00%     0  25.04%     0   22.59%     0  72.56%      0  71.61%
> >> 10.00%     0  21.75%     0   63.08%     0  52.40%      0  41.78%
> >> 10.00%     0  12.17%     0   14.41%     0  17.33%      0  47.96%
> >> 10.00%     0   2.57%     0    2.17%     0   0.29%      0  26.03%
> >> 18.18%     0  12.39%     0    9.39%     0  17.34%      0  31.61%
> >> 19.82%     0   3.74%     0    3.42%     0  12.26%      0  29.46%
> >> 40.00%     2   6.26%     1   12.23%     0   6.15%      0  12.93%
> >> 40.00%     0   0.47%     0    0.05%     0   2.68%      2  14.08%
> >> 40.00%     0   0.60%     0    0.50%     0   1.22%      0  11.58%
> >> 55.56%     2   4.25%     5    5.97%     0   2.51%      0   7.70%
> >> 55.56%     0   1.89%     0    0.04%     0   1.71%      6   8.06%
> >> 55.56%     0   0.50%     0    0.47%     0   1.82%      5   6.94%
> >> 75.00%     2   1.65%     1    0.46%     0   0.26%     56   3.59%
> >> 75.00%     0   1.68%     0    0.05%     0   0.49%     21   3.94%
> >> 75.00%     0   0.28%     0    0.23%     0   0.62%      4   4.41%
> >>
> >> Aside from the 2nd and 3rd tests schedutil is showing decreased
> >> performance across the board. The fifth test is particularly bad.
> >
> > I guess you mean performance in terms of the overhead?
>
> Correct. This overhead metric describes how fast the workload completes,
> with 0% equaling the perf governor and 100% equaling the powersave
> governor. So it's a reflection of general performance using the
> governor. It's called "overhead" I imagine (the metric predates my
> involvement) as it is something introduced/caused by the policy of the
> governor.
If my understanding of the frequency invariant utilization idea is correct,
it is about re-scaling utilization so it is always relative to the capacity
at the max frequency. If that's the case, then instead of using x = util_raw / max
we will use something like y = (util_raw / max) * (f / max_freq), where f is the
current frequency. This means that
(1) x = y * max_freq / f
Now, say we have an agreed-on (linear) formula for f depending on x:
f = a * x + b
and if you say "Look, if I substitute y for x in this formula, it doesn't
produce correct results", then I can only say "It doesn't, because it can't".
It *obviously* won't work, because instead of substituting y for x, you
need to substitute the right-hand side of (1) for it. Then you'll get
f = a * y * max_freq / f + b
which is obviously nonlinear, so there's no hope that the same formula
will ever work for both "raw" and "frequency invariant" utilization.
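Here's a toy user space illustration of that (all numbers are made up and
a = max_freq, b = 0 are taken as the example coefficients):

#include <stdio.h>

int main(void)
{
	unsigned long max = 1024;		/* capacity at max frequency */
	unsigned long max_freq = 2000000;	/* kHz */
	unsigned long f = 1000000;		/* current frequency, kHz */
	unsigned long util_raw = 512;
	/* frequency invariant utilization on the same 0..max scale */
	unsigned long util_inv = util_raw * f / max_freq;

	/* formula tuned for "raw" utilization */
	printf("x:            %lu kHz\n", max_freq * util_raw / max);
	/* substituting y directly gives a different result */
	printf("y, unscaled:  %lu kHz\n", max_freq * util_inv / max);
	/* substituting y * max_freq / f, i.e. the right-hand side of (1) */
	printf("y, rescaled:  %lu kHz\n",
	       max_freq * (util_inv * max_freq / f) / max);
	return 0;
}

which prints 1000000, 500000 and 1000000 kHz, respectively.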
To me this means that looking for a formula that will work for both is
just pointless and there are 3 possibilities:
(a) Look for a good enough formula to apply to "raw" utilization and then
switch over when all architectures start to use "frequency invariant"
utilization.
(b) Make all architectures use "frequency invariant" and then look for a
working formula (seems rather less than realistic to me to be honest).
(c) Code for using either "raw" or "frequency invariant" depending on
a callback flag or something like that.
I, personally, would go for (a) at this point, because that's the easiest
one, but (c) would be doable too IMO, so I don't care that much as long
as it is not (b).
Thanks,
Rafael
On Sun, Mar 06, 2016 at 03:17:09AM +0100, Rafael J. Wysocki wrote:
> > Agreed, fail the stuff hard.
> >
> > Simply make cpufreq_register_notifier a __must_check function and add
> > error handling to all call sites.
>
> Quite frankly, I don't see a compelling reason to do anything about
> the notifications at this point.
>
> The ACPI driver is the only one that will support fast switching for
> the time being and on practically all platforms that can use the ACPI
> driver the transition notifications cannot be relied on anyway for a
> few reasons. First, if intel_pstate or HWP is in use, they won't be
> coming at all. Second, anything turbo will just change frequency at
> will without notifying (like HWP). Finally, if they are coming,
> whoever receives them is notified about the frequency that is
> requested and not the real one, which is misleading, because (a) the
> request may just make the CPU go into the turbo range and then see
> above or (b) if the CPU is in a platform-coordinated package, its
> request will only be granted if it's the winning one.
Sure I know all that. But that, to me, seems like an argument for why
you should have done this a long time ago.
Someone registering a notifier you _know_ won't be called reliably is a
sure sign of borkage. And you want to be notified (pun intended) of
borkage.
So the alternative option to making the registration fail is making the
registration WARN (and possibly disabling fast support in the driver).
But I do think something wants to be done here.
> > No no no, that's just horrible. Why would you want to keep this
> > notification stuff alive? If your platform can change frequency 'fast'
> > you don't want notifiers.
>
> I'm not totally sure about that.
I am. Per definition, if you need to call notifiers, you're not fast.
I would really suggest making that a hard rule and enforcing it.
> > What's the point of a notification that says: "At some point in the
> > random past my frequency has changed, and it likely has changed again
> > since then, do 'something'."
> >
> > That's pointless. If you have dependent clock domains or whatever, you
> > simply _cannot_ be fast.
> >
>
> What about thermal? They don't need to get very accurate information,
> but they need to be updated on a regular basis. It would do if they
> get averages instead of momentary values (and may be better even).
Thermal, should be an integral part of cpufreq, but if they need a
callback from the switching hook (and here I would like to remind
everyone that this is inside scheduler hot paths and the more code you
stuff in the harder the performance regressions will hit you in the
face) it can get a direct function call. No need for no stinking
notifiers.
On Mon, Mar 7, 2016 at 9:00 AM, Peter Zijlstra <[email protected]> wrote:
> On Sun, Mar 06, 2016 at 03:17:09AM +0100, Rafael J. Wysocki wrote:
>> > Agreed, fail the stuff hard.
>> >
>> > Simply make cpufreq_register_notifier a __must_check function and add
>> > error handling to all call sites.
>>
>> Quite frankly, I don't see a compelling reason to do anything about
>> the notifications at this point.
>>
>> The ACPI driver is the only one that will support fast switching for
>> the time being and on practically all platforms that can use the ACPI
>> driver the transition notifications cannot be relied on anyway for a
>> few reasons. First, if intel_pstate or HWP is in use, they won't be
>> coming at all. Second, anything turbo will just change frequency at
>> will without notifying (like HWP). Finally, if they are coming,
>> whoever receives them is notified about the frequency that is
>> requested and not the real one, which is misleading, because (a) the
>> request may just make the CPU go into the turbo range and then see
>> above or (b) if the CPU is in a platform-coordinated package, its
>> request will only be granted if it's the winning one.
>
> Sure I know all that. But that, to me, seems like an argument for why
> you should have done this a long time ago.
While I generally agree with this, I don't quite see why cleaning that
up necessarily has to be connected to the current patch series which
is my point.
> Someone registering a notifier you _know_ won't be called reliably is a
> sure sign of borkage. And you want to be notified (pun intended) of
> borkage.
>
> So the alternative option to making the registration fail, is making the
> registration WARN (and possibly disable fast support in the driver).
>
> But I do think something wants to be done here.
So here's what I can do for the "fast switch" thing.
There is the fast_switch_possible policy flag that's necessary anyway.
I can make notifier registration fail when that is set for at least
one policy and I can make the setting of it fail if at least one
notifier has already been registered.
However, without spending too much time chasing code dependencies, I
sort of suspect that it will uncover things that register cpufreq
notifiers early and it won't be possible to use fast switch without
sorting that out. And that won't even change anything apart from
removing some code that has not worked for quite a while already and
nobody noticed.
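In pseudo-code, something like this (the counter name is invented here, it
is not in the series):

static atomic_t fast_switch_count = ATOMIC_INIT(0);

/* bumped whenever a policy with fast_switch_possible set is started */

/* in cpufreq_register_notifier(), before the actual registration: */
	if (list == CPUFREQ_TRANSITION_NOTIFIER &&
	    atomic_read(&fast_switch_count))
		return -EBUSY;

/* plus the converse check: refuse to set fast_switch_possible for a
 * policy once a transition notifier has been registered */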
>> > No no no, that's just horrible. Why would you want to keep this
>> > notification stuff alive? If your platform can change frequency 'fast'
>> > you don't want notifiers.
>>
>> I'm not totally sure about that.
>
> I am, per definition, if you need to call notifiers, you're not fast.
>
> I would really suggest making that a hard rule and enforcing it.
OK, but see above.
It is doable for the "fast switch" thing, but it won't help in all of
the other cases when notifications are not reliable.
>> > What's the point of a notification that says: "At some point in the
>> > random past my frequency has changed, and it likely has changed again
>> > since then, do 'something'."
>> >
>> > That's pointless. If you have dependent clock domains or whatever, you
>> > simply _cannot_ be fast.
>> >
>>
>> What about thermal? They don't need to get very accurate information,
>> but they need to be updated on a regular basis. It would do if they
>> get averages instead of momentary values (and may be better even).
>
> Thermal, should be an integral part of cpufreq, but if they need a
> callback from the switching hook (and here I would like to remind
> everyone that this is inside scheduler hot paths and the more code you
> stuff in the harder the performance regressions will hit you in the
> face)
Calling notifiers (or any kind of callbacks that anyone can register)
from there is out of the question.
> it can get a direct function call. No need for no stinking
> notifiers.
I'm not talking about hooks in the switching code but *some* way to
let stuff know about frequency changes.
If it changes frequently enough, it's not practical and not even
necessary to cause things like thermal to react on every change, but I
think there needs to be a way to make them reevaluate things
regularly. Arguably, they might set a timer for that, but why would
they need a timer if they could get triggered by the code that
actually makes changes?
On Mon, Mar 07, 2016 at 02:15:47PM +0100, Rafael J. Wysocki wrote:
> On Mon, Mar 7, 2016 at 9:00 AM, Peter Zijlstra <[email protected]> wrote:
> > Sure I know all that. But that, to me, seems like an argument for why
> > you should have done this a long time ago.
>
> While I generally agree with this, I don't quite see why cleaning that
> up necessarily has to be connected to the current patch series which
> is my point.
Ah OK, fair enough I suppose. But someone should stick this on their
TODO list, we should not 'forget' about this (again).
> > But I do think something wants to be done here.
>
> So here's what I can do for the "fast switch" thing.
>
> There is the fast_switch_possible policy flag that's necessary anyway.
> I can make notifier registration fail when that is set for at least
> one policy and I can make the setting of it fail if at least one
> notifier has already been registered.
>
> However, without spending too much time on chasing code dependencies i
> sort of suspect that it will uncover things that register cpufreq
> notifiers early and it won't be possible to use fast switch without
> sorting that out.
The two x86 users don't register notifiers when CONSTANT_TSC, which
seems to be the right thing.
Much of the other users seem unlikely to be used on x86, so I suspect
the initial fallout will be very limited.
*groan* modules, cpufreq allows drivers to be modules, so init sequences
are poorly defined at best :/ Yes that blows.
> And that won't even change anything apart from
> removing some code that has not worked for quite a while already and
> nobody noticed.
Which is always a good thing, but yes, we can do this later.
> It is doable for the "fast switch" thing, but it won't help in all of
> the other cases when notifications are not reliable.
Right, you can maybe add a 'NOTIFIERS_BROKEN' flag to the intel_pstate
and HWP drivers or so, and trigger off of that.
> If it changes frequently enough, it's not practical and not even
> necessary to cause things like thermal to react on every change, but I
> think there needs to be a way to make them reevaluate things
> regularly. Arguably, they might set a timer for that, but why would
> they need a timer if they could get triggered by the code that
> actually makes changes?
So that very much depends on what thermal actually needs; but I suspect
that using a timer is cheaper than using irq_work to kick off something
else.
The irq_work is a LAPIC write (self IPI), just as the timer. However,
timers can be coalesced, resulting in, on average, less timer
reprogramming than there are handlers run.
Now, if thermal can do without work and can run in-line just like the
fast freq switch, then yes, that might make sense.
On Mon, Mar 7, 2016 at 2:32 PM, Peter Zijlstra <[email protected]> wrote:
> On Mon, Mar 07, 2016 at 02:15:47PM +0100, Rafael J. Wysocki wrote:
>> On Mon, Mar 7, 2016 at 9:00 AM, Peter Zijlstra <[email protected]> wrote:
>
>> > Sure I know all that. But that, to me, seems like an argument for why
>> > you should have done this a long time ago.
>>
>> While I generally agree with this, I don't quite see why cleaning that
>> up necessarily has to be connected to the current patch series which
>> is my point.
>
> Ah OK, fair enough I suppose. But someone should stick this on their
> TODO list, we should not 'forget' about this (again).
Sure.
>> > But I do think something wants to be done here.
>>
>> So here's what I can do for the "fast switch" thing.
>>
>> There is the fast_switch_possible policy flag that's necessary anyway.
>> I can make notifier registration fail when that is set for at least
>> one policy and I can make the setting of it fail if at least one
>> notifier has already been registered.
>>
>> However, without spending too much time on chasing code dependencies i
>> sort of suspect that it will uncover things that register cpufreq
>> notifiers early and it won't be possible to use fast switch without
>> sorting that out.
>
> The two x86 users don't register notifiers when CONSTANT_TSC, which
> seems to be the right thing.
>
> Much of the other users seem unlikely to be used on x86, so I suspect
> the initial fallout will be very limited.
OK, let me try this then.
> *groan* modules, cpufreq allows drivers to be modules, so init sequences
> are poorly defined at best :/ Yes that blows.
Yup.
>> And that won't even change anything apart from
>> removing some code that has not worked for quite a while already and
>> nobody noticed.
>
> Which is always a good thing, but yes, we can do this later.
>
>> It is doable for the "fast switch" thing, but it won't help in all of
>> the other cases when notifications are not reliable.
>
> Right, you can maybe add a 'NOTIFIERS_BROKEN' flag to the intel_p_state
> and HWP drivers or so, and trigger off of that.
Something like that, yes.
From: Rafael J. Wysocki <[email protected]>
A subsequent change set will introduce a new cpufreq governor using
CPU utilization information from the scheduler, so introduce
cpufreq_update_util() (again) to allow that information to be passed to
the new governor and make cpufreq_trigger_update() call it internally.
To that end, modify the ->func callback pointer in struct freq_update_hook
to take the util and max arguments in addition to the time one
and arrange helpers to set/clear the utilization update hooks
accordingly.
Modify the current users of cpufreq utilization update callbacks to
take the above changes into account.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
Changes from v2:
- One ->func callback for all of the users of struct freq_update_hook.
---
drivers/cpufreq/cpufreq_governor.c | 80 ++++++++++++++++++-------------------
drivers/cpufreq/intel_pstate.c | 12 +++--
include/linux/sched.h | 12 ++---
kernel/sched/cpufreq.c | 80 +++++++++++++++++++++++++++----------
kernel/sched/fair.c | 8 ++-
kernel/sched/sched.h | 9 ++++
6 files changed, 129 insertions(+), 72 deletions(-)
Index: linux-pm/include/linux/sched.h
===================================================================
--- linux-pm.orig/include/linux/sched.h
+++ linux-pm/include/linux/sched.h
@@ -2363,15 +2363,15 @@ static inline bool sched_can_stop_tick(v
#endif
#ifdef CONFIG_CPU_FREQ
-void cpufreq_trigger_update(u64 time);
-
struct freq_update_hook {
- void (*func)(struct freq_update_hook *hook, u64 time);
+ void (*func)(struct freq_update_hook *hook, u64 time,
+ unsigned long util, unsigned long max);
};
-void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook);
-#else
-static inline void cpufreq_trigger_update(u64 time) {}
+void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook,
+ void (*func)(struct freq_update_hook *hook, u64 time,
+ unsigned long util, unsigned long max));
+void cpufreq_clear_freq_update_hook(int cpu);
#endif
#ifdef CONFIG_SCHED_AUTOGROUP
Index: linux-pm/kernel/sched/cpufreq.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq.c
+++ linux-pm/kernel/sched/cpufreq.c
@@ -9,12 +9,12 @@
* published by the Free Software Foundation.
*/
-#include <linux/sched.h>
+#include "sched.h"
static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook);
/**
- * cpufreq_set_freq_update_hook - Populate the CPU's freq_update_hook pointer.
+ * set_freq_update_hook - Populate the CPU's freq_update_hook pointer.
* @cpu: The CPU to set the pointer for.
* @hook: New pointer value.
*
@@ -27,23 +27,75 @@ static DEFINE_PER_CPU(struct freq_update
* accessed via the old update_util_data pointer or invoke synchronize_sched()
* right after this function to avoid use-after-free.
*/
-void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook)
+static void set_freq_update_hook(int cpu, struct freq_update_hook *hook)
{
- if (WARN_ON(hook && !hook->func))
+ rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook);
+}
+
+/**
+ * cpufreq_set_freq_update_hook - Set the CPU's frequency update callback.
+ * @cpu: The CPU to set the callback for.
+ * @hook: New freq_update_hook pointer value.
+ * @func: Callback function to use with the new hook.
+ */
+void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook,
+ void (*func)(struct freq_update_hook *hook, u64 time,
+ unsigned long util, unsigned long max))
+{
+ if (WARN_ON(!hook || !func))
return;
- rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook);
+ hook->func = func;
+ set_freq_update_hook(cpu, hook);
}
EXPORT_SYMBOL_GPL(cpufreq_set_freq_update_hook);
/**
+ * cpufreq_clear_freq_update_hook - Clear the CPU's freq_update_hook pointer.
+ * @cpu: The CPU to clear the pointer for.
+ */
+void cpufreq_clear_freq_update_hook(int cpu)
+{
+ set_freq_update_hook(cpu, NULL);
+}
+EXPORT_SYMBOL_GPL(cpufreq_clear_freq_update_hook);
+
+/**
+ * cpufreq_update_util - Take a note about CPU utilization changes.
+ * @time: Current time.
+ * @util: CPU utilization.
+ * @max: CPU capacity.
+ *
+ * This function is called on every invocation of update_load_avg() on the CPU
+ * whose utilization is being updated.
+ *
+ * It can only be called from RCU-sched read-side critical sections.
+ */
+void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+{
+ struct freq_update_hook *hook;
+
+#ifdef CONFIG_LOCKDEP
+ WARN_ON(debug_locks && !rcu_read_lock_sched_held());
+#endif
+
+ hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook));
+ /*
+ * If this isn't inside of an RCU-sched read-side critical section, hook
+ * may become NULL after the check below.
+ */
+ if (hook)
+ hook->func(hook, time, util, max);
+}
+
+/**
* cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
* @time: Current time.
*
* The way cpufreq is currently arranged requires it to evaluate the CPU
* performance state (frequency/voltage) on a regular basis. To facilitate
- * that, this function is called by update_load_avg() in CFS when executed for
- * the current CPU's runqueue.
+ * that, cpufreq_update_util() is called by update_load_avg() in CFS when
+ * executed for the current CPU's runqueue.
*
* However, this isn't sufficient to prevent the CPU from being stuck in a
* completely inadequate performance level for too long, because the calls
@@ -57,17 +109,5 @@ EXPORT_SYMBOL_GPL(cpufreq_set_freq_updat
*/
void cpufreq_trigger_update(u64 time)
{
- struct freq_update_hook *hook;
-
-#ifdef CONFIG_LOCKDEP
- WARN_ON(debug_locks && !rcu_read_lock_sched_held());
-#endif
-
- hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook));
- /*
- * If this isn't inside of an RCU-sched read-side critical section, hook
- * may become NULL after the check below.
- */
- if (hook)
- hook->func(hook, time);
+ cpufreq_update_util(time, ULONG_MAX, 0);
}
Index: linux-pm/kernel/sched/fair.c
===================================================================
--- linux-pm.orig/kernel/sched/fair.c
+++ linux-pm/kernel/sched/fair.c
@@ -2839,6 +2839,8 @@ static inline void update_load_avg(struc
update_tg_load_avg(cfs_rq, 0);
if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
+ unsigned long max = rq->cpu_capacity_orig;
+
/*
* There are a few boundary cases this might miss but it should
* get called often enough that that should (hopefully) not be
@@ -2847,9 +2849,11 @@ static inline void update_load_avg(struc
* the next tick/schedule should update.
*
* It will not get called when we go idle, because the idle
- * thread is a different class (!fair).
+ * thread is a different class (!fair), nor will the utilization
+ * number include things like RT tasks.
*/
- cpufreq_trigger_update(rq_clock(rq));
+ cpufreq_update_util(rq_clock(rq),
+ min(cfs_rq->avg.util_avg, max), max);
}
}
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -1020,7 +1020,9 @@ static inline void intel_pstate_adjust_b
sample->freq);
}
-static void intel_pstate_freq_update(struct freq_update_hook *hook, u64 time)
+static void intel_pstate_freq_update(struct freq_update_hook *hook, u64 time,
+ unsigned long util_not_used,
+ unsigned long max_not_used)
{
struct cpudata *cpu = container_of(hook, struct cpudata, update_hook);
u64 delta_ns = time - cpu->sample.time;
@@ -1088,8 +1090,8 @@ static int intel_pstate_init_cpu(unsigne
intel_pstate_busy_pid_reset(cpu);
intel_pstate_sample(cpu, 0);
- cpu->update_hook.func = intel_pstate_freq_update;
- cpufreq_set_freq_update_hook(cpunum, &cpu->update_hook);
+ cpufreq_set_freq_update_hook(cpunum, &cpu->update_hook,
+ intel_pstate_freq_update);
pr_debug("intel_pstate: controlling: cpu %d\n", cpunum);
@@ -1173,7 +1175,7 @@ static void intel_pstate_stop_cpu(struct
pr_debug("intel_pstate: CPU %d exiting\n", cpu_num);
- cpufreq_set_freq_update_hook(cpu_num, NULL);
+ cpufreq_clear_freq_update_hook(cpu_num);
synchronize_sched();
if (hwp_active)
@@ -1441,7 +1443,7 @@ out:
get_online_cpus();
for_each_online_cpu(cpu) {
if (all_cpu_data[cpu]) {
- cpufreq_set_freq_update_hook(cpu, NULL);
+ cpufreq_clear_freq_update_hook(cpu);
synchronize_sched();
kfree(all_cpu_data[cpu]);
}
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -211,43 +211,6 @@ unsigned int dbs_update(struct cpufreq_p
}
EXPORT_SYMBOL_GPL(dbs_update);
-static void gov_set_freq_update_hooks(struct policy_dbs_info *policy_dbs,
- unsigned int delay_us)
-{
- struct cpufreq_policy *policy = policy_dbs->policy;
- int cpu;
-
- gov_update_sample_delay(policy_dbs, delay_us);
- policy_dbs->last_sample_time = 0;
-
- for_each_cpu(cpu, policy->cpus) {
- struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu);
-
- cpufreq_set_freq_update_hook(cpu, &cdbs->update_hook);
- }
-}
-
-static inline void gov_clear_freq_update_hooks(struct cpufreq_policy *policy)
-{
- int i;
-
- for_each_cpu(i, policy->cpus)
- cpufreq_set_freq_update_hook(i, NULL);
-
- synchronize_sched();
-}
-
-static void gov_cancel_work(struct cpufreq_policy *policy)
-{
- struct policy_dbs_info *policy_dbs = policy->governor_data;
-
- gov_clear_freq_update_hooks(policy_dbs->policy);
- irq_work_sync(&policy_dbs->irq_work);
- cancel_work_sync(&policy_dbs->work);
- atomic_set(&policy_dbs->work_count, 0);
- policy_dbs->work_in_progress = false;
-}
-
static void dbs_work_handler(struct work_struct *work)
{
struct policy_dbs_info *policy_dbs;
@@ -285,7 +248,9 @@ static void dbs_irq_work(struct irq_work
schedule_work(&policy_dbs->work);
}
-static void dbs_freq_update_handler(struct freq_update_hook *hook, u64 time)
+static void dbs_freq_update_handler(struct freq_update_hook *hook, u64 time,
+ unsigned long util_not_used,
+ unsigned long max_not_used)
{
struct cpu_dbs_info *cdbs = container_of(hook, struct cpu_dbs_info, update_hook);
struct policy_dbs_info *policy_dbs = cdbs->policy_dbs;
@@ -334,6 +299,44 @@ static void dbs_freq_update_handler(stru
irq_work_queue(&policy_dbs->irq_work);
}
+static void gov_set_freq_update_hooks(struct policy_dbs_info *policy_dbs,
+ unsigned int delay_us)
+{
+ struct cpufreq_policy *policy = policy_dbs->policy;
+ int cpu;
+
+ gov_update_sample_delay(policy_dbs, delay_us);
+ policy_dbs->last_sample_time = 0;
+
+ for_each_cpu(cpu, policy->cpus) {
+ struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu);
+
+ cpufreq_set_freq_update_hook(cpu, &cdbs->update_hook,
+ dbs_freq_update_handler);
+ }
+}
+
+static inline void gov_clear_freq_update_hooks(struct cpufreq_policy *policy)
+{
+ int i;
+
+ for_each_cpu(i, policy->cpus)
+ cpufreq_clear_freq_update_hook(i);
+
+ synchronize_sched();
+}
+
+static void gov_cancel_work(struct cpufreq_policy *policy)
+{
+ struct policy_dbs_info *policy_dbs = policy->governor_data;
+
+ gov_clear_freq_update_hooks(policy_dbs->policy);
+ irq_work_sync(&policy_dbs->irq_work);
+ cancel_work_sync(&policy_dbs->work);
+ atomic_set(&policy_dbs->work_count, 0);
+ policy_dbs->work_in_progress = false;
+}
+
static struct policy_dbs_info *alloc_policy_dbs_info(struct cpufreq_policy *policy,
struct dbs_governor *gov)
{
@@ -356,7 +359,6 @@ static struct policy_dbs_info *alloc_pol
struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);
j_cdbs->policy_dbs = policy_dbs;
- j_cdbs->update_hook.func = dbs_freq_update_handler;
}
return policy_dbs;
}
Index: linux-pm/kernel/sched/sched.h
===================================================================
--- linux-pm.orig/kernel/sched/sched.h
+++ linux-pm/kernel/sched/sched.h
@@ -1739,3 +1739,12 @@ static inline u64 irq_time_read(int cpu)
}
#endif /* CONFIG_64BIT */
#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
+#ifdef CONFIG_CPU_FREQ
+void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
+void cpufreq_trigger_update(u64 time);
+#else
+static inline void cpufreq_update_util(u64 time, unsigned long util,
+ unsigned long max) {}
+static inline void cpufreq_trigger_update(u64 time) {}
+#endif /* CONFIG_CPU_FREQ */
From: Rafael J. Wysocki <[email protected]>
Add a new cpufreq scaling governor, called "schedutil", that uses
scheduler-provided CPU utilization information as input for making
its decisions.
Doing that is possible after commit fe7034338ba0 (cpufreq: Add
mechanism for registering utilization update callbacks) that
introduced cpufreq_update_util() called by the scheduler on
utilization changes (from CFS) and RT/DL task status updates.
In particular, CPU frequency scaling decisions may be based on
the utilization data passed to cpufreq_update_util() by CFS.
The new governor is relatively simple.
The frequency selection formula used by it is
f = 1.1 * max_freq * util / max
where util and max are the utilization and CPU capacity coming from
CFS and max_freq is the nominal maximum frequency of the CPU (as
reported by the cpufreq driver).
All of the computations are carried out in the utilization update
handlers provided by the new governor. One of those handlers is
used for cpufreq policies shared between multiple CPUs and the other
one is for policies with one CPU only (and therefore it doesn't need
to use any extra synchronization means).
The governor supports fast frequency switching if that is supported
by the cpufreq driver in use and possible for the given policy.
In the fast switching case, all operations of the governor take
place in its utilization update handlers. If fast switching cannot
be used, the frequency switch operations are carried out with the
help of a work item which only calls __cpufreq_driver_target()
(under a mutex) to trigger a frequency update (to a value already
computed beforehand in one of the utilization update handlers).
Currently, the governor treats all of the RT and DL tasks as
"unknown utilization" and sets the frequency to the allowed
maximum when updated from the RT or DL sched classes. That
heavy-handed approach should be replaced with something more
subtle and specifically targeted at RT and DL tasks.
The governor shares some tunables management code with the
"ondemand" and "conservative" governors and uses some common
definitions from cpufreq_governor.h, but apart from that it
is stand-alone.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
Changes from v2:
- The governor goes into drivers/cpufreq/.
- The "next frequency" formula has an additional 1.1 factor to allow
more util/max values to map onto the top-most frequency in case the
  distance between that and the previous one is disproportionately small.
- sugov_update_commit() traces CPU frequency even if the new one is
the same as the previous one (otherwise, if the system is 100% loaded
for long enough, powertop starts to report that all CPUs are 100% idle).
---
drivers/cpufreq/Kconfig | 26 +
drivers/cpufreq/Makefile | 1
drivers/cpufreq/cpufreq_schedutil.c | 509 ++++++++++++++++++++++++++++++++++++
3 files changed, 536 insertions(+)
Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -107,6 +107,16 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
Be aware that not all cpufreq drivers support the conservative
governor. If unsure have a look at the help section of the
driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+ bool "schedutil"
+ select CPU_FREQ_GOV_SCHEDUTIL
+ select CPU_FREQ_GOV_PERFORMANCE
+ help
+ Use the 'schedutil' CPUFreq governor by default. If unsure,
+ have a look at the help section of that governor. The fallback
+ governor will be 'performance'.
+
endchoice
config CPU_FREQ_GOV_PERFORMANCE
@@ -188,6 +198,22 @@ config CPU_FREQ_GOV_CONSERVATIVE
If in doubt, say N.
+config CPU_FREQ_GOV_SCHEDUTIL
+ tristate "'schedutil' cpufreq policy governor"
+ depends on CPU_FREQ
+ select CPU_FREQ_GOV_ATTR_SET
+ select IRQ_WORK
+ help
+ The frequency selection formula used by this governor is analogous
+ to the one used by 'ondemand', but instead of computing CPU load
+ as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
+ utilization data provided by the scheduler as input.
+
+ To compile this driver as a module, choose M here: the
+ module will be called cpufreq_schedutil.
+
+ If in doubt, say N.
+
comment "CPU frequency scaling drivers"
config CPUFREQ_DT
Index: linux-pm/drivers/cpufreq/cpufreq_schedutil.c
===================================================================
--- /dev/null
+++ linux-pm/drivers/cpufreq/cpufreq_schedutil.c
@@ -0,0 +1,509 @@
+/*
+ * CPUFreq governor based on scheduler-provided CPU utilization data.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <trace/events/power.h>
+
+#include "cpufreq_governor.h"
+
+struct sugov_tunables {
+ struct gov_attr_set attr_set;
+ unsigned int rate_limit_us;
+};
+
+struct sugov_policy {
+ struct cpufreq_policy *policy;
+
+ struct sugov_tunables *tunables;
+ struct list_head tunables_hook;
+
+ raw_spinlock_t update_lock; /* For shared policies */
+ u64 last_freq_update_time;
+ s64 freq_update_delay_ns;
+ unsigned int next_freq;
+ unsigned int driver_freq;
+ unsigned int max_freq;
+
+ /* The next fields are only needed if fast switch cannot be used. */
+ struct irq_work irq_work;
+ struct work_struct work;
+ struct mutex work_lock;
+ bool work_in_progress;
+
+ bool need_freq_update;
+};
+
+struct sugov_cpu {
+ struct freq_update_hook update_hook;
+ struct sugov_policy *sg_policy;
+
+ /* The fields below are only needed when sharing a policy. */
+ unsigned long util;
+ unsigned long max;
+ u64 last_update;
+};
+
+static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
+
+/************************ Governor internals ***********************/
+
+static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
+{
+ u64 delta_ns;
+
+ if (sg_policy->work_in_progress)
+ return false;
+
+ if (unlikely(sg_policy->need_freq_update)) {
+ sg_policy->need_freq_update = false;
+ return true;
+ }
+
+ delta_ns = time - sg_policy->last_freq_update_time;
+ return (s64)delta_ns >= sg_policy->freq_update_delay_ns;
+}
+
+static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
+ unsigned int next_freq)
+{
+ struct cpufreq_policy *policy = sg_policy->policy;
+ unsigned int freq;
+
+ if (next_freq > policy->max)
+ next_freq = policy->max;
+ else if (next_freq < policy->min)
+ next_freq = policy->min;
+
+ sg_policy->last_freq_update_time = time;
+ if (sg_policy->next_freq == next_freq) {
+ if (!policy->fast_switch_possible)
+ return;
+
+ freq = sg_policy->driver_freq;
+ } else {
+ sg_policy->next_freq = next_freq;
+ if (!policy->fast_switch_possible) {
+ sg_policy->work_in_progress = true;
+ irq_work_queue(&sg_policy->irq_work);
+ return;
+ }
+ freq = cpufreq_driver_fast_switch(policy, next_freq);
+ if (freq == CPUFREQ_ENTRY_INVALID)
+ return;
+
+ sg_policy->driver_freq = freq;
+ }
+ policy->cur = freq;
+ trace_cpu_frequency(freq, smp_processor_id());
+}
+
+static void sugov_update_single(struct freq_update_hook *hook, u64 time,
+ unsigned long util, unsigned long max)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ unsigned int max_f, next_f;
+
+ if (!sugov_should_update_freq(sg_policy, time))
+ return;
+
+ max_f = sg_policy->max_freq;
+ next_f = util > max ? max_f : util * max_f / max;
+ sugov_update_commit(sg_policy, time, next_f);
+}
+
+static unsigned int sugov_next_freq(struct sugov_policy *sg_policy,
+ unsigned long util, unsigned long max)
+{
+ struct cpufreq_policy *policy = sg_policy->policy;
+ unsigned int max_f = sg_policy->max_freq;
+ u64 last_freq_update_time = sg_policy->last_freq_update_time;
+ unsigned int j;
+
+ if (util > max)
+ return max_f;
+
+ for_each_cpu(j, policy->cpus) {
+ struct sugov_cpu *j_sg_cpu;
+ unsigned long j_util, j_max;
+ u64 delta_ns;
+
+ if (j == smp_processor_id())
+ continue;
+
+ j_sg_cpu = &per_cpu(sugov_cpu, j);
+ /*
+ * If the CPU utilization was last updated before the previous
+ * frequency update and the time elapsed between the last update
+ * of the CPU utilization and the last frequency update is long
+ * enough, don't take the CPU into account as it probably is
+ * idle now.
+ */
+ delta_ns = last_freq_update_time - j_sg_cpu->last_update;
+ if ((s64)delta_ns > NSEC_PER_SEC / HZ)
+ continue;
+
+ j_util = j_sg_cpu->util;
+ j_max = j_sg_cpu->max;
+ if (j_util > j_max)
+ return max_f;
+
+ if (j_util * max > j_max * util) {
+ util = j_util;
+ max = j_max;
+ }
+ }
+
+ return util * max_f / max;
+}
+
+static void sugov_update_shared(struct freq_update_hook *hook, u64 time,
+ unsigned long util, unsigned long max)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ unsigned int next_f;
+
+ raw_spin_lock(&sg_policy->update_lock);
+
+ sg_cpu->util = util;
+ sg_cpu->max = max;
+ sg_cpu->last_update = time;
+
+ if (sugov_should_update_freq(sg_policy, time)) {
+ next_f = sugov_next_freq(sg_policy, util, max);
+ sugov_update_commit(sg_policy, time, next_f);
+ }
+
+ raw_spin_unlock(&sg_policy->update_lock);
+}
+
+static void sugov_work(struct work_struct *work)
+{
+ struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
+
+ mutex_lock(&sg_policy->work_lock);
+ __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
+ CPUFREQ_RELATION_L);
+ mutex_unlock(&sg_policy->work_lock);
+
+ sg_policy->work_in_progress = false;
+}
+
+static void sugov_irq_work(struct irq_work *irq_work)
+{
+ struct sugov_policy *sg_policy;
+
+ sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
+ schedule_work(&sg_policy->work);
+}
+
+/************************** sysfs interface ************************/
+
+static struct sugov_tunables *global_tunables;
+static DEFINE_MUTEX(global_tunables_lock);
+
+static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set)
+{
+ return container_of(attr_set, struct sugov_tunables, attr_set);
+}
+
+static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+
+ return sprintf(buf, "%u\n", tunables->rate_limit_us);
+}
+
+static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf,
+ size_t count)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+ struct sugov_policy *sg_policy;
+ unsigned int rate_limit_us;
+ int ret;
+
+ ret = sscanf(buf, "%u", &rate_limit_us);
+ if (ret != 1)
+ return -EINVAL;
+
+ tunables->rate_limit_us = rate_limit_us;
+
+ list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
+ sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
+
+ return count;
+}
+
+static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
+
+static struct attribute *sugov_attributes[] = {
+ &rate_limit_us.attr,
+ NULL
+};
+
+static struct kobj_type sugov_tunables_ktype = {
+ .default_attrs = sugov_attributes,
+ .sysfs_ops = &governor_sysfs_ops,
+};
+
+/********************** cpufreq governor interface *********************/
+
+static struct cpufreq_governor schedutil_gov;
+
+static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
+{
+ unsigned int max_f = policy->cpuinfo.max_freq;
+ struct sugov_policy *sg_policy;
+
+ sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
+ if (!sg_policy)
+ return NULL;
+
+ sg_policy->policy = policy;
+ /*
+ * Take the proportionality coefficient between util/max and frequency
+ * to be 1.1 times the nominal maximum frequency to boost performance
+ * slightly on systems with a narrow top-most frequency bin.
+ */
+ sg_policy->max_freq = max_f + max_f / 10;
+ init_irq_work(&sg_policy->irq_work, sugov_irq_work);
+ INIT_WORK(&sg_policy->work, sugov_work);
+ mutex_init(&sg_policy->work_lock);
+ raw_spin_lock_init(&sg_policy->update_lock);
+ return sg_policy;
+}
+
+static void sugov_policy_free(struct sugov_policy *sg_policy)
+{
+ mutex_destroy(&sg_policy->work_lock);
+ kfree(sg_policy);
+}
+
+static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy)
+{
+ struct sugov_tunables *tunables;
+
+ tunables = kzalloc(sizeof(*tunables), GFP_KERNEL);
+ if (tunables)
+ gov_attr_set_init(&tunables->attr_set, &sg_policy->tunables_hook);
+
+ return tunables;
+}
+
+static void sugov_tunables_free(struct sugov_tunables *tunables)
+{
+ if (!have_governor_per_policy())
+ global_tunables = NULL;
+
+ kfree(tunables);
+}
+
+static int sugov_init(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy;
+ struct sugov_tunables *tunables;
+ unsigned int lat;
+ int ret = 0;
+
+ /* State should be equivalent to EXIT */
+ if (policy->governor_data)
+ return -EBUSY;
+
+ sg_policy = sugov_policy_alloc(policy);
+ if (!sg_policy)
+ return -ENOMEM;
+
+ mutex_lock(&global_tunables_lock);
+
+ if (global_tunables) {
+ if (WARN_ON(have_governor_per_policy())) {
+ ret = -EINVAL;
+ goto free_sg_policy;
+ }
+ policy->governor_data = sg_policy;
+ sg_policy->tunables = global_tunables;
+
+ gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook);
+ goto out;
+ }
+
+ tunables = sugov_tunables_alloc(sg_policy);
+ if (!tunables) {
+ ret = -ENOMEM;
+ goto free_sg_policy;
+ }
+
+ tunables->rate_limit_us = LATENCY_MULTIPLIER;
+ lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
+ if (lat)
+ tunables->rate_limit_us *= lat;
+
+ if (!have_governor_per_policy())
+ global_tunables = tunables;
+
+ policy->governor_data = sg_policy;
+ sg_policy->tunables = tunables;
+
+ ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
+ get_governor_parent_kobj(policy), "%s",
+ schedutil_gov.name);
+ if (!ret)
+ goto out;
+
+ /* Failure, so roll back. */
+ policy->governor_data = NULL;
+ sugov_tunables_free(tunables);
+
+ free_sg_policy:
+ pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
+ sugov_policy_free(sg_policy);
+
+ out:
+ mutex_unlock(&global_tunables_lock);
+ return ret;
+}
+
+static int sugov_exit(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ struct sugov_tunables *tunables = sg_policy->tunables;
+ unsigned int count;
+
+ mutex_lock(&global_tunables_lock);
+
+ count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook);
+ policy->governor_data = NULL;
+ if (!count)
+ sugov_tunables_free(tunables);
+
+ mutex_unlock(&global_tunables_lock);
+
+ sugov_policy_free(sg_policy);
+ return 0;
+}
+
+static int sugov_start(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ unsigned int cpu;
+
+ sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+ sg_policy->last_freq_update_time = 0;
+ sg_policy->next_freq = UINT_MAX;
+ sg_policy->work_in_progress = false;
+ sg_policy->need_freq_update = false;
+
+ for_each_cpu(cpu, policy->cpus) {
+ struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
+
+ sg_cpu->sg_policy = sg_policy;
+ if (policy_is_shared(policy)) {
+ sg_cpu->util = ULONG_MAX;
+ sg_cpu->max = 0;
+ sg_cpu->last_update = 0;
+ cpufreq_set_freq_update_hook(cpu, &sg_cpu->update_hook,
+ sugov_update_shared);
+ } else {
+ cpufreq_set_freq_update_hook(cpu, &sg_cpu->update_hook,
+ sugov_update_single);
+ }
+ }
+ return 0;
+}
+
+static int sugov_stop(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ unsigned int cpu;
+
+ for_each_cpu(cpu, policy->cpus)
+ cpufreq_clear_freq_update_hook(cpu);
+
+ synchronize_sched();
+
+ irq_work_sync(&sg_policy->irq_work);
+ cancel_work_sync(&sg_policy->work);
+ return 0;
+}
+
+static int sugov_limits(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+
+ if (!policy->fast_switch_possible) {
+ mutex_lock(&sg_policy->work_lock);
+
+ if (policy->max < policy->cur)
+ __cpufreq_driver_target(policy, policy->max,
+ CPUFREQ_RELATION_H);
+ else if (policy->min > policy->cur)
+ __cpufreq_driver_target(policy, policy->min,
+ CPUFREQ_RELATION_L);
+
+ mutex_unlock(&sg_policy->work_lock);
+ }
+
+ sg_policy->need_freq_update = true;
+ return 0;
+}
+
+int sugov_governor(struct cpufreq_policy *policy, unsigned int event)
+{
+ if (event == CPUFREQ_GOV_POLICY_INIT) {
+ return sugov_init(policy);
+ } else if (policy->governor_data) {
+ switch (event) {
+ case CPUFREQ_GOV_POLICY_EXIT:
+ return sugov_exit(policy);
+ case CPUFREQ_GOV_START:
+ return sugov_start(policy);
+ case CPUFREQ_GOV_STOP:
+ return sugov_stop(policy);
+ case CPUFREQ_GOV_LIMITS:
+ return sugov_limits(policy);
+ }
+ }
+ return -EINVAL;
+}
+
+static struct cpufreq_governor schedutil_gov = {
+ .name = "schedutil",
+ .governor = sugov_governor,
+ .owner = THIS_MODULE,
+};
+
+static int __init sugov_module_init(void)
+{
+ return cpufreq_register_governor(&schedutil_gov);
+}
+
+static void __exit sugov_module_exit(void)
+{
+ cpufreq_unregister_governor(&schedutil_gov);
+}
+
+MODULE_AUTHOR("Rafael J. Wysocki <[email protected]>");
+MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
+MODULE_LICENSE("GPL");
+
+#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+struct cpufreq_governor *cpufreq_default_governor(void)
+{
+ return &schedutil_gov;
+}
+
+fs_initcall(sugov_module_init);
+#else
+module_init(sugov_module_init);
+#endif
+module_exit(sugov_module_exit);
Index: linux-pm/drivers/cpufreq/Makefile
===================================================================
--- linux-pm.orig/drivers/cpufreq/Makefile
+++ linux-pm/drivers/cpufreq/Makefile
@@ -12,6 +12,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND) += c
obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE) += cpufreq_conservative.o
obj-$(CONFIG_CPU_FREQ_GOV_COMMON) += cpufreq_governor.o
obj-$(CONFIG_CPU_FREQ_GOV_ATTR_SET) += cpufreq_governor_attr_set.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
obj-$(CONFIG_CPUFREQ_DT) += cpufreq-dt.o
On Friday, March 04, 2016 03:56:09 AM Rafael J. Wysocki wrote:
> On Wednesday, March 02, 2016 02:56:28 AM Rafael J. Wysocki wrote:
> > Hi,
> >
> > My previous intro message still applies somewhat, so here's a link:
> >
> > http://marc.info/?l=linux-pm&m=145609673008122&w=2
> >
> > The executive summary of the motivation is that I wanted to do two things:
> > use the utilization data from the scheduler (it's passed to the governor
> > as aguments of update callbacks anyway) and make it possible to set
> > CPU frequency without involving process context (fast frequency switching).
> >
> > Both have been prototyped in the previous RFCs:
> >
> > https://patchwork.kernel.org/patch/8426691/
> > https://patchwork.kernel.org/patch/8426741/
> >
>
> [cut]
>
> >
> > Comments welcome.
>
> There were quite a few comments to address, so here's a new version.
>
> First off, my interpretation of what Ingo said earlier today (or yesterday
> depending on your time zone) is that he wants all of the code dealing with
> the util and max values to be located in kernel/sched/. I can understand
> the motivation here, although schedutil shares some amount of code with
> the other governors, so the dependency on cpufreq will still be there, even
> if the code goes to kernel/sched/. Nevertheless, I decided to make that
> change just to see how it would look, if for nothing else.
>
> To that end, I revived a patch I had before the first schedutil one to
> remove util/max from the cpufreq hooks [7/10], moved the scheduler-related
> code from drivers/cpufreq/cpufreq.c to kernel/sched/cpufreq.c (new file)
> on top of that [8/10] and reintroduced cpufreq_update_util() in a slightly
> different form [9/10]. I did it this way in case it turns out to be
> necessary to apply [7/10] and [8/10] for the time being and defer the rest
> to the next cycle.
>
> Apart from that, I changed the frequency selection formula in the new
> governor to next_freq = util * max_freq / max and it seems to work. That
> allowed the code to be simplified somewhat as I don't need the extra
> relation field in struct sugov_policy now (RELATION_L is used everywhere).
>
> Finally, I tried to address the bikeshed comment from Viresh about the
> "wrong" names of data types etc related to governor sysfs attributes
> handling. Hopefully, the new ones are better.
>
> There are small tweaks all over on top of that.
I've taken patches [1-2/10] from the previous iteration into linux-next
as they were not controversial and improved things anyway.
What follows is reordered a bit and reworked with respect to v2.
Patches [1-4/7] have not been modified (ie. resends).
Patch [5/7] (fast switch support) has a mechanism to deal with notifiers
included (works for me with the ACPI driver) and cpufreq_driver_fast_switch()
is just a wrapper around the driver callback now (because the governor needs
to do frequency tracing by itself as it turns out).
Patch [6/7] makes the hooks use util and max arguments again, but this time
the callback function format is the same for everyone (ie. 4 arguments) and
the new governor added by patch [7/7] goes into drivers/cpufreq/ as that
is *much* cleaner IMO.
The new frequency formula has been tweaked a bit once more to make more
util/max values map to the top-most frequency (that matters for systems
where turbo is "encoded" by an extra frequency level where the frequency
is greater by 1 MHz from the previous one, for example).
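To illustrate the kind of mapping I mean (this is only a sketch with a
made-up margin, not the exact code in patch [7/7]):

	static unsigned int next_freq_sketch(unsigned long util, unsigned long max,
					     unsigned int max_freq)
	{
		/*
		 * Adding a margin to util before scaling makes util/max values
		 * close to 1 select max_freq already, so an extra "turbo" entry
		 * only 1 MHz above the nominal top frequency is still reachable.
		 */
		unsigned long boosted = util + (util >> 2);

		return boosted >= max ? max_freq : boosted * max_freq / max;
	}
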
At this point I'm inclined to take patches [1-2/7] into linux-next for 4.6,
because they set a clear boundary between the current linux-next code (which
doesn't really use the utilization data) and schedutil, and to defer the rest
until after the 4.6 merge window. That will allow the new next frequency
formula to be tested and maybe we can do something about passing util data
from DL to cpufreq_update_util() in the meantime.
If anyone has any issues with that plan, please let me know.
Thanks,
Rafael
From: Rafael J. Wysocki <[email protected]>
Commit fe7034338ba0 (cpufreq: Add mechanism for registering
utilization update callbacks) added cpufreq_update_util() to be
called by the scheduler (from the CFS part) on utilization updates.
The goal was to allow CFS to pass utilization information to cpufreq
and to trigger it to evaluate the frequency/voltage configuration
(P-state) of every CPU on a regular basis.
However, the last two arguments of that function are never used by
the current code, so CFS might simply call cpufreq_trigger_update()
instead of it (like the RT and DL sched classes).
For this reason, drop the last two arguments of cpufreq_update_util(),
rename it to cpufreq_trigger_update() and modify CFS to call it.
Moreover, since the utilization is not involved in that now, rename
data types, functions and variables related to cpufreq_trigger_update()
to reflect that (eg. struct update_util_data becomes struct
freq_update_hook and so on).
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
No changes from v2.
---
drivers/cpufreq/cpufreq.c | 52 +++++++++++++++++++++----------------
drivers/cpufreq/cpufreq_governor.c | 25 ++++++++---------
drivers/cpufreq/cpufreq_governor.h | 2 -
drivers/cpufreq/intel_pstate.c | 15 ++++------
include/linux/cpufreq.h | 32 ++--------------------
kernel/sched/deadline.c | 2 -
kernel/sched/fair.c | 13 +--------
kernel/sched/rt.c | 2 -
8 files changed, 58 insertions(+), 85 deletions(-)
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -65,57 +65,65 @@ static struct cpufreq_driver *cpufreq_dr
static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data);
static DEFINE_RWLOCK(cpufreq_driver_lock);
-static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
+static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook);
/**
- * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer.
+ * cpufreq_set_freq_update_hook - Populate the CPU's freq_update_hook pointer.
* @cpu: The CPU to set the pointer for.
- * @data: New pointer value.
+ * @hook: New pointer value.
*
- * Set and publish the update_util_data pointer for the given CPU. That pointer
- * points to a struct update_util_data object containing a callback function
- * to call from cpufreq_update_util(). That function will be called from an RCU
- * read-side critical section, so it must not sleep.
+ * Set and publish the freq_update_hook pointer for the given CPU. That pointer
+ * points to a struct freq_update_hook object containing a callback function
+ * to call from cpufreq_trigger_update(). That function will be called from
+ * an RCU read-side critical section, so it must not sleep.
*
* Callers must use RCU-sched callbacks to free any memory that might be
* accessed via the old update_util_data pointer or invoke synchronize_sched()
* right after this function to avoid use-after-free.
*/
-void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
+void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook)
{
- if (WARN_ON(data && !data->func))
+ if (WARN_ON(hook && !hook->func))
return;
- rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
+ rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook);
}
-EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
+EXPORT_SYMBOL_GPL(cpufreq_set_freq_update_hook);
/**
- * cpufreq_update_util - Take a note about CPU utilization changes.
+ * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
* @time: Current time.
- * @util: Current utilization.
- * @max: Utilization ceiling.
*
- * This function is called by the scheduler on every invocation of
- * update_load_avg() on the CPU whose utilization is being updated.
+ * The way cpufreq is currently arranged requires it to evaluate the CPU
+ * performance state (frequency/voltage) on a regular basis. To facilitate
+ * that, this function is called by update_load_avg() in CFS when executed for
+ * the current CPU's runqueue.
*
- * It can only be called from RCU-sched read-side critical sections.
+ * However, this isn't sufficient to prevent the CPU from being stuck in a
+ * completely inadequate performance level for too long, because the calls
+ * from CFS will not be made if RT or deadline tasks are active all the time
+ * (or there are RT and DL tasks only).
+ *
+ * As a workaround for that issue, this function is called by the RT and DL
+ * sched classes to trigger extra cpufreq updates to prevent it from stalling,
+ * but that really is a band-aid. Going forward it should be replaced with
+ * solutions targeted more specifically at RT and DL tasks.
*/
-void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+void cpufreq_trigger_update(u64 time)
{
- struct update_util_data *data;
+ struct freq_update_hook *hook;
#ifdef CONFIG_LOCKDEP
WARN_ON(debug_locks && !rcu_read_lock_sched_held());
#endif
- data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
+ hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook));
/*
* If this isn't inside of an RCU-sched read-side critical section, data
* may become NULL after the check below.
*/
- if (data)
- data->func(data, time, util, max);
+ if (hook)
+ hook->func(hook, time);
}
/* Flag to suspend/resume CPUFreq governors */
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -146,35 +146,13 @@ static inline bool policy_is_shared(stru
extern struct kobject *cpufreq_global_kobject;
#ifdef CONFIG_CPU_FREQ
-void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
+void cpufreq_trigger_update(u64 time);
-/**
- * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
- * @time: Current time.
- *
- * The way cpufreq is currently arranged requires it to evaluate the CPU
- * performance state (frequency/voltage) on a regular basis to prevent it from
- * being stuck in a completely inadequate performance level for too long.
- * That is not guaranteed to happen if the updates are only triggered from CFS,
- * though, because they may not be coming in if RT or deadline tasks are active
- * all the time (or there are RT and DL tasks only).
- *
- * As a workaround for that issue, this function is called by the RT and DL
- * sched classes to trigger extra cpufreq updates to prevent it from stalling,
- * but that really is a band-aid. Going forward it should be replaced with
- * solutions targeted more specifically at RT and DL tasks.
- */
-static inline void cpufreq_trigger_update(u64 time)
-{
- cpufreq_update_util(time, ULONG_MAX, 0);
-}
-
-struct update_util_data {
- void (*func)(struct update_util_data *data,
- u64 time, unsigned long util, unsigned long max);
+struct freq_update_hook {
+ void (*func)(struct freq_update_hook *hook, u64 time);
};
-void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);
+void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook);
unsigned int cpufreq_get(unsigned int cpu);
unsigned int cpufreq_quick_get(unsigned int cpu);
@@ -187,8 +165,6 @@ int cpufreq_update_policy(unsigned int c
bool have_governor_per_policy(void);
struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
#else
-static inline void cpufreq_update_util(u64 time, unsigned long util,
- unsigned long max) {}
static inline void cpufreq_trigger_update(u64 time) {}
static inline unsigned int cpufreq_get(unsigned int cpu)
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -62,10 +62,10 @@ ssize_t store_sampling_rate(struct dbs_d
mutex_lock(&policy_dbs->timer_mutex);
/*
* On 32-bit architectures this may race with the
- * sample_delay_ns read in dbs_update_util_handler(), but that
+ * sample_delay_ns read in dbs_freq_update_handler(), but that
* really doesn't matter. If the read returns a value that's
* too big, the sample will be skipped, but the next invocation
- * of dbs_update_util_handler() (when the update has been
+ * of dbs_freq_update_handler() (when the update has been
* completed) will take a sample.
*
* If this runs in parallel with dbs_work_handler(), we may end
@@ -257,7 +257,7 @@ unsigned int dbs_update(struct cpufreq_p
}
EXPORT_SYMBOL_GPL(dbs_update);
-static void gov_set_update_util(struct policy_dbs_info *policy_dbs,
+static void gov_set_freq_update_hooks(struct policy_dbs_info *policy_dbs,
unsigned int delay_us)
{
struct cpufreq_policy *policy = policy_dbs->policy;
@@ -269,16 +269,16 @@ static void gov_set_update_util(struct p
for_each_cpu(cpu, policy->cpus) {
struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu);
- cpufreq_set_update_util_data(cpu, &cdbs->update_util);
+ cpufreq_set_freq_update_hook(cpu, &cdbs->update_hook);
}
}
-static inline void gov_clear_update_util(struct cpufreq_policy *policy)
+static inline void gov_clear_freq_update_hooks(struct cpufreq_policy *policy)
{
int i;
for_each_cpu(i, policy->cpus)
- cpufreq_set_update_util_data(i, NULL);
+ cpufreq_set_freq_update_hook(i, NULL);
synchronize_sched();
}
@@ -287,7 +287,7 @@ static void gov_cancel_work(struct cpufr
{
struct policy_dbs_info *policy_dbs = policy->governor_data;
- gov_clear_update_util(policy_dbs->policy);
+ gov_clear_freq_update_hooks(policy_dbs->policy);
irq_work_sync(&policy_dbs->irq_work);
cancel_work_sync(&policy_dbs->work);
atomic_set(&policy_dbs->work_count, 0);
@@ -331,10 +331,9 @@ static void dbs_irq_work(struct irq_work
schedule_work(&policy_dbs->work);
}
-static void dbs_update_util_handler(struct update_util_data *data, u64 time,
- unsigned long util, unsigned long max)
+static void dbs_freq_update_handler(struct freq_update_hook *hook, u64 time)
{
- struct cpu_dbs_info *cdbs = container_of(data, struct cpu_dbs_info, update_util);
+ struct cpu_dbs_info *cdbs = container_of(hook, struct cpu_dbs_info, update_hook);
struct policy_dbs_info *policy_dbs = cdbs->policy_dbs;
u64 delta_ns, lst;
@@ -403,7 +402,7 @@ static struct policy_dbs_info *alloc_pol
struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);
j_cdbs->policy_dbs = policy_dbs;
- j_cdbs->update_util.func = dbs_update_util_handler;
+ j_cdbs->update_hook.func = dbs_freq_update_handler;
}
return policy_dbs;
}
@@ -419,7 +418,7 @@ static void free_policy_dbs_info(struct
struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);
j_cdbs->policy_dbs = NULL;
- j_cdbs->update_util.func = NULL;
+ j_cdbs->update_hook.func = NULL;
}
gov->free(policy_dbs);
}
@@ -586,7 +585,7 @@ static int cpufreq_governor_start(struct
gov->start(policy);
- gov_set_update_util(policy_dbs, sampling_rate);
+ gov_set_freq_update_hooks(policy_dbs, sampling_rate);
return 0;
}
Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -144,7 +144,7 @@ struct cpu_dbs_info {
* wake-up from idle.
*/
unsigned int prev_load;
- struct update_util_data update_util;
+ struct freq_update_hook update_hook;
struct policy_dbs_info *policy_dbs;
};
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -103,7 +103,7 @@ struct _pid {
struct cpudata {
int cpu;
- struct update_util_data update_util;
+ struct freq_update_hook update_hook;
struct pstate_data pstate;
struct vid_data vid;
@@ -1019,10 +1019,9 @@ static inline void intel_pstate_adjust_b
sample->freq);
}
-static void intel_pstate_update_util(struct update_util_data *data, u64 time,
- unsigned long util, unsigned long max)
+static void intel_pstate_freq_update(struct freq_update_hook *hook, u64 time)
{
- struct cpudata *cpu = container_of(data, struct cpudata, update_util);
+ struct cpudata *cpu = container_of(hook, struct cpudata, update_hook);
u64 delta_ns = time - cpu->sample.time;
if ((s64)delta_ns >= pid_params.sample_rate_ns) {
@@ -1088,8 +1087,8 @@ static int intel_pstate_init_cpu(unsigne
intel_pstate_busy_pid_reset(cpu);
intel_pstate_sample(cpu, 0);
- cpu->update_util.func = intel_pstate_update_util;
- cpufreq_set_update_util_data(cpunum, &cpu->update_util);
+ cpu->update_hook.func = intel_pstate_freq_update;
+ cpufreq_set_freq_update_hook(cpunum, &cpu->update_hook);
pr_debug("intel_pstate: controlling: cpu %d\n", cpunum);
@@ -1173,7 +1172,7 @@ static void intel_pstate_stop_cpu(struct
pr_debug("intel_pstate: CPU %d exiting\n", cpu_num);
- cpufreq_set_update_util_data(cpu_num, NULL);
+ cpufreq_set_freq_update_hook(cpu_num, NULL);
synchronize_sched();
if (hwp_active)
@@ -1441,7 +1440,7 @@ out:
get_online_cpus();
for_each_online_cpu(cpu) {
if (all_cpu_data[cpu]) {
- cpufreq_set_update_util_data(cpu, NULL);
+ cpufreq_set_freq_update_hook(cpu, NULL);
synchronize_sched();
kfree(all_cpu_data[cpu]);
}
Index: linux-pm/kernel/sched/fair.c
===================================================================
--- linux-pm.orig/kernel/sched/fair.c
+++ linux-pm/kernel/sched/fair.c
@@ -2839,8 +2839,6 @@ static inline void update_load_avg(struc
update_tg_load_avg(cfs_rq, 0);
if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
- unsigned long max = rq->cpu_capacity_orig;
-
/*
* There are a few boundary cases this might miss but it should
* get called often enough that that should (hopefully) not be
@@ -2849,16 +2847,9 @@ static inline void update_load_avg(struc
* the next tick/schedule should update.
*
* It will not get called when we go idle, because the idle
- * thread is a different class (!fair), nor will the utilization
- * number include things like RT tasks.
- *
- * As is, the util number is not freq-invariant (we'd have to
- * implement arch_scale_freq_capacity() for that).
- *
- * See cpu_util().
+ * thread is a different class (!fair).
*/
- cpufreq_update_util(rq_clock(rq),
- min(cfs_rq->avg.util_avg, max), max);
+ cpufreq_trigger_update(rq_clock(rq));
}
}
Index: linux-pm/kernel/sched/deadline.c
===================================================================
--- linux-pm.orig/kernel/sched/deadline.c
+++ linux-pm/kernel/sched/deadline.c
@@ -726,7 +726,7 @@ static void update_curr_dl(struct rq *rq
if (!dl_task(curr) || !on_dl_rq(dl_se))
return;
- /* Kick cpufreq (see the comment in linux/cpufreq.h). */
+ /* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
if (cpu_of(rq) == smp_processor_id())
cpufreq_trigger_update(rq_clock(rq));
Index: linux-pm/kernel/sched/rt.c
===================================================================
--- linux-pm.orig/kernel/sched/rt.c
+++ linux-pm/kernel/sched/rt.c
@@ -945,7 +945,7 @@ static void update_curr_rt(struct rq *rq
if (curr->sched_class != &rt_sched_class)
return;
- /* Kick cpufreq (see the comment in linux/cpufreq.h). */
+ /* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
if (cpu_of(rq) == smp_processor_id())
cpufreq_trigger_update(rq_clock(rq));
From: Rafael J. Wysocki <[email protected]>
Create cpufreq.c under kernel/sched/ and move the cpufreq code
related to the scheduler to that file. Also move the headers
related to that code from cpufreq.h to sched.h.
No functional changes.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
No changes from v2.
---
drivers/cpufreq/cpufreq.c | 61 ------------------------------
drivers/cpufreq/cpufreq_governor.c | 1
drivers/cpufreq/intel_pstate.c | 1
include/linux/cpufreq.h | 10 -----
include/linux/sched.h | 12 ++++++
kernel/sched/Makefile | 1
kernel/sched/cpufreq.c | 73 +++++++++++++++++++++++++++++++++++++
7 files changed, 88 insertions(+), 71 deletions(-)
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -65,67 +65,6 @@ static struct cpufreq_driver *cpufreq_dr
static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data);
static DEFINE_RWLOCK(cpufreq_driver_lock);
-static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook);
-
-/**
- * cpufreq_set_freq_update_hook - Populate the CPU's freq_update_hook pointer.
- * @cpu: The CPU to set the pointer for.
- * @hook: New pointer value.
- *
- * Set and publish the freq_update_hook pointer for the given CPU. That pointer
- * points to a struct freq_update_hook object containing a callback function
- * to call from cpufreq_trigger_update(). That function will be called from
- * an RCU read-side critical section, so it must not sleep.
- *
- * Callers must use RCU-sched callbacks to free any memory that might be
- * accessed via the old update_util_data pointer or invoke synchronize_sched()
- * right after this function to avoid use-after-free.
- */
-void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook)
-{
- if (WARN_ON(hook && !hook->func))
- return;
-
- rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook);
-}
-EXPORT_SYMBOL_GPL(cpufreq_set_freq_update_hook);
-
-/**
- * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
- * @time: Current time.
- *
- * The way cpufreq is currently arranged requires it to evaluate the CPU
- * performance state (frequency/voltage) on a regular basis. To facilitate
- * that, this function is called by update_load_avg() in CFS when executed for
- * the current CPU's runqueue.
- *
- * However, this isn't sufficient to prevent the CPU from being stuck in a
- * completely inadequate performance level for too long, because the calls
- * from CFS will not be made if RT or deadline tasks are active all the time
- * (or there are RT and DL tasks only).
- *
- * As a workaround for that issue, this function is called by the RT and DL
- * sched classes to trigger extra cpufreq updates to prevent it from stalling,
- * but that really is a band-aid. Going forward it should be replaced with
- * solutions targeted more specifically at RT and DL tasks.
- */
-void cpufreq_trigger_update(u64 time)
-{
- struct freq_update_hook *hook;
-
-#ifdef CONFIG_LOCKDEP
- WARN_ON(debug_locks && !rcu_read_lock_sched_held());
-#endif
-
- hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook));
- /*
- * If this isn't inside of an RCU-sched read-side critical section, data
- * may become NULL after the check below.
- */
- if (hook)
- hook->func(hook, time);
-}
-
/* Flag to suspend/resume CPUFreq governors */
static bool cpufreq_suspended;
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -18,6 +18,7 @@
#include <linux/export.h>
#include <linux/kernel_stat.h>
+#include <linux/sched.h>
#include <linux/slab.h>
#include "cpufreq_governor.h"
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -21,6 +21,7 @@
#include <linux/list.h>
#include <linux/cpu.h>
#include <linux/cpufreq.h>
+#include <linux/sched.h>
#include <linux/sysfs.h>
#include <linux/types.h>
#include <linux/fs.h>
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -146,14 +146,6 @@ static inline bool policy_is_shared(stru
extern struct kobject *cpufreq_global_kobject;
#ifdef CONFIG_CPU_FREQ
-void cpufreq_trigger_update(u64 time);
-
-struct freq_update_hook {
- void (*func)(struct freq_update_hook *hook, u64 time);
-};
-
-void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook);
-
unsigned int cpufreq_get(unsigned int cpu);
unsigned int cpufreq_quick_get(unsigned int cpu);
unsigned int cpufreq_quick_get_max(unsigned int cpu);
@@ -165,8 +157,6 @@ int cpufreq_update_policy(unsigned int c
bool have_governor_per_policy(void);
struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
#else
-static inline void cpufreq_trigger_update(u64 time) {}
-
static inline unsigned int cpufreq_get(unsigned int cpu)
{
return 0;
Index: linux-pm/include/linux/sched.h
===================================================================
--- linux-pm.orig/include/linux/sched.h
+++ linux-pm/include/linux/sched.h
@@ -2362,6 +2362,18 @@ extern u64 scheduler_tick_max_deferment(
static inline bool sched_can_stop_tick(void) { return false; }
#endif
+#ifdef CONFIG_CPU_FREQ
+void cpufreq_trigger_update(u64 time);
+
+struct freq_update_hook {
+ void (*func)(struct freq_update_hook *hook, u64 time);
+};
+
+void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook);
+#else
+static inline void cpufreq_trigger_update(u64 time) {}
+#endif
+
#ifdef CONFIG_SCHED_AUTOGROUP
extern void sched_autogroup_create_attach(struct task_struct *p);
extern void sched_autogroup_detach(struct task_struct *p);
Index: linux-pm/kernel/sched/Makefile
===================================================================
--- linux-pm.orig/kernel/sched/Makefile
+++ linux-pm/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_gr
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ) += cpufreq.o
Index: linux-pm/kernel/sched/cpufreq.c
===================================================================
--- /dev/null
+++ linux-pm/kernel/sched/cpufreq.c
@@ -0,0 +1,73 @@
+/*
+ * Scheduler code and data structures related to cpufreq.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/sched.h>
+
+static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook);
+
+/**
+ * cpufreq_set_freq_update_hook - Populate the CPU's freq_update_hook pointer.
+ * @cpu: The CPU to set the pointer for.
+ * @hook: New pointer value.
+ *
+ * Set and publish the freq_update_hook pointer for the given CPU. That pointer
+ * points to a struct freq_update_hook object containing a callback function
+ * to call from cpufreq_trigger_update(). That function will be called from
+ * an RCU read-side critical section, so it must not sleep.
+ *
+ * Callers must use RCU-sched callbacks to free any memory that might be
+ * accessed via the old update_util_data pointer or invoke synchronize_sched()
+ * right after this function to avoid use-after-free.
+ */
+void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook)
+{
+ if (WARN_ON(hook && !hook->func))
+ return;
+
+ rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook);
+}
+EXPORT_SYMBOL_GPL(cpufreq_set_freq_update_hook);
+
+/**
+ * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
+ * @time: Current time.
+ *
+ * The way cpufreq is currently arranged requires it to evaluate the CPU
+ * performance state (frequency/voltage) on a regular basis. To facilitate
+ * that, this function is called by update_load_avg() in CFS when executed for
+ * the current CPU's runqueue.
+ *
+ * However, this isn't sufficient to prevent the CPU from being stuck in a
+ * completely inadequate performance level for too long, because the calls
+ * from CFS will not be made if RT or deadline tasks are active all the time
+ * (or there are RT and DL tasks only).
+ *
+ * As a workaround for that issue, this function is called by the RT and DL
+ * sched classes to trigger extra cpufreq updates to prevent it from stalling,
+ * but that really is a band-aid. Going forward it should be replaced with
+ * solutions targeted more specifically at RT and DL tasks.
+ */
+void cpufreq_trigger_update(u64 time)
+{
+ struct freq_update_hook *hook;
+
+#ifdef CONFIG_LOCKDEP
+ WARN_ON(debug_locks && !rcu_read_lock_sched_held());
+#endif
+
+ hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook));
+ /*
+ * If this isn't inside of an RCU-sched read-side critical section, hook
+ * may become NULL after the check below.
+ */
+ if (hook)
+ hook->func(hook, time);
+}
From: Rafael J. Wysocki <[email protected]>
In addition to fields representing governor tunables, struct dbs_data
contains some fields needed for the management of objects of that
type. As it turns out, that part of struct dbs_data may be shared
with (future) governors that won't use the common code used by
"ondemand" and "conservative", so move it to a separate struct type
and modify the code using struct dbs_data accordingly.
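For instance, a (hypothetical) future governor with a single tunable could
then reuse the shared part roughly like this (illustrative names only):

	struct rate_limit_tunables {
		struct gov_attr_set attr_set;	/* shared management fields */
		unsigned int rate_limit_us;	/* governor-specific tunable */
	};

	static inline struct rate_limit_tunables *
	to_tunables(struct gov_attr_set *attr_set)
	{
		return container_of(attr_set, struct rate_limit_tunables, attr_set);
	}
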
Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---
No changes from v2.
---
drivers/cpufreq/cpufreq_conservative.c | 25 +++++----
drivers/cpufreq/cpufreq_governor.c | 90 ++++++++++++++++++++-------------
drivers/cpufreq/cpufreq_governor.h | 35 +++++++-----
drivers/cpufreq/cpufreq_ondemand.c | 29 ++++++----
4 files changed, 107 insertions(+), 72 deletions(-)
Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -41,6 +41,13 @@
/* Ondemand Sampling types */
enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};
+struct gov_attr_set {
+ struct kobject kobj;
+ struct list_head policy_list;
+ struct mutex update_lock;
+ int usage_count;
+};
+
/*
* Abbreviations:
* dbs: used as a shortform for demand based switching It helps to keep variable
@@ -52,7 +59,7 @@ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};
/* Governor demand based switching data (per-policy or global). */
struct dbs_data {
- int usage_count;
+ struct gov_attr_set attr_set;
void *tuners;
unsigned int min_sampling_rate;
unsigned int ignore_nice_load;
@@ -60,37 +67,35 @@ struct dbs_data {
unsigned int sampling_down_factor;
unsigned int up_threshold;
unsigned int io_is_busy;
-
- struct kobject kobj;
- struct list_head policy_dbs_list;
- /*
- * Protect concurrent updates to governor tunables from sysfs,
- * policy_dbs_list and usage_count.
- */
- struct mutex mutex;
};
+static inline struct dbs_data *to_dbs_data(struct gov_attr_set *attr_set)
+{
+ return container_of(attr_set, struct dbs_data, attr_set);
+}
+
/* Governor's specific attributes */
-struct dbs_data;
struct governor_attr {
struct attribute attr;
- ssize_t (*show)(struct dbs_data *dbs_data, char *buf);
- ssize_t (*store)(struct dbs_data *dbs_data, const char *buf,
+ ssize_t (*show)(struct gov_attr_set *attr_set, char *buf);
+ ssize_t (*store)(struct gov_attr_set *attr_set, const char *buf,
size_t count);
};
#define gov_show_one(_gov, file_name) \
static ssize_t show_##file_name \
-(struct dbs_data *dbs_data, char *buf) \
+(struct gov_attr_set *attr_set, char *buf) \
{ \
+ struct dbs_data *dbs_data = to_dbs_data(attr_set); \
struct _gov##_dbs_tuners *tuners = dbs_data->tuners; \
return sprintf(buf, "%u\n", tuners->file_name); \
}
#define gov_show_one_common(file_name) \
static ssize_t show_##file_name \
-(struct dbs_data *dbs_data, char *buf) \
+(struct gov_attr_set *attr_set, char *buf) \
{ \
+ struct dbs_data *dbs_data = to_dbs_data(attr_set); \
return sprintf(buf, "%u\n", dbs_data->file_name); \
}
@@ -184,7 +189,7 @@ void od_register_powersave_bias_handler(
(struct cpufreq_policy *, unsigned int, unsigned int),
unsigned int powersave_bias);
void od_unregister_powersave_bias_handler(void);
-ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf,
+ssize_t store_sampling_rate(struct gov_attr_set *attr_set, const char *buf,
size_t count);
void gov_update_cpu_data(struct dbs_data *dbs_data);
#endif /* _CPUFREQ_GOVERNOR_H */
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -43,9 +43,10 @@ static DEFINE_MUTEX(gov_dbs_data_mutex);
* This must be called with dbs_data->mutex held, otherwise traversing
* policy_dbs_list isn't safe.
*/
-ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf,
+ssize_t store_sampling_rate(struct gov_attr_set *attr_set, const char *buf,
size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct policy_dbs_info *policy_dbs;
unsigned int rate;
int ret;
@@ -59,7 +60,7 @@ ssize_t store_sampling_rate(struct dbs_d
* We are operating under dbs_data->mutex and so the list and its
* entries can't be freed concurrently.
*/
- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) {
+ list_for_each_entry(policy_dbs, &attr_set->policy_list, list) {
mutex_lock(&policy_dbs->timer_mutex);
/*
* On 32-bit architectures this may race with the
@@ -96,7 +97,7 @@ void gov_update_cpu_data(struct dbs_data
{
struct policy_dbs_info *policy_dbs;
- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) {
+ list_for_each_entry(policy_dbs, &dbs_data->attr_set.policy_list, list) {
unsigned int j;
for_each_cpu(j, policy_dbs->policy->cpus) {
@@ -111,9 +112,9 @@ void gov_update_cpu_data(struct dbs_data
}
EXPORT_SYMBOL_GPL(gov_update_cpu_data);
-static inline struct dbs_data *to_dbs_data(struct kobject *kobj)
+static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj)
{
- return container_of(kobj, struct dbs_data, kobj);
+ return container_of(kobj, struct gov_attr_set, kobj);
}
static inline struct governor_attr *to_gov_attr(struct attribute *attr)
@@ -124,25 +125,24 @@ static inline struct governor_attr *to_g
static ssize_t governor_show(struct kobject *kobj, struct attribute *attr,
char *buf)
{
- struct dbs_data *dbs_data = to_dbs_data(kobj);
struct governor_attr *gattr = to_gov_attr(attr);
- return gattr->show(dbs_data, buf);
+ return gattr->show(to_gov_attr_set(kobj), buf);
}
static ssize_t governor_store(struct kobject *kobj, struct attribute *attr,
const char *buf, size_t count)
{
- struct dbs_data *dbs_data = to_dbs_data(kobj);
+ struct gov_attr_set *attr_set = to_gov_attr_set(kobj);
struct governor_attr *gattr = to_gov_attr(attr);
int ret = -EBUSY;
- mutex_lock(&dbs_data->mutex);
+ mutex_lock(&attr_set->update_lock);
- if (dbs_data->usage_count)
- ret = gattr->store(dbs_data, buf, count);
+ if (attr_set->usage_count)
+ ret = gattr->store(attr_set, buf, count);
- mutex_unlock(&dbs_data->mutex);
+ mutex_unlock(&attr_set->update_lock);
return ret;
}
@@ -424,6 +424,41 @@ static void free_policy_dbs_info(struct
gov->free(policy_dbs);
}
+static void gov_attr_set_init(struct gov_attr_set *attr_set,
+ struct list_head *list_node)
+{
+ INIT_LIST_HEAD(&attr_set->policy_list);
+ mutex_init(&attr_set->update_lock);
+ attr_set->usage_count = 1;
+ list_add(list_node, &attr_set->policy_list);
+}
+
+static void gov_attr_set_get(struct gov_attr_set *attr_set,
+ struct list_head *list_node)
+{
+ mutex_lock(&attr_set->update_lock);
+ attr_set->usage_count++;
+ list_add(list_node, &attr_set->policy_list);
+ mutex_unlock(&attr_set->update_lock);
+}
+
+static unsigned int gov_attr_set_put(struct gov_attr_set *attr_set,
+ struct list_head *list_node)
+{
+ unsigned int count;
+
+ mutex_lock(&attr_set->update_lock);
+ list_del(list_node);
+ count = --attr_set->usage_count;
+ mutex_unlock(&attr_set->update_lock);
+ if (count)
+ return count;
+
+ kobject_put(&attr_set->kobj);
+ mutex_destroy(&attr_set->update_lock);
+ return 0;
+}
+
static int cpufreq_governor_init(struct cpufreq_policy *policy)
{
struct dbs_governor *gov = dbs_governor_of(policy);
@@ -452,10 +487,7 @@ static int cpufreq_governor_init(struct
policy_dbs->dbs_data = dbs_data;
policy->governor_data = policy_dbs;
- mutex_lock(&dbs_data->mutex);
- dbs_data->usage_count++;
- list_add(&policy_dbs->list, &dbs_data->policy_dbs_list);
- mutex_unlock(&dbs_data->mutex);
+ gov_attr_set_get(&dbs_data->attr_set, &policy_dbs->list);
goto out;
}
@@ -465,8 +497,7 @@ static int cpufreq_governor_init(struct
goto free_policy_dbs_info;
}
- INIT_LIST_HEAD(&dbs_data->policy_dbs_list);
- mutex_init(&dbs_data->mutex);
+ gov_attr_set_init(&dbs_data->attr_set, &policy_dbs->list);
ret = gov->init(dbs_data, !policy->governor->initialized);
if (ret)
@@ -486,14 +517,11 @@ static int cpufreq_governor_init(struct
if (!have_governor_per_policy())
gov->gdbs_data = dbs_data;
- policy->governor_data = policy_dbs;
-
policy_dbs->dbs_data = dbs_data;
- dbs_data->usage_count = 1;
- list_add(&policy_dbs->list, &dbs_data->policy_dbs_list);
+ policy->governor_data = policy_dbs;
gov->kobj_type.sysfs_ops = &governor_sysfs_ops;
- ret = kobject_init_and_add(&dbs_data->kobj, &gov->kobj_type,
+ ret = kobject_init_and_add(&dbs_data->attr_set.kobj, &gov->kobj_type,
get_governor_parent_kobj(policy),
"%s", gov->gov.name);
if (!ret)
@@ -522,29 +550,21 @@ static int cpufreq_governor_exit(struct
struct dbs_governor *gov = dbs_governor_of(policy);
struct policy_dbs_info *policy_dbs = policy->governor_data;
struct dbs_data *dbs_data = policy_dbs->dbs_data;
- int count;
+ unsigned int count;
/* Protect gov->gdbs_data against concurrent updates. */
mutex_lock(&gov_dbs_data_mutex);
- mutex_lock(&dbs_data->mutex);
- list_del(&policy_dbs->list);
- count = --dbs_data->usage_count;
- mutex_unlock(&dbs_data->mutex);
+ count = gov_attr_set_put(&dbs_data->attr_set, &policy_dbs->list);
- if (!count) {
- kobject_put(&dbs_data->kobj);
-
- policy->governor_data = NULL;
+ policy->governor_data = NULL;
+ if (!count) {
if (!have_governor_per_policy())
gov->gdbs_data = NULL;
gov->exit(dbs_data, policy->governor->initialized == 1);
- mutex_destroy(&dbs_data->mutex);
kfree(dbs_data);
- } else {
- policy->governor_data = NULL;
}
free_policy_dbs_info(policy_dbs, gov);
Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
+++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
@@ -207,9 +207,10 @@ static unsigned int od_dbs_timer(struct
/************************** sysfs interface ************************/
static struct dbs_governor od_dbs_gov;
-static ssize_t store_io_is_busy(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_io_is_busy(struct gov_attr_set *attr_set, const char *buf,
+ size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;
@@ -224,9 +225,10 @@ static ssize_t store_io_is_busy(struct d
return count;
}
-static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_up_threshold(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
@@ -240,9 +242,10 @@ static ssize_t store_up_threshold(struct
return count;
}
-static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data,
- const char *buf, size_t count)
+static ssize_t store_sampling_down_factor(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct policy_dbs_info *policy_dbs;
unsigned int input;
int ret;
@@ -254,7 +257,7 @@ static ssize_t store_sampling_down_facto
dbs_data->sampling_down_factor = input;
/* Reset down sampling multiplier in case it was active */
- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) {
+ list_for_each_entry(policy_dbs, &attr_set->policy_list, list) {
/*
* Doing this without locking might lead to using different
* rate_mult values in od_update() and od_dbs_timer().
@@ -267,9 +270,10 @@ static ssize_t store_sampling_down_facto
return count;
}
-static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data,
- const char *buf, size_t count)
+static ssize_t store_ignore_nice_load(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;
@@ -291,9 +295,10 @@ static ssize_t store_ignore_nice_load(st
return count;
}
-static ssize_t store_powersave_bias(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_powersave_bias(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct od_dbs_tuners *od_tuners = dbs_data->tuners;
struct policy_dbs_info *policy_dbs;
unsigned int input;
@@ -308,7 +313,7 @@ static ssize_t store_powersave_bias(stru
od_tuners->powersave_bias = input;
- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list)
+ list_for_each_entry(policy_dbs, &attr_set->policy_list, list)
ondemand_powersave_bias_init(policy_dbs->policy);
return count;
Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c
+++ linux-pm/drivers/cpufreq/cpufreq_conservative.c
@@ -129,9 +129,10 @@ static struct notifier_block cs_cpufreq_
/************************** sysfs interface ************************/
static struct dbs_governor cs_dbs_gov;
-static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data,
- const char *buf, size_t count)
+static ssize_t store_sampling_down_factor(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
@@ -143,9 +144,10 @@ static ssize_t store_sampling_down_facto
return count;
}
-static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_up_threshold(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input;
int ret;
@@ -158,9 +160,10 @@ static ssize_t store_up_threshold(struct
return count;
}
-static ssize_t store_down_threshold(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_down_threshold(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input;
int ret;
@@ -175,9 +178,10 @@ static ssize_t store_down_threshold(stru
return count;
}
-static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data,
- const char *buf, size_t count)
+static ssize_t store_ignore_nice_load(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;
@@ -199,9 +203,10 @@ static ssize_t store_ignore_nice_load(st
return count;
}
-static ssize_t store_freq_step(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_freq_step(struct gov_attr_set *attr_set, const char *buf,
+ size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input;
int ret;
From: Rafael J. Wysocki <[email protected]>
Move abstract code related to struct gov_attr_set to a separate (new)
file so it can be shared with (future) governors that won't share
more code with "ondemand" and "conservative".
No intentional functional changes.
Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---
No changes from v2.
---
drivers/cpufreq/Kconfig | 4 +
drivers/cpufreq/Makefile | 1
drivers/cpufreq/cpufreq_governor.c | 82 ---------------------------
drivers/cpufreq/cpufreq_governor.h | 6 ++
drivers/cpufreq/cpufreq_governor_attr_set.c | 84 ++++++++++++++++++++++++++++
5 files changed, 95 insertions(+), 82 deletions(-)
Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -18,7 +18,11 @@ config CPU_FREQ
if CPU_FREQ
+config CPU_FREQ_GOV_ATTR_SET
+ bool
+
config CPU_FREQ_GOV_COMMON
+ select CPU_FREQ_GOV_ATTR_SET
select IRQ_WORK
bool
Index: linux-pm/drivers/cpufreq/Makefile
===================================================================
--- linux-pm.orig/drivers/cpufreq/Makefile
+++ linux-pm/drivers/cpufreq/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE) +=
obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND) += cpufreq_ondemand.o
obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE) += cpufreq_conservative.o
obj-$(CONFIG_CPU_FREQ_GOV_COMMON) += cpufreq_governor.o
+obj-$(CONFIG_CPU_FREQ_GOV_ATTR_SET) += cpufreq_governor_attr_set.o
obj-$(CONFIG_CPUFREQ_DT) += cpufreq-dt.o
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -112,53 +112,6 @@ void gov_update_cpu_data(struct dbs_data
}
EXPORT_SYMBOL_GPL(gov_update_cpu_data);
-static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj)
-{
- return container_of(kobj, struct gov_attr_set, kobj);
-}
-
-static inline struct governor_attr *to_gov_attr(struct attribute *attr)
-{
- return container_of(attr, struct governor_attr, attr);
-}
-
-static ssize_t governor_show(struct kobject *kobj, struct attribute *attr,
- char *buf)
-{
- struct governor_attr *gattr = to_gov_attr(attr);
-
- return gattr->show(to_gov_attr_set(kobj), buf);
-}
-
-static ssize_t governor_store(struct kobject *kobj, struct attribute *attr,
- const char *buf, size_t count)
-{
- struct gov_attr_set *attr_set = to_gov_attr_set(kobj);
- struct governor_attr *gattr = to_gov_attr(attr);
- int ret = -EBUSY;
-
- mutex_lock(&attr_set->update_lock);
-
- if (attr_set->usage_count)
- ret = gattr->store(attr_set, buf, count);
-
- mutex_unlock(&attr_set->update_lock);
-
- return ret;
-}
-
-/*
- * Sysfs Ops for accessing governor attributes.
- *
- * All show/store invocations for governor specific sysfs attributes, will first
- * call the below show/store callbacks and the attribute specific callback will
- * be called from within it.
- */
-static const struct sysfs_ops governor_sysfs_ops = {
- .show = governor_show,
- .store = governor_store,
-};
-
unsigned int dbs_update(struct cpufreq_policy *policy)
{
struct policy_dbs_info *policy_dbs = policy->governor_data;
@@ -424,41 +377,6 @@ static void free_policy_dbs_info(struct
gov->free(policy_dbs);
}
-static void gov_attr_set_init(struct gov_attr_set *attr_set,
- struct list_head *list_node)
-{
- INIT_LIST_HEAD(&attr_set->policy_list);
- mutex_init(&attr_set->update_lock);
- attr_set->usage_count = 1;
- list_add(list_node, &attr_set->policy_list);
-}
-
-static void gov_attr_set_get(struct gov_attr_set *attr_set,
- struct list_head *list_node)
-{
- mutex_lock(&attr_set->update_lock);
- attr_set->usage_count++;
- list_add(list_node, &attr_set->policy_list);
- mutex_unlock(&attr_set->update_lock);
-}
-
-static unsigned int gov_attr_set_put(struct gov_attr_set *attr_set,
- struct list_head *list_node)
-{
- unsigned int count;
-
- mutex_lock(&attr_set->update_lock);
- list_del(list_node);
- count = --attr_set->usage_count;
- mutex_unlock(&attr_set->update_lock);
- if (count)
- return count;
-
- kobject_put(&attr_set->kobj);
- mutex_destroy(&attr_set->update_lock);
- return 0;
-}
-
static int cpufreq_governor_init(struct cpufreq_policy *policy)
{
struct dbs_governor *gov = dbs_governor_of(policy);
Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -48,6 +48,12 @@ struct gov_attr_set {
int usage_count;
};
+extern const struct sysfs_ops governor_sysfs_ops;
+
+void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node);
+void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node);
+unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node);
+
/*
* Abbreviations:
* dbs: used as a shortform for demand based switching It helps to keep variable
Index: linux-pm/drivers/cpufreq/cpufreq_governor_attr_set.c
===================================================================
--- /dev/null
+++ linux-pm/drivers/cpufreq/cpufreq_governor_attr_set.c
@@ -0,0 +1,84 @@
+/*
+ * Abstract code for CPUFreq governor tunable sysfs attributes.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "cpufreq_governor.h"
+
+static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj)
+{
+ return container_of(kobj, struct gov_attr_set, kobj);
+}
+
+static inline struct governor_attr *to_gov_attr(struct attribute *attr)
+{
+ return container_of(attr, struct governor_attr, attr);
+}
+
+static ssize_t governor_show(struct kobject *kobj, struct attribute *attr,
+ char *buf)
+{
+ struct governor_attr *gattr = to_gov_attr(attr);
+
+ return gattr->show(to_gov_attr_set(kobj), buf);
+}
+
+static ssize_t governor_store(struct kobject *kobj, struct attribute *attr,
+ const char *buf, size_t count)
+{
+ struct gov_attr_set *attr_set = to_gov_attr_set(kobj);
+ struct governor_attr *gattr = to_gov_attr(attr);
+ int ret;
+
+ mutex_lock(&attr_set->update_lock);
+ ret = attr_set->usage_count ? gattr->store(attr_set, buf, count) : -EBUSY;
+ mutex_unlock(&attr_set->update_lock);
+ return ret;
+}
+
+const struct sysfs_ops governor_sysfs_ops = {
+ .show = governor_show,
+ .store = governor_store,
+};
+EXPORT_SYMBOL_GPL(governor_sysfs_ops);
+
+void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node)
+{
+ INIT_LIST_HEAD(&attr_set->policy_list);
+ mutex_init(&attr_set->update_lock);
+ attr_set->usage_count = 1;
+ list_add(list_node, &attr_set->policy_list);
+}
+EXPORT_SYMBOL_GPL(gov_attr_set_init);
+
+void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node)
+{
+ mutex_lock(&attr_set->update_lock);
+ attr_set->usage_count++;
+ list_add(list_node, &attr_set->policy_list);
+ mutex_unlock(&attr_set->update_lock);
+}
+EXPORT_SYMBOL_GPL(gov_attr_set_get);
+
+unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node)
+{
+ unsigned int count;
+
+ mutex_lock(&attr_set->update_lock);
+ list_del(list_node);
+ count = --attr_set->usage_count;
+ mutex_unlock(&attr_set->update_lock);
+ if (count)
+ return count;
+
+ kobject_put(&attr_set->kobj);
+ mutex_destroy(&attr_set->update_lock);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(gov_attr_set_put);
From: Rafael J. Wysocki <[email protected]>
Subject: [PATCH] cpufreq: Support for fast frequency switching
Modify the ACPI cpufreq driver to provide a method for switching
CPU frequencies from interrupt context and update the cpufreq core
to support that method if available.
Introduce a new cpufreq driver callback, ->fast_switch, to be
invoked for frequency switching from interrupt context by (future)
governors supporting that feature via the (new) helper function
cpufreq_driver_fast_switch().
Add a new policy flag, fast_switch_possible, to be set if fast
frequency switching can be used for the given policy, and add a
helper for setting that flag.
Since fast frequency switching is inherently incompatible with
cpufreq transition notifiers, make it possible to set the
fast_switch_possible only if there are no transition notifiers
already registered and make the registration of new transition
notifiers fail if the fast_switch_possible flag is set for at
least one policy.
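Roughly, the core keeps a single signed counter for that; a simplified
sketch of the scheme (helper names are made up here, and the real code
below takes cpufreq_fast_switch_lock around these operations):

	static int cpufreq_fast_switch_count;	/* > 0: fast switching used, < 0: notifiers registered */

	static bool can_enable_fast_switch(void)
	{
		if (cpufreq_fast_switch_count < 0)
			return false;	/* transition notifiers already registered */
		cpufreq_fast_switch_count++;
		return true;
	}

	static bool can_register_transition_notifier(void)
	{
		if (cpufreq_fast_switch_count > 0)
			return false;	/* fast switching enabled for some policy */
		cpufreq_fast_switch_count--;
		return true;
	}
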
Implement the ->fast_switch callback in the ACPI cpufreq driver
and make it set fast_switch_possible during policy initialization
as appropriate.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
Changes from v2:
- The driver ->fast_switch callback and cpufreq_driver_fast_switch()
don't need the relation argument as they will always do RELATION_L now.
- New mechanism to make fast switch and cpufreq notifiers mutually
exclusive.
- cpufreq_driver_fast_switch() doesn't do anything in addition to
invoking the driver callback and returns its return value.
---
drivers/cpufreq/acpi-cpufreq.c | 42 ++++++++++++++++++++
drivers/cpufreq/cpufreq.c | 85 ++++++++++++++++++++++++++++++++++++++---
include/linux/cpufreq.h | 6 ++
3 files changed, 127 insertions(+), 6 deletions(-)
Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
+++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
@@ -458,6 +458,43 @@ static int acpi_cpufreq_target(struct cp
return result;
}
+unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq)
+{
+ struct acpi_cpufreq_data *data = policy->driver_data;
+ struct acpi_processor_performance *perf;
+ struct cpufreq_frequency_table *entry;
+ unsigned int next_perf_state, next_freq, freq;
+
+ /*
+ * Find the closest frequency above target_freq.
+ *
+ * The table is sorted in the reverse order with respect to the
+ * frequency and all of the entries are valid (see the initialization).
+ */
+ entry = data->freq_table;
+ do {
+ entry++;
+ freq = entry->frequency;
+ } while (freq >= target_freq && freq != CPUFREQ_TABLE_END);
+ entry--;
+ next_freq = entry->frequency;
+ next_perf_state = entry->driver_data;
+
+ perf = to_perf_data(data);
+ if (perf->state == next_perf_state) {
+ if (unlikely(data->resume))
+ data->resume = 0;
+ else
+ return next_freq;
+ }
+
+ data->cpu_freq_write(&perf->control_register,
+ perf->states[next_perf_state].control);
+ perf->state = next_perf_state;
+ return next_freq;
+}
+
static unsigned long
acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu)
{
@@ -740,6 +777,10 @@ static int acpi_cpufreq_cpu_init(struct
goto err_unreg;
}
+ if (!acpi_pstate_strict && !(policy_is_shared(policy)
+ && policy->shared_type != CPUFREQ_SHARED_TYPE_ANY))
+ cpufreq_enable_fast_switch(policy);
+
data->freq_table = kzalloc(sizeof(*data->freq_table) *
(perf->state_count+1), GFP_KERNEL);
if (!data->freq_table) {
@@ -874,6 +915,7 @@ static struct freq_attr *acpi_cpufreq_at
static struct cpufreq_driver acpi_cpufreq_driver = {
.verify = cpufreq_generic_frequency_table_verify,
.target_index = acpi_cpufreq_target,
+ .fast_switch = acpi_cpufreq_fast_switch,
.bios_limit = acpi_processor_get_bios_limit,
.init = acpi_cpufreq_cpu_init,
.exit = acpi_cpufreq_cpu_exit,
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -81,6 +81,7 @@ struct cpufreq_policy {
struct cpufreq_governor *governor; /* see below */
void *governor_data;
char last_governor[CPUFREQ_NAME_LEN]; /* last governor used */
+ bool fast_switch_possible;
struct work_struct update; /* if update_policy() needs to be
* called, but you're in IRQ context */
@@ -156,6 +157,7 @@ int cpufreq_get_policy(struct cpufreq_po
int cpufreq_update_policy(unsigned int cpu);
bool have_governor_per_policy(void);
struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
+void cpufreq_enable_fast_switch(struct cpufreq_policy *policy);
#else
static inline unsigned int cpufreq_get(unsigned int cpu)
{
@@ -236,6 +238,8 @@ struct cpufreq_driver {
unsigned int relation); /* Deprecated */
int (*target_index)(struct cpufreq_policy *policy,
unsigned int index);
+ unsigned int (*fast_switch)(struct cpufreq_policy *policy,
+ unsigned int target_freq);
/*
* Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION
* unset.
@@ -450,6 +454,8 @@ struct cpufreq_governor {
};
/* Pass a target to the cpufreq driver */
+unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq);
int cpufreq_driver_target(struct cpufreq_policy *policy,
unsigned int target_freq,
unsigned int relation);
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -428,6 +428,26 @@ void cpufreq_freq_transition_end(struct
}
EXPORT_SYMBOL_GPL(cpufreq_freq_transition_end);
+/*
+ * Fast frequency switching status count. Positive means "enabled", negative
+ * means "disabled" and 0 means "don't care".
+ */
+static int cpufreq_fast_switch_count;
+static DEFINE_MUTEX(cpufreq_fast_switch_lock);
+
+void cpufreq_enable_fast_switch(struct cpufreq_policy *policy)
+{
+ mutex_lock(&cpufreq_fast_switch_lock);
+ if (cpufreq_fast_switch_count >= 0) {
+ cpufreq_fast_switch_count++;
+ policy->fast_switch_possible = true;
+ } else {
+ pr_warn("cpufreq: CPU%u: Fast freqnency switching not enabled\n",
+ policy->cpu);
+ }
+ mutex_unlock(&cpufreq_fast_switch_lock);
+}
+EXPORT_SYMBOL_GPL(cpufreq_enable_fast_switch);
/*********************************************************************
* SYSFS INTERFACE *
@@ -1074,6 +1094,23 @@ static void cpufreq_policy_free(struct c
kfree(policy);
}
+static void cpufreq_driver_exit_policy(struct cpufreq_policy *policy)
+{
+ if (policy->fast_switch_possible) {
+ mutex_lock(&cpufreq_fast_switch_lock);
+
+ if (!WARN_ON(cpufreq_fast_switch_count <= 0))
+ cpufreq_fast_switch_count--;
+
+ mutex_unlock(&cpufreq_fast_switch_lock);
+ }
+
+ if (cpufreq_driver->exit) {
+ cpufreq_driver->exit(policy);
+ policy->freq_table = NULL;
+ }
+}
+
static int cpufreq_online(unsigned int cpu)
{
struct cpufreq_policy *policy;
@@ -1237,8 +1274,7 @@ static int cpufreq_online(unsigned int c
out_exit_policy:
up_write(&policy->rwsem);
- if (cpufreq_driver->exit)
- cpufreq_driver->exit(policy);
+ cpufreq_driver_exit_policy(policy);
out_free_policy:
cpufreq_policy_free(policy, !new_policy);
return ret;
@@ -1335,10 +1371,7 @@ static void cpufreq_offline(unsigned int
* since this is a core component, and is essential for the
* subsequent light-weight ->init() to succeed.
*/
- if (cpufreq_driver->exit) {
- cpufreq_driver->exit(policy);
- policy->freq_table = NULL;
- }
+ cpufreq_driver_exit_policy(policy);
unlock:
up_write(&policy->rwsem);
@@ -1665,8 +1698,18 @@ int cpufreq_register_notifier(struct not
switch (list) {
case CPUFREQ_TRANSITION_NOTIFIER:
+ mutex_lock(&cpufreq_fast_switch_lock);
+
+ if (cpufreq_fast_switch_count > 0) {
+ mutex_unlock(&cpufreq_fast_switch_lock);
+ return -EPERM;
+ }
ret = srcu_notifier_chain_register(
&cpufreq_transition_notifier_list, nb);
+ if (!ret)
+ cpufreq_fast_switch_count--;
+
+ mutex_unlock(&cpufreq_fast_switch_lock);
break;
case CPUFREQ_POLICY_NOTIFIER:
ret = blocking_notifier_chain_register(
@@ -1699,8 +1742,14 @@ int cpufreq_unregister_notifier(struct n
switch (list) {
case CPUFREQ_TRANSITION_NOTIFIER:
+ mutex_lock(&cpufreq_fast_switch_lock);
+
ret = srcu_notifier_chain_unregister(
&cpufreq_transition_notifier_list, nb);
+ if (!ret && !WARN_ON(cpufreq_fast_switch_count >= 0))
+ cpufreq_fast_switch_count++;
+
+ mutex_unlock(&cpufreq_fast_switch_lock);
break;
case CPUFREQ_POLICY_NOTIFIER:
ret = blocking_notifier_chain_unregister(
@@ -1719,6 +1768,30 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie
* GOVERNORS *
*********************************************************************/
+/**
+ * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch.
+ * @policy: cpufreq policy to switch the frequency for.
+ * @target_freq: New frequency to set (may be approximate).
+ *
+ * Carry out a fast frequency switch from interrupt context.
+ *
+ * This function must not be called if policy->fast_switch_possible is unset.
+ *
+ * Governors calling this function must guarantee that it will never be invoked
+ * twice in parallel for the same policy and that it will never be called in
+ * parallel with either ->target() or ->target_index() for the same policy.
+ *
+ * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch()
+ * callback to indicate an error condition, the hardware configuration must be
+ * preserved.
+ */
+unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq)
+{
+ return cpufreq_driver->fast_switch(policy, target_freq);
+}
+EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch);
+
/* Must set freqs->new to intermediate frequency */
static int __target_intermediate(struct cpufreq_policy *policy,
struct cpufreq_freqs *freqs, int index)
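For illustration only, here is a rough sketch of how a governor could drive
cpufreq_driver_fast_switch() from its utilization update hook; the my_gov_*
names are made up, the frequency formula is a placeholder, and locking and
error handling are omitted:

struct my_gov_policy {
	struct cpufreq_policy *policy;
	struct update_util_data update_util;	/* ->func points to my_gov_update() */
};

static void my_gov_update(struct update_util_data *data, u64 time,
			  unsigned long util, unsigned long max)
{
	struct my_gov_policy *gp = container_of(data, struct my_gov_policy,
						update_util);
	unsigned int next_f;

	if (!gp->policy->fast_switch_possible || !max)
		return;	/* take the slow, process-context path instead */

	/* Placeholder frequency selection: scale max_freq by util/max. */
	next_f = gp->policy->cpuinfo.max_freq * util / max;

	/* The governor guarantees no concurrent invocations for this policy. */
	cpufreq_driver_fast_switch(gp->policy, next_f);
}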
On Mon, Mar 07, 2016 at 03:41:15AM +0100, Rafael J. Wysocki wrote:
> If my understanding of the frequency invariant utilization idea is correct,
> it is about re-scaling utilization so it is always relative to the capacity
> at the max frequency.
Right. So if a workload runs for 5ms at @1GHz and 10ms @500MHz, it would
still result in the exact same utilization.
> If that's the case, then instead of using
> x = util_raw / max
> we will use something like
> y = (util_raw / max) * (f / max_freq) (f - current frequency).
I don't get the last term. Assuming fixed frequency hardware (we can't
really assume anything else) I get to:
util = util_raw * (current_freq / max_freq) (1)
x = util / max (2)
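As a quick sanity check of (1)-(2) with made-up numbers: util_raw = 512,
max = 1024, current_freq = 600 MHz, max_freq = 1200 MHz gives

	util = 512 * 600 / 1200 = 256		(by (1))
	x    = 256 / 1024 = 0.25		(by (2))

i.e. a task keeping the CPU 50% busy at half the max frequency shows up as
25% of the maximum capacity.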
> so there's no hope that the same formula will ever work for both "raw"
> and "frequency invariant" utilization.
Here I agree, however the above (current_freq / max_freq) term is easily
computable, and really the only thing we can assume if the arch doesn't
implement freq invariant accounting.
> (c) Code for using either "raw" or "frequency invariant" depending on
> a callback flag or something like that.
Seeing how frequency invariance is an arch feature, and cpufreq drivers
are also typically arch specific, do we really need a flag at this
level?
In any case, I think the only difference between the two formula should
be the addition of (1) for the platforms that do not already implement
frequency invariance.
That is actually correct for platforms which do as told with their DVFS
bits. And there's really not much else we can do short of implementing
the scheduler arch hook to do better.
> (b) Make all architectures use "frequency invariant" and then look for a
> working formula (seems rather less than realistic to me to be honest).
There was a proposal to implement arch_scale_freq_capacity() as a weak
function and have it serve the cpufreq selected frequency for (1) so
that everything would default to that.
We didn't do that because that makes the function call and
multiplications unconditional. It's cheaper to add (1) to the cpufreq
side when selecting a freq rather than at every single time we update
the util statistics.
On Thu, Mar 03, 2016 at 07:26:24PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 03, 2016 at 05:28:55PM +0000, Dietmar Eggemann wrote:
> > Wasn't there the problem that this ratio goes to zero if the cpu is idle
> > in the old power estimation approach on x86?
>
> Yeah, there was something funky.
So it might have been that when we're nearly idle the hardware runs at
low frequency, which under that old code would have resulted in lowering
the capacity of that cpu. Which in turn would have resulted in the
scheduler moving the little work it had away and it being even more
idle.
On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <[email protected]> wrote:
> On Mon, Mar 07, 2016 at 03:41:15AM +0100, Rafael J. Wysocki wrote:
>
>> If my understanding of the frequency invariant utilization idea is correct,
>> it is about re-scaling utilization so it is always relative to the capacity
>> at the max frequency.
>
> Right. So if a workload runs for 5ms at @1GHz and 10ms @500MHz, it would
> still result in the exact same utilization.
>
>> If that's the case, then instead of using
>> x = util_raw / max
>> we will use something like
>> y = (util_raw / max) * (f / max_freq) (f - current frequency).
>
> I don't get the last term.
The "(f - current frequency)" thing? It doesn't belong to the
formula, sorry for the confusion.
So it is almost the same as your (1) below (except for the max in the
denominator), so my y is your x. :-)
> Assuming fixed frequency hardware (we can't
> really assume anything else) I get to:
>
> util = util_raw * (current_freq / max_freq) (1)
> x = util / max (2)
>
>> so there's no hope that the same formula will ever work for both "raw"
>> and "frequency invariant" utilization.
>
> Here I agree, however the above (current_freq / max_freq) term is easily
> computable, and really the only thing we can assume if the arch doesn't
> implement freq invariant accounting.
Right.
>> (c) Code for using either "raw" or "frequency invariant" depending on
>> a callback flag or something like that.
>
> Seeing how frequency invariance is an arch feature, and cpufreq drivers
> are also typically arch specific, do we really need a flag at this
> level?
The next frequency is selected by the governor, and that's why the flag
is needed there. The driver only gets a frequency to set.
Now, the governor needs to work with different platforms, so it needs
to know how to deal with the given one.
> In any case, I think the only difference between the two formula should
> be the addition of (1) for the platforms that do not already implement
> frequency invariance.
OK
So I'm reading this as a statement that linear is a better
approximation for frequency invariant utilization.
This means that on platforms where the utilization is frequency
invariant we should use
next_freq = a * x
(where x is given by (2) above) and for platforms where the
utilization is not frequency invariant
next_freq = a * x * current_freq / max_freq
and all boils down to finding a.
Now, it seems reasonable for a to be something like (1 + 1/n) *
max_freq, so for the non-frequency-invariant case we get
next_freq = (1 + 1/n) * current_freq * x
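For example (purely illustrative numbers), n = 4 and current_freq = 1000 MHz
give next_freq = 1.25 * 1000 MHz * x, so any x above 0.8 (i.e. more than 80%
busy at the current frequency) requests a higher frequency, and x = 0.9
requests 1125 MHz.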
> That is actually correct for platforms which do as told with their DVFS
> bits. And there's really not much else we can do short of implementing
> the scheduler arch hook to do better.
>
>> (b) Make all architectures use "frequency invariant" and then look for a
>> working formula (seems rather less than realistic to me to be honest).
>
> There was a proposal to implement arch_scale_freq_capacity() as a weak
> function and have it serve the cpufreq selected frequency for (1) so
> that everything would default to that.
>
> We didn't do that because that makes the function call and
> multiplications unconditional. It's cheaper to add (1) to the cpufreq
> side when selecting a freq rather than at every single time we update
> the util statistics.
That's fine by me.
My point was basically that we need different formulas for the
frequency-invariant case and the non-invariant one.
On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote:
> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <[email protected]> wrote:
> > Seeing how frequency invariance is an arch feature, and cpufreq drivers
> > are also typically arch specific, do we really need a flag at this
> > level?
>
> The next frequency is selected by the governor and that's why. The
> driver gets a frequency to set only.
>
> Now, the governor needs to work with different platforms, so it needs
> to know how to deal with the given one.
Ah, indeed. In any case, the availability of arch_sched_scale_freq() is
a compile time thingy, so we can, at compile time, know what to use.
> > In any case, I think the only difference between the two formula should
> > be the addition of (1) for the platforms that do not already implement
> > frequency invariance.
>
> OK
>
> So I'm reading this as a statement that linear is a better
> approximation for frequency invariant utilization.
Well, (1) is what the scheduler does with frequency invariance, except
that allows a more flexible definition of 'current frequency' by asking
for it every time we update the util stats.
But if a platform doesn't need this, ie. it has a fixed frequency, or
simply doesn't provide anything like this, assuming we run at the
frequency we asked for is a reasonable assumption no?
> This means that on platforms where the utilization is frequency
> invariant we should use
>
> next_freq = a * x
>
> (where x is given by (2) above) and for platforms where the
> utilization is not frequency invariant
>
> next_freq = a * x * current_freq / max_freq
>
> and all boils down to finding a.
Right.
> Now, it seems reasonable for a to be something like (1 + 1/n) *
> max_freq, so for non-frequency invariant we get
>
>> next_freq = (1 + 1/n) * current_freq * x
This seems like a big leap; where does:
(1 + 1/n) * max_freq
come from? And what is 'n'?
On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote:
>> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <[email protected]> wrote:
>
>> > Seeing how frequency invariance is an arch feature, and cpufreq drivers
>> > are also typically arch specific, do we really need a flag at this
>> > level?
>>
>> The next frequency is selected by the governor and that's why. The
>> driver gets a frequency to set only.
>>
>> Now, the governor needs to work with different platforms, so it needs
>> to know how to deal with the given one.
>
> Ah, indeed. In any case, the availability of arch_sched_scale_freq() is
> a compile time thingy, so we can, at compile time, know what to use.
>
>> > In any case, I think the only difference between the two formula should
>> > be the addition of (1) for the platforms that do not already implement
>> > frequency invariance.
>>
>> OK
>>
>> So I'm reading this as a statement that linear is a better
>> approximation for frequency invariant utilization.
>
> Well, (1) is what the scheduler does with frequency invariance, except
> that allows a more flexible definition of 'current frequency' by asking
> for it every time we update the util stats.
>
> But if a platform doesn't need this, ie. it has a fixed frequency, or
> simply doesn't provide anything like this, assuming we run at the
> frequency we asked for is a reasonable assumption no?
>
>> This means that on platforms where the utilization is frequency
>> invariant we should use
>>
>> next_freq = a * x
>>
>> (where x is given by (2) above) and for platforms where the
>> utilization is not frequency invariant
>>
>> next_freq = a * x * current_freq / max_freq
>>
>> and all boils down to finding a.
>
> Right.
However, that doesn't seem to be in agreement with Steve's results
posted earlier in this thread.
Also, theoretically, with frequency-invariant utilization, the only way you can get
to 100% utilization is by running at the max frequency, so the closer
to 100% you get, the faster you need to run to get any further. That
indicates nonlinear to me.
>> Now, it seems reasonable for a to be something like (1 + 1/n) *
>> max_freq, so for non-frequency invariant we get
>>
>> next_freq = (1 + 1/n) * current_freq * x
>
> This seems like a big leap; where does:
>
> (1 + 1/n) * max_freq
>
> come from? And what is 'n'?
a = max_freq gives next_freq = max_freq for x = 1, but with that
choice of a you may never get to x = 1 with frequency invariant
because of the feedback effect mentioned above, so the 1/n produces
the extra boost needed for that (n is a positive integer).
Quite frankly, to me it looks like linear really is a better
approximation for "raw" utilization. That is, for frequency invariant
x we should take:
next_freq = a * x * max_freq / current_freq
(and if x is not frequency invariant, the right-hand side becomes a *
x). Then, the extra boost needed to get to x = 1 for frequency
invariant is produced by the (max_freq / current_freq) factor that is
greater than 1 as long as we are not running at max_freq and a can be
chosen as max_freq.
Hi,
sorry if I didn't reply yet. Trying to cope with jetlag and
talks/meetings these days :-). Let me see if I'm getting what you are
discussing, though.
On 08/03/16 21:05, Rafael J. Wysocki wrote:
> On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote:
> >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <[email protected]> wrote:
[...]
> a = max_freq gives next_freq = max_freq for x = 1, but with that
> choice of a you may never get to x = 1 with frequency invariant
> because of the feedback effect mentioned above, so the 1/n produces
> the extra boost needed for that (n is a positive integer).
>
> Quite frankly, to me it looks like linear really is a better
> approximation for "raw" utilization. That is, for frequency invariant
> x we should take:
>
> next_freq = a * x * max_freq / current_freq
>
> (and if x is not frequency invariant, the right-hand side becomes a *
> x). Then, the extra boost needed to get to x = 1 for frequency
> invariant is produced by the (max_freq / current_freq) factor that is
> greater than 1 as long as we are not running at max_freq and a can be
> chosen as max_freq.
>
Expanding terms again, your original formula (without the 1.1 factor of
the last version) was:
next_freq = util / max_cap * max_freq
and this doesn't work when we have freq invariance since util won't go
over curr_cap.
What you propose above is to add another factor, so that we have:
next_freq = util / max_cap * max_freq / curr_freq * max_freq
which should give us the opportunity to reach max_freq also with freq
invariance.
This should actually be the same as doing:
next_freq = util / max_cap * max_cap / curr_cap * max_freq
We are basically scaling how much the cpu is busy at curr_cap back to
the 0..1024 scale. And we use this to select next_freq. Also, we can
simplify this to:
next_freq = util / curr_cap * max_freq
and we save some ops.
However, if that is correct, I think we might have a problem, as we are
skewing OPP selection towards higher frequencies. Let's suppose we have
a platform with 3 OPPs:
freq cap
1200 1024
900 768
600 512
As soon as a task reaches a utilization of 257 we will be selecting the
second OPP, as
next_freq = 257 / 512 * 1200 ~ 602
while the cpu is only 50% busy in this case. And we will go to the max OPP
when reaching ~492 (~64% of 768).
That said, I guess this might work as a first solution, but we will
probably need something better in the future. I understand Rafael's
concerns regarding margins, but it seems to me that some kind of
additional parameter will probably be needed anyway to fix this.
Just to say again how we handle this in schedfreq: with a -20% margin
applied to the lowest OPP we will get to the next one when utilization
reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones,
which is less aggressive and might be better IMHO.
Best,
- Juri
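For what it's worth, here is a throwaway user-space C snippet (not kernel
code; the table and helper are made up) that reproduces the numbers in the
OPP example above:

#include <stdio.h>

/* The hypothetical 3-OPP platform from the example (freq in MHz, capacity 0..1024). */
static const struct { int freq, cap; } opp[] = {
	{ 600, 512 }, { 900, 768 }, { 1200, 1024 },
};

/* next_freq = util / curr_cap * max_freq, then a RELATION_L-like pick:
 * the lowest OPP at or above the target. */
static int pick_opp(int util, int curr_cap)
{
	int target = util * 1200 / curr_cap;
	unsigned int i;

	for (i = 0; i < sizeof(opp) / sizeof(opp[0]); i++)
		if (opp[i].freq >= target)
			return opp[i].freq;
	return opp[2].freq;
}

int main(void)
{
	/* util = 256 is exactly 50% of the lowest capacity and stays at 600 MHz;
	 * util = 257 gives a target of ~602 and already selects the 900 MHz OPP. */
	printf("util=256 -> %d MHz\n", pick_opp(256, 512));
	printf("util=257 -> %d MHz\n", pick_opp(257, 512));
	return 0;
}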
On Fri, Mar 04, 2016 at 03:58:22AM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> Use the observation that cpufreq_update_util() is only called
> by the scheduler with rq->lock held, so the callers of
> cpufreq_set_update_util_data() can use synchronize_sched()
> instead of synchronize_rcu() to wait for cpufreq_update_util()
> to complete. Moreover, if they are updated to do that,
> rcu_read_(un)lock() calls in cpufreq_update_util() might be
> replaced with rcu_read_(un)lock_sched(), respectively, but
> those aren't really necessary, because the scheduler calls
> that function from RCU-sched read-side critical sections
> already.
>
> In addition to that, if cpufreq_set_update_util_data() checks
> the func field in the struct update_util_data before setting
> the per-CPU pointer to it, the data->func check may be dropped
> from cpufreq_update_util() as well.
>
> Make the above changes to reduce the overhead from
> cpufreq_update_util() in the scheduler paths invoking it
> and to make the cleanup after removing its callbacks less
> heavy-weight somewhat.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> Acked-by: Viresh Kumar <[email protected]>
> ---
>
> Changes from the previous version:
> - Use rcu_dereference_sched() in cpufreq_update_util().
Which I think also shows the WARN_ON I insisted upon is redundant.
In any case, I cannot object to reducing overhead, esp. as this whole
patch was suggested by me in the first place, so:
Acked-by: Peter Zijlstra (Intel) <[email protected]>
That said, how about the below? It avoids a function call.
Ideally the whole thing would be a single direct function call, but
because of the current situation with multiple governors we're stuck
with the indirect call :/
---
drivers/cpufreq/cpufreq.c | 30 +-----------------------------
include/linux/cpufreq.h | 33 +++++++++++++++++++++++++++------
2 files changed, 28 insertions(+), 35 deletions(-)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index b6dd41824368..d594bf18cb02 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -65,7 +65,7 @@ static struct cpufreq_driver *cpufreq_driver;
static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data);
static DEFINE_RWLOCK(cpufreq_driver_lock);
-static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
+DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
/**
* cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer.
@@ -90,34 +90,6 @@ void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
}
EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
-/**
- * cpufreq_update_util - Take a note about CPU utilization changes.
- * @time: Current time.
- * @util: Current utilization.
- * @max: Utilization ceiling.
- *
- * This function is called by the scheduler on every invocation of
- * update_load_avg() on the CPU whose utilization is being updated.
- *
- * It can only be called from RCU-sched read-side critical sections.
- */
-void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
-{
- struct update_util_data *data;
-
-#ifdef CONFIG_LOCKDEP
- WARN_ON(debug_locks && !rcu_read_lock_sched_held());
-#endif
-
- data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
- /*
- * If this isn't inside of an RCU-sched read-side critical section, data
- * may become NULL after the check below.
- */
- if (data)
- data->func(data, time, util, max);
-}
-
/* Flag to suspend/resume CPUFreq governors */
static bool cpufreq_suspended;
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 277024ff2289..62d2a1d623e9 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -146,7 +146,33 @@ static inline bool policy_is_shared(struct cpufreq_policy *policy)
extern struct kobject *cpufreq_global_kobject;
#ifdef CONFIG_CPU_FREQ
-void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
+
+struct update_util_data {
+ void (*func)(struct update_util_data *data,
+ u64 time, unsigned long util, unsigned long max);
+};
+
+DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
+
+/**
+ * cpufreq_update_util - Take a note about CPU utilization changes.
+ * @time: Current time.
+ * @util: Current utilization.
+ * @max: Utilization ceiling.
+ *
+ * This function is called by the scheduler on every invocation of
+ * update_load_avg() on the CPU whose utilization is being updated.
+ *
+ * It can only be called from RCU-sched read-side critical sections.
+ */
+static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+{
+ struct update_util_data *data;
+
+ data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
+ if (data)
+ data->func(data, time, util, max);
+}
/**
* cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
@@ -169,11 +195,6 @@ static inline void cpufreq_trigger_update(u64 time)
cpufreq_update_util(time, ULONG_MAX, 0);
}
-struct update_util_data {
- void (*func)(struct update_util_data *data,
- u64 time, unsigned long util, unsigned long max);
-};
-
void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);
unsigned int cpufreq_get(unsigned int cpu);
On Tue, Mar 08, 2016 at 03:25:16AM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> Commit fe7034338ba0 (cpufreq: Add mechanism for registering
> utilization update callbacks) added cpufreq_update_util() to be
> called by the scheduler (from the CFS part) on utilization updates.
> The goal was to allow CFS to pass utilization information to cpufreq
> and to trigger it to evaluate the frequency/voltage configuration
> (P-state) of every CPU on a regular basis.
>
> However, the last two arguments of that function are never used by
> the current code, so CFS might simply call cpufreq_trigger_update()
> instead of it (like the RT and DL sched classes).
>
> For this reason, drop the last two arguments of cpufreq_update_util(),
> rename it to cpufreq_trigger_update() and modify CFS to call it.
>
> Moreover, since the utilization is not involved in that now, rename
> data types, functions and variables related to cpufreq_trigger_update()
> to reflect that (eg. struct update_util_data becomes struct
> freq_update_hook and so on).
> -void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
> +void cpufreq_trigger_update(u64 time)
So I'm not convinced about this. Yes the utility of this function is
twofold. One to allow in-situ frequency adjustments where possible, but
two, also very much to allow using the statistics already gathered.
Sure, 4.5 will not have any such users, but who cares.
And I'm really not too worried about 'random' people suddenly using it
to base work on. Either people are already participating in these
discussions and will thus be aware of whatever concerns there might be,
or we'll tell them when they post their code.
And when they don't participate and don't post their code, I really
don't care about them anyway :-)
On Wed, Mar 9, 2016 at 2:41 PM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Mar 08, 2016 at 03:25:16AM +0100, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <[email protected]>
>>
>> Commit fe7034338ba0 (cpufreq: Add mechanism for registering
>> utilization update callbacks) added cpufreq_update_util() to be
>> called by the scheduler (from the CFS part) on utilization updates.
>> The goal was to allow CFS to pass utilization information to cpufreq
>> and to trigger it to evaluate the frequency/voltage configuration
>> (P-state) of every CPU on a regular basis.
>>
>> However, the last two arguments of that function are never used by
>> the current code, so CFS might simply call cpufreq_trigger_update()
>> instead of it (like the RT and DL sched classes).
>>
>> For this reason, drop the last two arguments of cpufreq_update_util(),
>> rename it to cpufreq_trigger_update() and modify CFS to call it.
>>
>> Moreover, since the utilization is not involved in that now, rename
>> data types, functions and variables related to cpufreq_trigger_update()
>> to reflect that (eg. struct update_util_data becomes struct
>> freq_update_hook and so on).
>
>> -void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
>> +void cpufreq_trigger_update(u64 time)
>
> So I'm not convinced about this. Yes the utility of this function is
> twofold. One to allow in-situ frequency adjustments where possible, but
> two, also very much to allow using the statistics already gathered.
>
> Sure, 4.5 will not have any such users, but who cares.
>
> And I'm really not too worried about 'random' people suddenly using it
> to base work on. Either people are already participating in these
> discussions and will thus be aware of whatever concerns there might be,
> or we'll tell them when they post their code.
>
> And when they don't participate and don't post their code, I really
> don't care about them anyway :-)
OK
On Wed, Mar 9, 2016 at 1:39 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Mar 04, 2016 at 03:58:22AM +0100, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <[email protected]>
>>
>> Use the observation that cpufreq_update_util() is only called
>> by the scheduler with rq->lock held, so the callers of
>> cpufreq_set_update_util_data() can use synchronize_sched()
>> instead of synchronize_rcu() to wait for cpufreq_update_util()
>> to complete. Moreover, if they are updated to do that,
>> rcu_read_(un)lock() calls in cpufreq_update_util() might be
>> replaced with rcu_read_(un)lock_sched(), respectively, but
>> those aren't really necessary, because the scheduler calls
>> that function from RCU-sched read-side critical sections
>> already.
>>
>> In addition to that, if cpufreq_set_update_util_data() checks
>> the func field in the struct update_util_data before setting
>> the per-CPU pointer to it, the data->func check may be dropped
>> from cpufreq_update_util() as well.
>>
>> Make the above changes to reduce the overhead from
>> cpufreq_update_util() in the scheduler paths invoking it
>> and to make the cleanup after removing its callbacks less
>> heavy-weight somewhat.
>>
>> Signed-off-by: Rafael J. Wysocki <[email protected]>
>> Acked-by: Viresh Kumar <[email protected]>
>> ---
>>
>> Changes from the previous version:
>> - Use rcu_dereference_sched() in cpufreq_update_util().
>
> Which I think also shows the WARN_ON I insisted upon is redundant.
>
> In any case, I cannot object to reducing overhead, esp. as this whole
> patch was suggested by me in the first place, so:
>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>
Thanks!
> That said, how about the below? It avoids a function call.
That is fine by me.
What about taking it a bit further, though, and moving the definition
of cpufreq_update_util_data to somewhere under kernel/sched/ (like
kernel/sched/cpufreq.c maybe)?
Then, the whole static inline void cpufreq_update_util() definition
can go into kernel/sched/sched.h (it doesn't have to be visible
anywhere beyond kernel/sched/) and the only thing that needs to be
exported to cpufreq will be a helper (or two), to set/clear the
cpufreq_update_util_data pointers.
I'll try to cut a patch doing that later today for illustration.
On Wed, Mar 09, 2016 at 03:17:48PM +0100, Rafael J. Wysocki wrote:
> > That said, how about the below? It avoids a function call.
>
> That is fine by me.
>
> What about taking it a bit further, though, and moving the definition
> of cpufreq_update_util_data to somewhere under kernel/sched/ (like
> kernel/sched/cpufreq.c maybe)?
>
> Then, the whole static inline void cpufreq_update_util() definition
> can go into kernel/sched/sched.h (it doesn't have to be visible
> anywhere beyond kernel/sched/) and the only thing that needs to be
> exported to cpufreq will be a helper (or two), to set/clear the
> cpufreq_update_util_data pointers.
>
> I'll try to cut a patch doing that later today for illustration.
Right, that's a blend with your second patch. Sure.
On Tue, Mar 08, 2016 at 09:05:50PM +0100, Rafael J. Wysocki wrote:
> >> This means that on platforms where the utilization is frequency
> >> invariant we should use
> >>
> >> next_freq = a * x
> >>
> >> (where x is given by (2) above) and for platforms where the
> >> utilization is not frequency invariant
> >>
> >> next_freq = a * x * current_freq / max_freq
> >>
> >> and all boils down to finding a.
> >
> > Right.
>
>> However, that doesn't seem to be in agreement with Steve's results
> posted earlier in this thread.
I could not make anything of those numbers.
> Also theoretically, with frequency invariant, the only way you can get
> to 100% utilization is by running at the max frequency, so the closer
> to 100% you get, the faster you need to run to get any further. That
> indicates nonlinear to me.
I'm not seeing that, you get that by using a > 1. No need for
non-linear.
> >> Now, it seems reasonable for a to be something like (1 + 1/n) *
> >> max_freq, so for non-frequency invariant we get
> >>
>> >> next_freq = (1 + 1/n) * current_freq * x
> >
> > This seems like a big leap; where does:
> >
> > (1 + 1/n) * max_freq
> >
> > come from? And what is 'n'?
> a = max_freq gives next_freq = max_freq for x = 1,
next_freq = a * x * current_freq / max_freq
[ a := max_freq, x := 1 ] ->
= max_freq * 1 * current_freq / max_freq
= current_freq
!= max_freq
But I think I see what you're saying; because at x = 1,
current_frequency must be max_frequency. Per your earlier point.
> but with that choice of a you may never get to x = 1 with frequency
> invariant because of the feedback effect mentioned above, so the 1/n
> produces the extra boost needed for that (n is a positive integer).
OK, so that gets us:
a = (1 + 1/n) ; n > 0
[ I would not have chosen (1 + 1/n), but lets stick to that ]
So for n = 4 that gets you: a = 1.25, which effectively gets you an 80%
utilization tipping point. That is, 1.25 * .8 = 1, iow. you'll pick the
next frequency (assuming RELATION_L like selection).
Together this gets you:
next_freq = (1 + 1/n) * max_freq * x * current_freq / max_freq
= (1 + 1/n) * x * current_freq
Again, with n = 4, x > .8 will result in a next_freq > current_freq, and
hence (RELATION_L) pick a higher one.
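Concretely, with illustrative numbers: current_freq = 900 MHz and x = 0.85
give next_freq = 1.25 * 0.85 * 900 MHz = 956 MHz, so with RELATION_L and
available frequencies of 600/900/1200 MHz the 1200 MHz one gets picked,
while x = 0.75 gives 844 MHz and the CPU stays at 900 MHz.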
> Quite frankly, to me it looks like linear really is a better
> approximation for "raw" utilization. That is, for frequency invariant
> x we should take:
>
> next_freq = a * x * max_freq / current_freq
(it's very confusing how you use 'x' for both invariant and
non-invariant).
That doesn't make sense, remember:
util = \Sum_i u_i * freq_i / max_freq (1)
Which for systems where freq_i is constant reduces to:
util = util_raw * current_freq / max_freq (2)
But you cannot reverse this. IOW you cannot try and divide out
current_freq on a frequency invariant metric.
So going by:
next_freq = (1 + 1/n) * max_freq * util (3)
if we substitute (2) into (3) we get:
= (1 + 1/n) * max_freq * util_raw * current_freq / max_freq
= (1 + 1/n) * current_freq * util_raw (4)
Which gets you two formula with the same general behaviour. As (2) is
the only approximation of (1) we can make.
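(Quick check with made-up numbers, treating util as a 0..1 fraction:
util_raw = 0.6, current_freq = 800 MHz, max_freq = 1600 MHz and n = 4 give
util = 0.3 by (2); then (3) yields 1.25 * 1600 * 0.3 = 600 MHz and (4) yields
1.25 * 800 * 0.6 = 600 MHz, i.e. the same result, as expected.)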
On Wednesday, March 09, 2016 04:29:34 PM Peter Zijlstra wrote:
> On Wed, Mar 09, 2016 at 03:17:48PM +0100, Rafael J. Wysocki wrote:
> > > That said, how about the below? It avoids a function call.
> >
> > That is fine by me.
> >
> > What about taking it a bit further, though, and moving the definition
> > of cpufreq_update_util_data to somewhere under kernel/sched/ (like
> > kernel/sched/cpufreq.c maybe)?
> >
> > Then, the whole static inline void cpufreq_update_util() definition
> > can go into kernel/sched/sched.h (it doesn't have to be visible
> > anywhere beyond kernel/sched/) and the only thing that needs to be
> > exported to cpufreq will be a helper (or two), to set/clear the
> > cpufreq_update_util_data pointers.
> >
> > I'll try to cut a patch doing that later today for illustration.
>
> Right, that's a blend with your second patch. Sure.
OK, patch below.
---
From: Rafael J. Wysocki <[email protected]>
Subject: [PATCH] cpufreq: Move scheduler-related code to the sched directory
Create cpufreq.c under kernel/sched/ and move the cpufreq code
related to the scheduler to that file and to sched.h.
Redefine cpufreq_update_util() as a static inline function to avoid
function calls at its call sites in the scheduler code (as suggested
by Peter Zijlstra).
Also move the definition of struct update_util_data and declaration
of cpufreq_set_update_util_data() from include/linux/cpufreq.h to
include/linux/sched.h.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
drivers/cpufreq/cpufreq.c | 53 -------------------------------------
drivers/cpufreq/cpufreq_governor.c | 1
include/linux/cpufreq.h | 34 -----------------------
include/linux/sched.h | 9 ++++++
kernel/sched/Makefile | 1
kernel/sched/cpufreq.c | 37 +++++++++++++++++++++++++
kernel/sched/sched.h | 49 +++++++++++++++++++++++++++++++++-
7 files changed, 96 insertions(+), 88 deletions(-)
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -65,59 +65,6 @@ static struct cpufreq_driver *cpufreq_dr
static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data);
static DEFINE_RWLOCK(cpufreq_driver_lock);
-static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
-
-/**
- * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer.
- * @cpu: The CPU to set the pointer for.
- * @data: New pointer value.
- *
- * Set and publish the update_util_data pointer for the given CPU. That pointer
- * points to a struct update_util_data object containing a callback function
- * to call from cpufreq_update_util(). That function will be called from an RCU
- * read-side critical section, so it must not sleep.
- *
- * Callers must use RCU-sched callbacks to free any memory that might be
- * accessed via the old update_util_data pointer or invoke synchronize_sched()
- * right after this function to avoid use-after-free.
- */
-void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
-{
- if (WARN_ON(data && !data->func))
- return;
-
- rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
-}
-EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
-
-/**
- * cpufreq_update_util - Take a note about CPU utilization changes.
- * @time: Current time.
- * @util: Current utilization.
- * @max: Utilization ceiling.
- *
- * This function is called by the scheduler on every invocation of
- * update_load_avg() on the CPU whose utilization is being updated.
- *
- * It can only be called from RCU-sched read-side critical sections.
- */
-void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
-{
- struct update_util_data *data;
-
-#ifdef CONFIG_LOCKDEP
- WARN_ON(debug_locks && !rcu_read_lock_sched_held());
-#endif
-
- data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
- /*
- * If this isn't inside of an RCU-sched read-side critical section, data
- * may become NULL after the check below.
- */
- if (data)
- data->func(data, time, util, max);
-}
-
/* Flag to suspend/resume CPUFreq governors */
static bool cpufreq_suspended;
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -146,36 +146,6 @@ static inline bool policy_is_shared(stru
extern struct kobject *cpufreq_global_kobject;
#ifdef CONFIG_CPU_FREQ
-void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
-
-/**
- * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
- * @time: Current time.
- *
- * The way cpufreq is currently arranged requires it to evaluate the CPU
- * performance state (frequency/voltage) on a regular basis to prevent it from
- * being stuck in a completely inadequate performance level for too long.
- * That is not guaranteed to happen if the updates are only triggered from CFS,
- * though, because they may not be coming in if RT or deadline tasks are active
- * all the time (or there are RT and DL tasks only).
- *
- * As a workaround for that issue, this function is called by the RT and DL
- * sched classes to trigger extra cpufreq updates to prevent it from stalling,
- * but that really is a band-aid. Going forward it should be replaced with
- * solutions targeted more specifically at RT and DL tasks.
- */
-static inline void cpufreq_trigger_update(u64 time)
-{
- cpufreq_update_util(time, ULONG_MAX, 0);
-}
-
-struct update_util_data {
- void (*func)(struct update_util_data *data,
- u64 time, unsigned long util, unsigned long max);
-};
-
-void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);
-
unsigned int cpufreq_get(unsigned int cpu);
unsigned int cpufreq_quick_get(unsigned int cpu);
unsigned int cpufreq_quick_get_max(unsigned int cpu);
@@ -187,10 +157,6 @@ int cpufreq_update_policy(unsigned int c
bool have_governor_per_policy(void);
struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
#else
-static inline void cpufreq_update_util(u64 time, unsigned long util,
- unsigned long max) {}
-static inline void cpufreq_trigger_update(u64 time) {}
-
static inline unsigned int cpufreq_get(unsigned int cpu)
{
return 0;
Index: linux-pm/kernel/sched/cpufreq.c
===================================================================
--- /dev/null
+++ linux-pm/kernel/sched/cpufreq.c
@@ -0,0 +1,37 @@
+/*
+ * Scheduler code and data structures related to cpufreq.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "sched.h"
+
+DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
+
+/**
+ * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer.
+ * @cpu: The CPU to set the pointer for.
+ * @data: New pointer value.
+ *
+ * Set and publish the update_util_data pointer for the given CPU. That pointer
+ * points to a struct update_util_data object containing a callback function
+ * to call from cpufreq_update_util(). That function will be called from an RCU
+ * read-side critical section, so it must not sleep.
+ *
+ * Callers must use RCU-sched callbacks to free any memory that might be
+ * accessed via the old update_util_data pointer or invoke synchronize_sched()
+ * right after this function to avoid use-after-free.
+ */
+void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
+{
+ if (WARN_ON(data && !data->func))
+ return;
+
+ rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
+}
+EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
Index: linux-pm/kernel/sched/sched.h
===================================================================
--- linux-pm.orig/kernel/sched/sched.h
+++ linux-pm/kernel/sched/sched.h
@@ -9,7 +9,6 @@
#include <linux/irq_work.h>
#include <linux/tick.h>
#include <linux/slab.h>
-#include <linux/cpufreq.h>
#include "cpupri.h"
#include "cpudeadline.h"
@@ -1739,3 +1738,51 @@ static inline u64 irq_time_read(int cpu)
}
#endif /* CONFIG_64BIT */
#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
+#ifdef CONFIG_CPU_FREQ
+DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
+
+/**
+ * cpufreq_update_util - Take a note about CPU utilization changes.
+ * @time: Current time.
+ * @util: Current utilization.
+ * @max: Utilization ceiling.
+ *
+ * This function is called by the scheduler on every invocation of
+ * update_load_avg() on the CPU whose utilization is being updated.
+ *
+ * It can only be called from RCU-sched read-side critical sections.
+ */
+static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+{
+ struct update_util_data *data;
+
+ data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
+ if (data)
+ data->func(data, time, util, max);
+}
+
+/**
+ * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
+ * @time: Current time.
+ *
+ * The way cpufreq is currently arranged requires it to evaluate the CPU
+ * performance state (frequency/voltage) on a regular basis to prevent it from
+ * being stuck in a completely inadequate performance level for too long.
+ * That is not guaranteed to happen if the updates are only triggered from CFS,
+ * though, because they may not be coming in if RT or deadline tasks are active
+ * all the time (or there are RT and DL tasks only).
+ *
+ * As a workaround for that issue, this function is called by the RT and DL
+ * sched classes to trigger extra cpufreq updates to prevent it from stalling,
+ * but that really is a band-aid. Going forward it should be replaced with
+ * solutions targeted more specifically at RT and DL tasks.
+ */
+static inline void cpufreq_trigger_update(u64 time)
+{
+ cpufreq_update_util(time, ULONG_MAX, 0);
+}
+#else
+static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) {}
+static inline void cpufreq_trigger_update(u64 time) {}
+#endif /* CONFIG_CPU_FREQ */
Index: linux-pm/include/linux/sched.h
===================================================================
--- linux-pm.orig/include/linux/sched.h
+++ linux-pm/include/linux/sched.h
@@ -3207,4 +3207,13 @@ static inline unsigned long rlimit_max(u
return task_rlimit_max(current, limit);
}
+#ifdef CONFIG_CPU_FREQ
+struct update_util_data {
+ void (*func)(struct update_util_data *data,
+ u64 time, unsigned long util, unsigned long max);
+};
+
+void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);
+#endif /* CONFIG_CPU_FREQ */
+
#endif
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -18,6 +18,7 @@
#include <linux/export.h>
#include <linux/kernel_stat.h>
+#include <linux/sched.h>
#include <linux/slab.h>
#include "cpufreq_governor.h"
Index: linux-pm/kernel/sched/Makefile
===================================================================
--- linux-pm.orig/kernel/sched/Makefile
+++ linux-pm/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_gr
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ) += cpufreq.o
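For context, a rough sketch of how a governor would hook into this after the
move (the my_gov_* names are made up and error handling is omitted):

static void my_gov_update(struct update_util_data *data, u64 time,
			  unsigned long util, unsigned long max);

static struct update_util_data my_gov_hook = {
	.func = my_gov_update,	/* must not sleep; runs with rq->lock held */
};

static void my_gov_start(int cpu)
{
	cpufreq_set_update_util_data(cpu, &my_gov_hook);
}

static void my_gov_stop(int cpu)
{
	cpufreq_set_update_util_data(cpu, NULL);
	/* Wait for in-flight callbacks before freeing anything they use. */
	synchronize_sched();
}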
On Wed, Mar 9, 2016 at 5:39 PM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Mar 08, 2016 at 09:05:50PM +0100, Rafael J. Wysocki wrote:
>> >> This means that on platforms where the utilization is frequency
>> >> invariant we should use
>> >>
>> >> next_freq = a * x
>> >>
>> >> (where x is given by (2) above) and for platforms where the
>> >> utilization is not frequency invariant
>> >>
>> >> next_freq = a * x * current_freq / max_freq
>> >>
>> >> and all boils down to finding a.
>> >
>> > Right.
>>
>> However, that doesn't seem to be in agreement with the Steve's results
>> posted earlier in this thread.
>
> I could not make anything of those numbers.
>
>> Also theoretically, with frequency invariant, the only way you can get
>> to 100% utilization is by running at the max frequency, so the closer
>> to 100% you get, the faster you need to run to get any further. That
>> indicates nonlinear to me.
>
> I'm not seeing that, you get that by using a > 1. No need for
> non-linear.
OK
>> >> Now, it seems reasonable for a to be something like (1 + 1/n) *
>> >> max_freq, so for non-frequency invariant we get
>> >>
>> >> nex_freq = (1 + 1/n) * current_freq * x
(*) (see below)
>> > This seems like a big leap; where does:
>> >
>> > (1 + 1/n) * max_freq
>> >
>> > come from? And what is 'n'?
>
>> a = max_freq gives next_freq = max_freq for x = 1,
>
> next_freq = a * x * current_freq / max_freq
>
> [ a := max_freq, x := 1 ] ->
>
> = max_freq * 1 * current_freq / max_freq
> = current_freq
>
> != max_freq
>
> But I think I see what you're saying; because at x = 1,
> current_frequency must be max_frequency. Per your earlier point.
Correct.
>> but with that choice of a you may never get to x = 1 with frequency
>> invariant because of the feedback effect mentioned above, so the 1/n
>> produces the extra boost needed for that (n is a positive integer).
>
> OK, so that gets us:
>
> a = (1 + 1/n) ; n > 0
>
> [ I would not have chosen (1 + 1/n), but lets stick to that ]
Well, what would you choose then? :-)
> So for n = 4 that gets you: a = 1.25, which effectively gets you an 80%
> utilization tipping point. That is, 1.25 * .8 = 1, iow. you'll pick the
> next frequency (assuming RELATION_L like selection).
>
> Together this gets you:
>
> next_freq = (1 + 1/n) * max_freq * x * current_freq / max_freq
> = (1 + 1/n) * x * current_freq
That seems to be what I said above (*), isn't it?
> Again, with n = 4, x > .8 will result in a next_freq > current_freq, and
> hence (RELATION_L) pick a higher one.
OK
>> Quite frankly, to me it looks like linear really is a better
>> approximation for "raw" utilization. That is, for frequency invariant
>> x we should take:
>>
>> next_freq = a * x * max_freq / current_freq
>
> (its very confusing how you use 'x' for both invariant and
> non-invariant).
>
> That doesn't make sense, remember:
>
> util = \Sum_i u_i * freq_i / max_freq (1)
>
> Which for systems where freq_i is constant reduces to:
>
> util = util_raw * current_freq / max_freq (2)
>
> But you cannot reverse this. IOW you cannot try and divide out
> current_freq on a frequency invariant metric.
I see.
> So going by:
>
> next_freq = (1 + 1/n) * max_freq * util (3)
I think that should be
next_freq = (1 + 1/n) * max_freq * util / max
(where max is the second argument of cpufreq_update_util) or the
dimensions on both sides don't match.
> if we substitute (2) into (3) we get:
>
> = (1 + 1/n) * max_freq * util_raw * current_freq / max_freq
> = (1 + 1/n) * current_freq * util_raw (4)
>
> Which gets you two formula with the same general behaviour. As (2) is
> the only approximation of (1) we can make.
OK
So since utilization is not frequency invariant in the current
mainline (or linux-next for that matter) AFAIC, I'm going to use the
following in the next version of the schedutil patch series:
next_freq = 1.25 * current_freq * util_raw / max
where util_raw and max are what I get from cpufreq_update_util().
1.25 is for the 80% tipping point which I think is reasonable.
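A minimal user-space sketch of that computation in integer arithmetic (the
helper name is made up and this is not the actual schedutil code):

#include <stdio.h>

/* next_freq = 1.25 * current_freq * util_raw / max, with a 64-bit
 * intermediate so kHz-scale frequencies don't overflow. */
static unsigned int pick_next_freq(unsigned int current_freq,
				   unsigned long util_raw, unsigned long max)
{
	unsigned long long f = current_freq + (current_freq >> 2); /* 1.25x */

	return max ? (unsigned int)(f * util_raw / max) : current_freq;
}

int main(void)
{
	/* ~80% busy at 1.6 GHz asks for ~1.6 GHz, ~90% busy for ~1.8 GHz. */
	printf("%u kHz\n", pick_next_freq(1600000, 819, 1024));
	printf("%u kHz\n", pick_next_freq(1600000, 922, 1024));
	return 0;
}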
On Wed, Mar 9, 2016 at 11:15 AM, Juri Lelli <[email protected]> wrote:
> Hi,
>
> sorry if I didn't reply yet. Trying to cope with jetlag and
> talks/meetings these days :-). Let me see if I'm getting what you are
> discussing, though.
>
> On 08/03/16 21:05, Rafael J. Wysocki wrote:
>> On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote:
>> >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <[email protected]> wrote:
>
> [...]
>
>> a = max_freq gives next_freq = max_freq for x = 1, but with that
>> choice of a you may never get to x = 1 with frequency invariant
>> because of the feedback effect mentioned above, so the 1/n produces
>> the extra boost needed for that (n is a positive integer).
>>
>> Quite frankly, to me it looks like linear really is a better
>> approximation for "raw" utilization. That is, for frequency invariant
>> x we should take:
>>
>> next_freq = a * x * max_freq / current_freq
>>
>> (and if x is not frequency invariant, the right-hand side becomes a *
>> x). Then, the extra boost needed to get to x = 1 for frequency
>> invariant is produced by the (max_freq / current_freq) factor that is
>> greater than 1 as long as we are not running at max_freq and a can be
>> chosen as max_freq.
>>
>
> Expanding terms again, your original formula (without the 1.1 factor of
> the last version) was:
>
> next_freq = util / max_cap * max_freq
>
> and this doesn't work when we have freq invariance since util won't go
> over curr_cap.
Can you please remind me what curr_cap is?
> What you propose above is to add another factor, so that we have:
>
> next_freq = util / max_cap * max_freq / curr_freq * max_freq
>
> which should give us the opportunity to reach max_freq also with freq
> invariance.
>
> This should actually be the same of doing:
>
> next_freq = util / max_cap * max_cap / curr_cap * max_freq
>
> We are basically scaling how much the cpu is busy at curr_cap back to
> the 0..1024 scale. And we use this to select next_freq. Also, we can
> simplify this to:
>
> next_freq = util / curr_cap * max_freq
>
> and we save some ops.
>
> However, if that is correct, I think we might have a problem, as we are
> skewing OPP selection towards higher frequencies. Let's suppose we have
> a platform with 3 OPPs:
>
> freq cap
> 1200 1024
> 900 768
> 600 512
>
> As soon a task reaches an utilization of 257 we will be selecting the
> second OPP as
>
> next_freq = 257 / 512 * 1200 ~ 602
>
> While the cpu is only 50% busy in this case. And we will go at max OPP
> when reaching ~492 (~64% of 768).
>
> That said, I guess this might work as a first solution, but we will
> probably need something better in the future. I understand Rafael's
> concerns regardin margins, but it seems to me that some kind of
> additional parameter will be probably needed anyway to fix this.
> Just to say again how we handle this in schedfreq, with a -20% margin
> applied to the lowest OPP we will get to the next one when utilization
> reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones,
> which is less aggressive and might be better IMHO.
Well, Peter says that my idea is incorrect, so I'll go for
next_freq = C * current_freq * util_raw / max
where C > 1 (and likely C < 1.5) instead.
That means C has to be determined somehow or guessed. The 80% tipping
point condition seems reasonable to me, though, which leads to C =
1.25.
On 10 March 2016 at 06:28, Rafael J. Wysocki <[email protected]> wrote:
> On Wed, Mar 9, 2016 at 5:39 PM, Peter Zijlstra <[email protected]> wrote:
>> On Tue, Mar 08, 2016 at 09:05:50PM +0100, Rafael J. Wysocki wrote:
>>> >> This means that on platforms where the utilization is frequency
>>> >> invariant we should use
>>> >>
>>> >> next_freq = a * x
>>> >>
>>> >> (where x is given by (2) above) and for platforms where the
>>> >> utilization is not frequency invariant
>>> >>
>>> >> next_freq = a * x * current_freq / max_freq
>>> >>
>>> >> and all boils down to finding a.
>>> >
>>> > Right.
>>>
>>> However, that doesn't seem to be in agreement with the Steve's results
>>> posted earlier in this thread.
>>
>> I could not make anything of those numbers.
>>
>>> Also theoretically, with frequency invariant, the only way you can get
>>> to 100% utilization is by running at the max frequency, so the closer
>>> to 100% you get, the faster you need to run to get any further. That
>>> indicates nonlinear to me.
>>
>> I'm not seeing that, you get that by using a > 1. No need for
>> non-linear.
>
> OK
>
>>> >> Now, it seems reasonable for a to be something like (1 + 1/n) *
>>> >> max_freq, so for non-frequency invariant we get
>>> >>
>>> >> nex_freq = (1 + 1/n) * current_freq * x
>
> (*) (see below)
>
>>> > This seems like a big leap; where does:
>>> >
>>> > (1 + 1/n) * max_freq
>>> >
>>> > come from? And what is 'n'?
>>
>>> a = max_freq gives next_freq = max_freq for x = 1,
>>
>> next_freq = a * x * current_freq / max_freq
>>
>> [ a := max_freq, x := 1 ] ->
>>
>> = max_freq * 1 * current_freq / max_freq
>> = current_freq
>>
>> != max_freq
>>
>> But I think I see what you're saying; because at x = 1,
>> current_frequency must be max_frequency. Per your earlier point.
>
> Correct.
>
>>> but with that choice of a you may never get to x = 1 with frequency
>>> invariant because of the feedback effect mentioned above, so the 1/n
>>> produces the extra boost needed for that (n is a positive integer).
>>
>> OK, so that gets us:
>>
>> a = (1 + 1/n) ; n > 0
>>
>> [ I would not have chosen (1 + 1/n), but lets stick to that ]
>
> Well, what would you choose then? :-)
>
>> So for n = 4 that gets you: a = 1.25, which effectively gets you an 80%
>> utilization tipping point. That is, 1.25 * .8 = 1, iow. you'll pick the
>> next frequency (assuming RELATION_L like selection).
>>
>> Together this gets you:
>>
>> next_freq = (1 + 1/n) * max_freq * x * current_freq / max_freq
>> = (1 + 1/n) * x * current_freq
>
> That seems to be what I said above (*), isn't it?
>
>> Again, with n = 4, x > .8 will result in a next_freq > current_freq, and
>> hence (RELATION_L) pick a higher one.
>
> OK
>
>>> Quite frankly, to me it looks like linear really is a better
>>> approximation for "raw" utilization. That is, for frequency invariant
>>> x we should take:
>>>
>>> next_freq = a * x * max_freq / current_freq
>>
>> (its very confusing how you use 'x' for both invariant and
>> non-invariant).
>>
>> That doesn't make sense, remember:
>>
>> util = \Sum_i u_i * freq_i / max_freq (1)
>>
>> Which for systems where freq_i is constant reduces to:
>>
>> util = util_raw * current_freq / max_freq (2)
>>
>> But you cannot reverse this. IOW you cannot try and divide out
>> current_freq on a frequency invariant metric.
>
> I see.
>
>> So going by:
>>
>> next_freq = (1 + 1/n) * max_freq * util (3)
>
> I think that should be
>
> next_freq = (1 + 1/n) * max_freq * util / max
>
> (where max is the second argument of cpufreq_update_util) or the
> dimensions on both sides don't match.
>
>> if we substitute (2) into (3) we get:
>>
>> = (1 + 1/n) * max_freq * util_raw * current_freq / max_freq
>> = (1 + 1/n) * current_freq * util_raw (4)
>>
>> Which gets you two formula with the same general behaviour. As (2) is
>> the only approximation of (1) we can make.
>
> OK
>
> So since utilization is not frequency invariant in the current
> mainline (or linux-next for that matter) AFAIC, I'm going to use the
> following in the next version of the schedutil patch series:
>
> next_freq = 1.25 * current_freq * util_raw / max
>
> where util_raw and max are what I get from cpufreq_update_util().
>
> 1.25 is for the 80% tipping point which I think is reasonable.
We have the arch_scale_freq_capacity function, which is arch dependent
and can be used to merge the two formulas that Peter described above.
By default, arch_scale_freq_capacity returns SCHED_CAPACITY_SCALE, which
is the max capacity, but when arch_scale_freq_capacity is defined by an
architecture it returns current_freq * max_capacity / max_freq, so can't
we use arch_scale_freq_capacity in your formula? Taking your formula
above, it becomes:
next_freq = 1.25 * current_freq * util / arch_scale_freq_capacity()
Without the invariance feature we have the same formula as above,
next_freq = 1.25 * current_freq * util_raw / max, because
SCHED_CAPACITY_SCALE is the max capacity.
With the invariance feature we have next_freq = 1.25 * current_freq *
util / (current_freq * max_capacity / max_freq) = 1.25 * util * max_freq /
max, which is the formula that has to be used with frequency-invariant
utilization.
So we have one formula that works for both configurations (this is not
really optimized for invariant systems, because we multiply and then
divide by current_freq in two different places, but it's better than a
wrong formula).
Now, arch_scale_freq_capacity is available in the kernel/sched/sched.h
header file, which can only be accessed by scheduler code...
Maybe we can pass the arch_scale_freq_capacity value instead of the max
one as a parameter in the update_util function prototype.
Vincent
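Roughly, the merged formula could look like the sketch below (user-space
style, names made up; 'scale' stands for whatever arch_scale_freq_capacity()
would return, i.e. SCHED_CAPACITY_SCALE without frequency invariance or
current_freq * SCHED_CAPACITY_SCALE / max_freq with it):

#define SCHED_CAPACITY_SCALE 1024ULL

static unsigned long merged_next_freq(unsigned long current_freq,
				      unsigned long util, unsigned long scale)
{
	/* next_freq = 1.25 * current_freq * util / scale */
	unsigned long long f = current_freq + (current_freq >> 2);

	return scale ? (unsigned long)(f * util / scale) : current_freq;
}

/* Without invariance: merged_next_freq(freq, util_raw, SCHED_CAPACITY_SCALE).
 * With invariance:    merged_next_freq(freq, util,
 *                                      freq * SCHED_CAPACITY_SCALE / max_freq). */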
On 10/03/16 00:41, Rafael J. Wysocki wrote:
> On Wed, Mar 9, 2016 at 11:15 AM, Juri Lelli <[email protected]> wrote:
> > Hi,
> >
> > sorry if I didn't reply yet. Trying to cope with jetlag and
> > talks/meetings these days :-). Let me see if I'm getting what you are
> > discussing, though.
> >
> > On 08/03/16 21:05, Rafael J. Wysocki wrote:
> >> On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <[email protected]> wrote:
> >> > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote:
> >> >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <[email protected]> wrote:
> >
> > [...]
> >
> >> a = max_freq gives next_freq = max_freq for x = 1, but with that
> >> choice of a you may never get to x = 1 with frequency invariant
> >> because of the feedback effect mentioned above, so the 1/n produces
> >> the extra boost needed for that (n is a positive integer).
> >>
> >> Quite frankly, to me it looks like linear really is a better
> >> approximation for "raw" utilization. That is, for frequency invariant
> >> x we should take:
> >>
> >> next_freq = a * x * max_freq / current_freq
> >>
> >> (and if x is not frequency invariant, the right-hand side becomes a *
> >> x). Then, the extra boost needed to get to x = 1 for frequency
> >> invariant is produced by the (max_freq / current_freq) factor that is
> >> greater than 1 as long as we are not running at max_freq and a can be
> >> chosen as max_freq.
> >>
> >
> > Expanding terms again, your original formula (without the 1.1 factor of
> > the last version) was:
> >
> > next_freq = util / max_cap * max_freq
> >
> > and this doesn't work when we have freq invariance since util won't go
> > over curr_cap.
>
> Can you please remind me what curr_cap is?
>
The capacity at current frequency.
> > What you propose above is to add another factor, so that we have:
> >
> > next_freq = util / max_cap * max_freq / curr_freq * max_freq
> >
> > which should give us the opportunity to reach max_freq also with freq
> > invariance.
> >
> > This should actually be the same as doing:
> >
> > next_freq = util / max_cap * max_cap / curr_cap * max_freq
> >
> > We are basically scaling how much the cpu is busy at curr_cap back to
> > the 0..1024 scale. And we use this to select next_freq. Also, we can
> > simplify this to:
> >
> > next_freq = util / curr_cap * max_freq
> >
> > and we save some ops.
> >
> > However, if that is correct, I think we might have a problem, as we are
> > skewing OPP selection towards higher frequencies. Let's suppose we have
> > a platform with 3 OPPs:
> >
> > freq cap
> > 1200 1024
> > 900 768
> > 600 512
> >
> > As soon as a task reaches a utilization of 257 we will be selecting the
> > second OPP as
> >
> > next_freq = 257 / 512 * 1200 ~ 602
> >
> > While the cpu is only 50% busy in this case. And we will go at max OPP
> > when reaching ~492 (~64% of 768).
> >
> > That said, I guess this might work as a first solution, but we will
> > probably need something better in the future. I understand Rafael's
> > concerns regarding margins, but it seems to me that some kind of
> > additional parameter will probably be needed anyway to fix this.
> > Just to say again how we handle this in schedfreq, with a -20% margin
> > applied to the lowest OPP we will get to the next one when utilization
> > reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones,
> > which is less aggressive and might be better IMHO.
>
> Well, Peter says that my idea is incorrect, so I'll go for
>
> next_freq = C * current_freq * util_raw / max
>
> where C > 1 (and likely C < 1.5) instead.
>
> That means C has to be determined somehow or guessed. The 80% tipping
> point condition seems reasonable to me, though, which leads to C =
> 1.25.
>
Right. So, when using freq. invariant util we have:
next_freq = C * curr_freq * util / curr_cap
as
util_raw = util * max / curr_cap
What Vincent is saying makes sense, though. If we use
arch_scale_freq_capacity() as denominator instead of max, we can use a
single formula for both cases.
Best,
- Juri
On Thu, Mar 10, 2016 at 12:28:52AM +0100, Rafael J. Wysocki wrote:
> > [ I would not have chosen (1 + 1/n), but lets stick to that ]
>
> Well, what would you choose then? :-)
1/p ; 0 < p < 1
or so. Where p then represents the percentile threshold where you want
to bump to the next freq.
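(For example, a tipping point of p = 0.8 gives 1/p = 1.25, i.e. the same
factor as the C = 1.25 used elsewhere in this thread; in the (1 + 1/n)
form that corresponds to n = 4.)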
> I think that should be
>
> next_freq = (1 + 1/n) * max_freq * util / max
>
> (where max is the second argument of cpufreq_update_util) or the
> dimensions on both sides don't match.
Well yes, but so far we were treating util (and util_raw) as 0 < u < 1
values, i.e. already normalized against max.
But yes..
> > if we substitute (2) into (3) we get:
> >
> > = (1 + 1/n) * max_freq * util_raw * current_freq / max_freq
> > = (1 + 1/n) * current_freq * util_raw (4)
> >
> > Which gets you two formulas with the same general behaviour, as (2) is
> > the only approximation of (1) we can make.
>
> OK
>
> So since utilization is not frequency invariant in the current
> mainline (or linux-next for that matter) AFAIC, I'm going to use the
> following in the next version of the schedutil patch series:
>
> next_freq = 1.25 * current_freq * util_raw / max
>
> where util_raw and max are what I get from cpufreq_update_util().
>
> 1.25 is for the 80% tipping point which I think is reasonable.
OK.
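For illustration only, here is a minimal sketch of that computation in
kernel-style integer arithmetic; the helper name is made up, overflow
handling is omitted, and util_raw/max are simply the two values passed
to the update_util callback:

static unsigned long next_freq_raw(unsigned long current_freq,
                                   unsigned long util_raw, unsigned long max)
{
        /* current_freq + current_freq / 4 == 1.25 * current_freq */
        unsigned long boosted = current_freq + (current_freq >> 2);

        /* next_freq = 1.25 * current_freq * util_raw / max */
        return boosted * util_raw / max;
}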
On Wed, Mar 09, 2016 at 10:35:02PM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
> Subject: [PATCH] cpufreq: Move scheduler-related code to the sched directory
>
> Create cpufreq.c under kernel/sched/ and move the cpufreq code
> related to the scheduler to that file and to sched.h.
>
> Redefine cpufreq_update_util() as a static inline function to avoid
> function calls at its call sites in the scheduler code (as suggested
> by Peter Zijlstra).
>
> Also move the definition of struct update_util_data and declaration
> of cpufreq_set_update_util_data() from include/linux/cpufreq.h to
> include/linux/sched.h.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
> drivers/cpufreq/cpufreq.c | 53 -------------------------------------
> drivers/cpufreq/cpufreq_governor.c | 1
> include/linux/cpufreq.h | 34 -----------------------
> include/linux/sched.h | 9 ++++++
> kernel/sched/Makefile | 1
> kernel/sched/cpufreq.c | 37 +++++++++++++++++++++++++
> kernel/sched/sched.h | 49 +++++++++++++++++++++++++++++++++-
> 7 files changed, 96 insertions(+), 88 deletions(-)
Acked-by: Peter Zijlstra (Intel) <[email protected]>
On Thu, Mar 10, 2016 at 10:44:21AM +0700, Vincent Guittot wrote:
> We have the arch_scale_freq_capacity function that is arch dependent
> and can be used to merge the 2 formula that were described by peter
> above.
> By default, arch_scale_freq_capacity return SCHED_CAPACITY_SCALE which
> is max capacity
> but when arch_scale_freq_capacity is defined by an architecture,
> arch_scale_freq_capacity returns current_freq * max_capacity/max_freq
However, current_freq is a very fluid thing, it might (and will) change
very rapidly on some platforms.
This is the same point I made earlier, you cannot try and divide out
current_freq from the invariant measure.
> so can't we use arch_scale_freq in your formula ? Taking your formula
> above it becomes:
> next_freq = 1.25 * current_freq * util / arch_scale_freq_capacity()
No, that cannot work, nor makes any sense, per the above.
> With invariance feature, we have:
>
> next_freq = 1.25 * current_freq * util / (current_freq*max_capacity/max_freq)
> = 1.25 * util * max_freq / max
>
> which is the formula that has to be used with frequency invariant
> utilization.
Wrong, you cannot talk about current_freq in the invariant case.
> May be we can pass arch_scale_freq_capacity value instead of max one
> as a parameter of update_util function prototype
No, since its a compile time thing, we can simply do:
#ifdef arch_scale_freq_capacity
next_freq = (1 + 1/n) * max_freq * (util / max)
#else
next_freq = (1 + 1/n) * current_freq * (util_raw / max)
#endif
On 10 March 2016 at 17:07, Peter Zijlstra <[email protected]> wrote:
> On Thu, Mar 10, 2016 at 10:44:21AM +0700, Vincent Guittot wrote:
>> We have the arch_scale_freq_capacity function that is arch dependent
>> and can be used to merge the 2 formula that were described by peter
>> above.
>> By default, arch_scale_freq_capacity return SCHED_CAPACITY_SCALE which
>> is max capacity
>> but when arch_scale_freq_capacity is defined by an architecture,
>
>> arch_scale_freq_capacity returns current_freq * max_capacity/max_freq
>
> However, current_freq is a very fluid thing, it might (and will) change
> very rapidly on some platforms.
>
> This is the same point I made earlier, you cannot try and divide out
> current_freq from the invariant measure.
>
>> so can't we use arch_scale_freq in your formula ? Taking your formula
>> above it becomes:
>> next_freq = 1.25 * current_freq * util / arch_scale_freq_capacity()
>
> No, that cannot work, nor makes any sense, per the above.
>
>> With invariance feature, we have:
>>
>> next_freq = 1.25 * current_freq * util / (current_freq*max_capacity/max_freq)
>> = 1.25 * util * max_freq / max
>>
>> which is the formula that has to be used with frequency invariant
>> utilization.
>
> Wrong, you cannot talk about current_freq in the invariant case.
>
>> May be we can pass arch_scale_freq_capacity value instead of max one
>> as a parameter of update_util function prototype
>
> No, since its a compile time thing, we can simply do:
>
> #ifdef arch_scale_freq_capacity
> next_freq = (1 + 1/n) * max_freq * (util / max)
> #else
> next_freq = (1 + 1/n) * current_freq * (util_raw / max)
> #endif
Selecting the formula at compile time is clearly better. I wrongly
thought that it couldn't be accepted as a solution.
On Thu, Mar 10, 2016 at 05:23:54PM +0700, Vincent Guittot wrote:
> > No, since its a compile time thing, we can simply do:
> >
> > #ifdef arch_scale_freq_capacity
> > next_freq = (1 + 1/n) * max_freq * (util / max)
> > #else
> > next_freq = (1 + 1/n) * current_freq * (util_raw / max)
> > #endif
>
> selecting formula at compilation is clearly better. I wrongly thought that
> it can't be accepted as a solution.
Well, it's bound to get more 'interesting', since I foresee implementations
not always actually doing the invariant thing.
Take for example the thing I sent:
lkml.kernel.org/r/[email protected]
It both shows why you cannot talk about current_freq, but also that the
above needs a little more help (for the !X86_FEATURE_APERFMPERF case).
But the !arch_scale_freq_capacity case should indeed be that simple.
On Thu, Mar 10, 2016 at 11:30:08AM +0100, Peter Zijlstra wrote:
> On Thu, Mar 10, 2016 at 05:23:54PM +0700, Vincent Guittot wrote:
>
> > > No, since its a compile time thing, we can simply do:
> > >
> > > #ifdef arch_scale_freq_capacity
> > > next_freq = (1 + 1/n) * max_freq * (util / max)
> > > #else
> > > next_freq = (1 + 1/n) * current_freq * (util_raw / max)
> > > #endif
> >
> > selecting formula at compilation is clearly better. I wrongly thought that
> > it can't be accepted as a solution.
>
> Well, it's bound to get more 'interesting', since I foresee implementations
> not always actually doing the invariant thing.
>
> Take for example the thing I sent:
>
> lkml.kernel.org/r/[email protected]
>
> it both shows why you cannot talk about current_freq but also that the
> above needs a little more help (for the !X86_FEATURE_APERFMPERF case).
>
> But the !arch_scale_freq_capacity case should indeed be that simple.
Maybe something like:
#ifdef arch_scale_freq_capacity
#ifndef arch_scale_freq_invariant
#define arch_scale_freq_invariant() (true)
#endif
#else /* arch_scale_freq_capacity */
#define arch_scale_freq_invariant() (false)
#endif
if (arch_scale_freq_invariant())
And have archs that have a conditional arch_scale_freq_capacity()
implementation provide an arch_scale_freq_invariant() implementation.
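As a rough sketch of how a governor might then pick the formula at run
time (the function name and the open-coded 1.25 factor below are
illustrative only, not taken from any posted patch):

static unsigned long sugov_next_freq(unsigned long current_freq,
                                     unsigned long max_freq,
                                     unsigned long util, unsigned long max)
{
        /*
         * util is whatever the scheduler reports: frequency-invariant if
         * arch_scale_freq_invariant() is true, raw otherwise.
         */
        unsigned long freq = arch_scale_freq_invariant() ? max_freq
                                                         : current_freq;

        /* freq + freq / 4 == 1.25 * freq, i.e. the 80% tipping point */
        return (freq + (freq >> 2)) * util / max;
}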
On Thursday, March 10, 2016 11:30:34 AM Juri Lelli wrote:
> On 10/03/16 00:41, Rafael J. Wysocki wrote:
> > On Wed, Mar 9, 2016 at 11:15 AM, Juri Lelli <[email protected]> wrote:
> > > Hi,
> > >
> > > sorry if I didn't reply yet. Trying to cope with jetlag and
> > > talks/meetings these days :-). Let me see if I'm getting what you are
> > > discussing, though.
> > >
> > > On 08/03/16 21:05, Rafael J. Wysocki wrote:
> > >> On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <[email protected]> wrote:
> > >> > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote:
> > >> >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <[email protected]> wrote:
> > >
> > > [...]
> > >
> > >> a = max_freq gives next_freq = max_freq for x = 1, but with that
> > >> choice of a you may never get to x = 1 with frequency invariant
> > >> because of the feedback effect mentioned above, so the 1/n produces
> > >> the extra boost needed for that (n is a positive integer).
> > >>
> > >> Quite frankly, to me it looks like linear really is a better
> > >> approximation for "raw" utilization. That is, for frequency invariant
> > >> x we should take:
> > >>
> > >> next_freq = a * x * max_freq / current_freq
> > >>
> > >> (and if x is not frequency invariant, the right-hand side becomes a *
> > >> x). Then, the extra boost needed to get to x = 1 for frequency
> > >> invariant is produced by the (max_freq / current_freq) factor that is
> > >> greater than 1 as long as we are not running at max_freq and a can be
> > >> chosen as max_freq.
> > >>
> > >
> > > Expanding terms again, your original formula (without the 1.1 factor of
> > > the last version) was:
> > >
> > > next_freq = util / max_cap * max_freq
> > >
> > > and this doesn't work when we have freq invariance since util won't go
> > > over curr_cap.
> >
> > Can you please remind me what curr_cap is?
> >
>
> The capacity at current frequency.
I see, thanks!
> > > What you propose above is to add another factor, so that we have:
> > >
> > > next_freq = util / max_cap * max_freq / curr_freq * max_freq
> > >
> > > which should give us the opportunity to reach max_freq also with freq
> > > invariance.
> > >
> > > This should actually be the same as doing:
> > >
> > > next_freq = util / max_cap * max_cap / curr_cap * max_freq
> > >
> > > We are basically scaling how much the cpu is busy at curr_cap back to
> > > the 0..1024 scale. And we use this to select next_freq. Also, we can
> > > simplify this to:
> > >
> > > next_freq = util / curr_cap * max_freq
> > >
> > > and we save some ops.
> > >
> > > However, if that is correct, I think we might have a problem, as we are
> > > skewing OPP selection towards higher frequencies. Let's suppose we have
> > > a platform with 3 OPPs:
> > >
> > > freq cap
> > > 1200 1024
> > > 900 768
> > > 600 512
> > >
> > > As soon as a task reaches a utilization of 257 we will be selecting the
> > > second OPP as
> > >
> > > next_freq = 257 / 512 * 1200 ~ 602
> > >
> > > While the cpu is only 50% busy in this case. And we will go at max OPP
> > > when reaching ~492 (~64% of 768).
> > >
> > > That said, I guess this might work as a first solution, but we will
> > > probably need something better in the future. I understand Rafael's
> > > concerns regarding margins, but it seems to me that some kind of
> > > additional parameter will probably be needed anyway to fix this.
> > > Just to say again how we handle this in schedfreq, with a -20% margin
> > > applied to the lowest OPP we will get to the next one when utilization
> > > reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones,
> > > which is less aggressive and might be better IMHO.
> >
> > Well, Peter says that my idea is incorrect, so I'll go for
> >
> > next_freq = C * current_freq * util_raw / max
> >
> > where C > 1 (and likely C < 1.5) instead.
> >
> > That means C has to be determined somehow or guessed. The 80% tipping
> > point condition seems reasonable to me, though, which leads to C =
> > 1.25.
> >
>
> Right. So, when using freq. invariant util we have:
>
> next_freq = C * curr_freq * util / curr_cap
>
> as
>
> util_raw = util * max / curr_cap
>
> What Vincent is saying makes sense, though. If we use
> arch_scale_freq_capacity() as denominator instead of max, we can use a
> single formula for both cases.
I'm not convinced about that yet, but let me think about it some more. :-)
Thanks,
Rafael
On Thursday, March 10, 2016 11:56:14 AM Peter Zijlstra wrote:
> On Thu, Mar 10, 2016 at 11:30:08AM +0100, Peter Zijlstra wrote:
> > On Thu, Mar 10, 2016 at 05:23:54PM +0700, Vincent Guittot wrote:
> >
> > > > No, since its a compile time thing, we can simply do:
> > > >
> > > > #ifdef arch_scale_freq_capacity
> > > > next_freq = (1 + 1/n) * max_freq * (util / max)
> > > > #else
> > > > next_freq = (1 + 1/n) * current_freq * (util_raw / max)
> > > > #endif
> > >
> > > selecting formula at compilation is clearly better. I wrongly thought that
> > > it can't be accepted as a solution.
> >
> > Well, it's bound to get more 'interesting', since I foresee implementations
> > not always actually doing the invariant thing.
> >
> > Take for example the thing I sent:
> >
> > lkml.kernel.org/r/[email protected]
> >
> > it both shows why you cannot talk about current_freq but also that the
> > above needs a little more help (for the !X86_FEATURE_APERFMPERF case).
> >
> > But the !arch_scale_freq_capacity case should indeed be that simple.
>
> Maybe something like:
>
> #ifdef arch_scale_freq_capacity
> #ifndef arch_scale_freq_invariant
> #define arch_scale_freq_invariant() (true)
> #endif
> #else /* arch_scale_freq_capacity */
> #define arch_scale_freq_invariant() (false)
> #endif
>
> if (arch_scale_freq_invariant())
>
> And have archs that have a conditional arch_scale_freq_capacity()
> implementation provide an arch_scale_freq_invariant() implementation.
Yeah, looks workable to me.
Quoting Rafael J. Wysocki (2016-03-09 15:41:34)
> On Wed, Mar 9, 2016 at 11:15 AM, Juri Lelli <[email protected]> wrote:
> > Hi,
> >
> > sorry if I didn't reply yet. Trying to cope with jetlag and
> > talks/meetings these days :-). Let me see if I'm getting what you are
> > discussing, though.
> >
> > On 08/03/16 21:05, Rafael J. Wysocki wrote:
> >> On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <[email protected]> wrote:
> >> > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote:
> >> >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <[email protected]> wrote:
> >
> > [...]
> >
> >> a = max_freq gives next_freq = max_freq for x = 1, but with that
> >> choice of a you may never get to x = 1 with frequency invariant
> >> because of the feedback effect mentioned above, so the 1/n produces
> >> the extra boost needed for that (n is a positive integer).
> >>
> >> Quite frankly, to me it looks like linear really is a better
> >> approximation for "raw" utilization. That is, for frequency invariant
> >> x we should take:
> >>
> >> next_freq = a * x * max_freq / current_freq
> >>
> >> (and if x is not frequency invariant, the right-hand side becomes a *
> >> x). Then, the extra boost needed to get to x = 1 for frequency
> >> invariant is produced by the (max_freq / current_freq) factor that is
> >> greater than 1 as long as we are not running at max_freq and a can be
> >> chosen as max_freq.
> >>
> >
> > Expanding terms again, your original formula (without the 1.1 factor of
> > the last version) was:
> >
> > next_freq = util / max_cap * max_freq
> >
> > and this doesn't work when we have freq invariance since util won't go
> > over curr_cap.
>
> Can you please remind me what curr_cap is?
>
> > What you propose above is to add another factor, so that we have:
> >
> > next_freq = util / max_cap * max_freq / curr_freq * max_freq
> >
> > which should give us the opportunity to reach max_freq also with freq
> > invariance.
> >
> > This should actually be the same as doing:
> >
> > next_freq = util / max_cap * max_cap / curr_cap * max_freq
> >
> > We are basically scaling how much the cpu is busy at curr_cap back to
> > the 0..1024 scale. And we use this to select next_freq. Also, we can
> > simplify this to:
> >
> > next_freq = util / curr_cap * max_freq
> >
> > and we save some ops.
> >
> > However, if that is correct, I think we might have a problem, as we are
> > skewing OPP selection towards higher frequencies. Let's suppose we have
> > a platform with 3 OPPs:
> >
> > freq cap
> > 1200 1024
> > 900 768
> > 600 512
> >
> > As soon as a task reaches a utilization of 257 we will be selecting the
> > second OPP as
> >
> > next_freq = 257 / 512 * 1200 ~ 602
> >
> > While the cpu is only 50% busy in this case. And we will go at max OPP
> > when reaching ~492 (~64% of 768).
> >
> > That said, I guess this might work as a first solution, but we will
> > probably need something better in the future. I understand Rafael's
> > concerns regarding margins, but it seems to me that some kind of
> > additional parameter will probably be needed anyway to fix this.
> > Just to say again how we handle this in schedfreq, with a -20% margin
> > applied to the lowest OPP we will get to the next one when utilization
> > reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones,
> > which is less aggressive and might be better IMHO.
>
> Well, Peter says that my idea is incorrect, so I'll go for
>
> next_freq = C * current_freq * util_raw / max
>
> where C > 1 (and likely C < 1.5) instead.
>
> That means C has to be determined somehow or guessed. The 80% tipping
> point condition seems reasonable to me, though, which leads to C =
> 1.25.
Right, that is the same value used in the schedfreq series:
+/*
+ * Capacity margin added to CFS and RT capacity requests to provide
+ * some head room if task utilization further increases.
+ */
+unsigned int capacity_margin = 1280;
Regards,
Mike
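(For reference, capacity_margin is expressed on the SCHED_CAPACITY_SCALE
of 1024 used above, so 1280 / 1024 = 1.25, i.e. the same C = 1.25
tipping-point factor.)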
From: Rafael J. Wysocki <[email protected]>
Modify the ACPI cpufreq driver to provide a method for switching
CPU frequencies from interrupt context and update the cpufreq core
to support that method if available.
Introduce a new cpufreq driver callback, ->fast_switch, to be
invoked for frequency switching from interrupt context by (future)
governors supporting that feature via (new) helper function
cpufreq_driver_fast_switch().
Add two new policy flags, fast_switch_possible, to be set by the
cpufreq driver if fast frequency switching can be used for the
given policy and fast_switch_enabled, to be set by the governor
if it is going to use fast frequency switching for the given
policy. Also add a helper for setting the latter.
Since fast frequency switching is inherently incompatible with
cpufreq transition notifiers, make it possible to set the
fast_switch_enabled only if there are no transition notifiers
already registered and make the registration of new transition
notifiers fail if fast_switch_enabled is set for at least one
policy.
Implement the ->fast_switch callback in the ACPI cpufreq driver
and make it set fast_switch_possible during policy initialization
as appropriate.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
Changes from v3:
- New fast_switch_enabled field in struct cpufreq_policy to help
avoid affecting existing setups by setting the fast_switch_possible
flag in the driver.
- __cpufreq_get() skips the policy->cur check if fast_switch_enabled is set.
Changes from v2:
- The driver ->fast_switch callback and cpufreq_driver_fast_switch()
don't need the relation argument as they will always do RELATION_L now.
- New mechanism to make fast switch and cpufreq notifiers mutually
exclusive.
- cpufreq_driver_fast_switch() doesn't do anything in addition to
invoking the driver callback and returns its return value.
---
drivers/cpufreq/acpi-cpufreq.c | 41 +++++++++++++++
drivers/cpufreq/cpufreq.c | 108 +++++++++++++++++++++++++++++++++++++----
include/linux/cpufreq.h | 9 +++
3 files changed, 149 insertions(+), 9 deletions(-)
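As an illustration of how a governor is expected to consume this
interface (the sugov_* names below are hypothetical and locking/error
handling is elided; the cpufreq calls are the ones added here or already
declared in cpufreq.h):

static void sugov_policy_start(struct cpufreq_policy *policy)
{
        /* Called under policy->rwsem; the policy simply stays on the slow
         * path if transition notifiers are already registered. */
        cpufreq_enable_fast_switch(policy);
}

static void sugov_set_freq(struct cpufreq_policy *policy,
                           unsigned int next_freq)
{
        if (policy->fast_switch_enabled) {
                /* Legal from interrupt context; returns the frequency that
                 * was set, or CPUFREQ_ENTRY_INVALID on error. */
                unsigned int freq = cpufreq_driver_fast_switch(policy, next_freq);

                if (freq != CPUFREQ_ENTRY_INVALID)
                        policy->cur = freq;
        } else {
                /* Fall back to the usual process-context path. */
                __cpufreq_driver_target(policy, next_freq, CPUFREQ_RELATION_L);
        }
}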
Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
+++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
@@ -458,6 +458,43 @@ static int acpi_cpufreq_target(struct cp
return result;
}
+unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq)
+{
+ struct acpi_cpufreq_data *data = policy->driver_data;
+ struct acpi_processor_performance *perf;
+ struct cpufreq_frequency_table *entry;
+ unsigned int next_perf_state, next_freq, freq;
+
+ /*
+ * Find the closest frequency above target_freq.
+ *
+ * The table is sorted in the reverse order with respect to the
+ * frequency and all of the entries are valid (see the initialization).
+ */
+ entry = data->freq_table;
+ do {
+ entry++;
+ freq = entry->frequency;
+ } while (freq >= target_freq && freq != CPUFREQ_TABLE_END);
+ entry--;
+ next_freq = entry->frequency;
+ next_perf_state = entry->driver_data;
+
+ perf = to_perf_data(data);
+ if (perf->state == next_perf_state) {
+ if (unlikely(data->resume))
+ data->resume = 0;
+ else
+ return next_freq;
+ }
+
+ data->cpu_freq_write(&perf->control_register,
+ perf->states[next_perf_state].control);
+ perf->state = next_perf_state;
+ return next_freq;
+}
+
static unsigned long
acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu)
{
@@ -740,6 +777,9 @@ static int acpi_cpufreq_cpu_init(struct
goto err_unreg;
}
+ policy->fast_switch_possible = !acpi_pstate_strict &&
+ !(policy_is_shared(policy) && policy->shared_type != CPUFREQ_SHARED_TYPE_ANY);
+
data->freq_table = kzalloc(sizeof(*data->freq_table) *
(perf->state_count+1), GFP_KERNEL);
if (!data->freq_table) {
@@ -874,6 +914,7 @@ static struct freq_attr *acpi_cpufreq_at
static struct cpufreq_driver acpi_cpufreq_driver = {
.verify = cpufreq_generic_frequency_table_verify,
.target_index = acpi_cpufreq_target,
+ .fast_switch = acpi_cpufreq_fast_switch,
.bios_limit = acpi_processor_get_bios_limit,
.init = acpi_cpufreq_cpu_init,
.exit = acpi_cpufreq_cpu_exit,
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -102,6 +102,10 @@ struct cpufreq_policy {
*/
struct rw_semaphore rwsem;
+ /* Fast switch flags */
+ bool fast_switch_possible; /* Set by the driver. */
+ bool fast_switch_enabled;
+
/* Synchronization for frequency transitions */
bool transition_ongoing; /* Tracks transition status */
spinlock_t transition_lock;
@@ -156,6 +160,7 @@ int cpufreq_get_policy(struct cpufreq_po
int cpufreq_update_policy(unsigned int cpu);
bool have_governor_per_policy(void);
struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
+void cpufreq_enable_fast_switch(struct cpufreq_policy *policy);
#else
static inline unsigned int cpufreq_get(unsigned int cpu)
{
@@ -236,6 +241,8 @@ struct cpufreq_driver {
unsigned int relation); /* Deprecated */
int (*target_index)(struct cpufreq_policy *policy,
unsigned int index);
+ unsigned int (*fast_switch)(struct cpufreq_policy *policy,
+ unsigned int target_freq);
/*
* Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION
* unset.
@@ -464,6 +471,8 @@ struct cpufreq_governor {
};
/* Pass a target to the cpufreq driver */
+unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq);
int cpufreq_driver_target(struct cpufreq_policy *policy,
unsigned int target_freq,
unsigned int relation);
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -428,6 +428,39 @@ void cpufreq_freq_transition_end(struct
}
EXPORT_SYMBOL_GPL(cpufreq_freq_transition_end);
+/*
+ * Fast frequency switching status count. Positive means "enabled", negative
+ * means "disabled" and 0 means "not decided yet".
+ */
+static int cpufreq_fast_switch_count;
+static DEFINE_MUTEX(cpufreq_fast_switch_lock);
+
+/**
+ * cpufreq_enable_fast_switch - Enable fast frequency switching for policy.
+ * @policy: cpufreq policy to enable fast frequency switching for.
+ *
+ * Try to enable fast frequency switching for @policy.
+ *
+ * The attempt will fail if there is at least one transition notifier registered
+ * at this point, as fast frequency switching is quite fundamentally at odds
+ * with transition notifiers. Thus if successful, it will make registration of
+ * transition notifiers fail going forward.
+ *
+ * Call under policy->rwsem.
+ */
+void cpufreq_enable_fast_switch(struct cpufreq_policy *policy)
+{
+ mutex_lock(&cpufreq_fast_switch_lock);
+ if (policy->fast_switch_possible && cpufreq_fast_switch_count >= 0) {
+ cpufreq_fast_switch_count++;
+ policy->fast_switch_enabled = true;
+ } else {
+ pr_warn("cpufreq: CPU%u: Fast frequency switching not enabled\n",
+ policy->cpu);
+ }
+ mutex_unlock(&cpufreq_fast_switch_lock);
+}
+EXPORT_SYMBOL_GPL(cpufreq_enable_fast_switch);
/*********************************************************************
* SYSFS INTERFACE *
@@ -1083,6 +1116,24 @@ static void cpufreq_policy_free(struct c
kfree(policy);
}
+static void cpufreq_driver_exit_policy(struct cpufreq_policy *policy)
+{
+ if (policy->fast_switch_enabled) {
+ mutex_lock(&cpufreq_fast_switch_lock);
+
+ policy->fast_switch_enabled = false;
+ if (!WARN_ON(cpufreq_fast_switch_count <= 0))
+ cpufreq_fast_switch_count--;
+
+ mutex_unlock(&cpufreq_fast_switch_lock);
+ }
+
+ if (cpufreq_driver->exit) {
+ cpufreq_driver->exit(policy);
+ policy->freq_table = NULL;
+ }
+}
+
static int cpufreq_online(unsigned int cpu)
{
struct cpufreq_policy *policy;
@@ -1236,8 +1287,7 @@ static int cpufreq_online(unsigned int c
out_exit_policy:
up_write(&policy->rwsem);
- if (cpufreq_driver->exit)
- cpufreq_driver->exit(policy);
+ cpufreq_driver_exit_policy(policy);
out_free_policy:
cpufreq_policy_free(policy, !new_policy);
return ret;
@@ -1334,10 +1384,7 @@ static void cpufreq_offline(unsigned int
* since this is a core component, and is essential for the
* subsequent light-weight ->init() to succeed.
*/
- if (cpufreq_driver->exit) {
- cpufreq_driver->exit(policy);
- policy->freq_table = NULL;
- }
+ cpufreq_driver_exit_policy(policy);
unlock:
up_write(&policy->rwsem);
@@ -1444,8 +1491,12 @@ static unsigned int __cpufreq_get(struct
ret_freq = cpufreq_driver->get(policy->cpu);
- /* Updating inactive policies is invalid, so avoid doing that. */
- if (unlikely(policy_is_inactive(policy)))
+ /*
+ * Updating inactive policies is invalid, so avoid doing that. Also
+ * if fast frequency switching is used with the given policy, the check
+ * against policy->cur is pointless, so skip it in that case too.
+ */
+ if (unlikely(policy_is_inactive(policy)) || policy->fast_switch_enabled)
return ret_freq;
if (ret_freq && policy->cur &&
@@ -1457,7 +1508,6 @@ static unsigned int __cpufreq_get(struct
schedule_work(&policy->update);
}
}
-
return ret_freq;
}
@@ -1653,8 +1703,18 @@ int cpufreq_register_notifier(struct not
switch (list) {
case CPUFREQ_TRANSITION_NOTIFIER:
+ mutex_lock(&cpufreq_fast_switch_lock);
+
+ if (cpufreq_fast_switch_count > 0) {
+ mutex_unlock(&cpufreq_fast_switch_lock);
+ return -EPERM;
+ }
ret = srcu_notifier_chain_register(
&cpufreq_transition_notifier_list, nb);
+ if (!ret)
+ cpufreq_fast_switch_count--;
+
+ mutex_unlock(&cpufreq_fast_switch_lock);
break;
case CPUFREQ_POLICY_NOTIFIER:
ret = blocking_notifier_chain_register(
@@ -1687,8 +1747,14 @@ int cpufreq_unregister_notifier(struct n
switch (list) {
case CPUFREQ_TRANSITION_NOTIFIER:
+ mutex_lock(&cpufreq_fast_switch_lock);
+
ret = srcu_notifier_chain_unregister(
&cpufreq_transition_notifier_list, nb);
+ if (!ret && !WARN_ON(cpufreq_fast_switch_count >= 0))
+ cpufreq_fast_switch_count++;
+
+ mutex_unlock(&cpufreq_fast_switch_lock);
break;
case CPUFREQ_POLICY_NOTIFIER:
ret = blocking_notifier_chain_unregister(
@@ -1707,6 +1773,30 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie
* GOVERNORS *
*********************************************************************/
+/**
+ * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch.
+ * @policy: cpufreq policy to switch the frequency for.
+ * @target_freq: New frequency to set (may be approximate).
+ *
+ * Carry out a fast frequency switch from interrupt context.
+ *
+ * This function must not be called if policy->fast_switch_enabled is unset.
+ *
+ * Governors calling this function must guarantee that it will never be invoked
+ * twice in parallel for the same policy and that it will never be called in
+ * parallel with either ->target() or ->target_index() for the same policy.
+ *
+ * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch()
+ * callback to indicate an error condition, the hardware configuration must be
+ * preserved.
+ */
+unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq)
+{
+ return cpufreq_driver->fast_switch(policy, target_freq);
+}
+EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch);
+
/* Must set freqs->new to intermediate frequency */
static int __target_intermediate(struct cpufreq_policy *policy,
struct cpufreq_freqs *freqs, int index)
From: Rafael J. Wysocki <[email protected]>
Move definitions of symbols related to transition latency and
sampling rate to include/linux/cpufreq.h so they can be used by
(future) governors located outside of drivers/cpufreq/.
No functional changes.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
New patch.
---
drivers/cpufreq/cpufreq_governor.h | 14 --------------
include/linux/cpufreq.h | 14 ++++++++++++++
2 files changed, 14 insertions(+), 14 deletions(-)
Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -24,20 +24,6 @@
#include <linux/module.h>
#include <linux/mutex.h>
-/*
- * The polling frequency depends on the capability of the processor. Default
- * polling frequency is 1000 times the transition latency of the processor. The
- * governor will work on any processor with transition latency <= 10ms, using
- * appropriate sampling rate.
- *
- * For CPUs with transition latency > 10ms (mostly drivers with CPUFREQ_ETERNAL)
- * this governor will not work. All times here are in us (micro seconds).
- */
-#define MIN_SAMPLING_RATE_RATIO (2)
-#define LATENCY_MULTIPLIER (1000)
-#define MIN_LATENCY_MULTIPLIER (20)
-#define TRANSITION_LATENCY_LIMIT (10 * 1000 * 1000)
-
/* Ondemand Sampling types */
enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -426,6 +426,20 @@ static inline unsigned long cpufreq_scal
#define CPUFREQ_POLICY_POWERSAVE (1)
#define CPUFREQ_POLICY_PERFORMANCE (2)
+/*
+ * The polling frequency depends on the capability of the processor. Default
+ * polling frequency is 1000 times the transition latency of the processor. The
+ * ondemand governor will work on any processor with transition latency <= 10ms,
+ * using appropriate sampling rate.
+ *
+ * For CPUs with transition latency > 10ms (mostly drivers with CPUFREQ_ETERNAL)
+ * the ondemand governor will not work. All times here are in us (microseconds).
+ */
+#define MIN_SAMPLING_RATE_RATIO (2)
+#define LATENCY_MULTIPLIER (1000)
+#define MIN_LATENCY_MULTIPLIER (20)
+#define TRANSITION_LATENCY_LIMIT (10 * 1000 * 1000)
+
/* Governor Events */
#define CPUFREQ_GOV_START 1
#define CPUFREQ_GOV_STOP 2
Hi,
Here's a new iteration of the schedutil governor series. It is based on
linux-next (particularly on the material from my pull request for 4.6-rc1),
so I'm not resending the patches already included there. It has been
present in my pm-cpufreq-experimental branch for a few days.
The first patch is new, but it is just something I think would be useful
(and seems to be kind of compatible with things currently under discussion:
http://marc.info/?l=linux-pm&m=145813384117349&w=4).
The next four patches are needed for sharing code between the new governor
and the existing ones. Three of them have not changed since the previous
iteration of the series and the fourth one is new (but it only moves some
symbols around).
Patch [6/7] adds fast frequency switching support to cpufreq. It has changed
since the previous version. Most importantly, there's a new fast_switch_enabled
field in struct cpufreq_policy which is to be set when fast switching is actually
enabled for the given policy and governors are supposed to set it (using a
helper function provided for that). This way, notifier registrations are only
affected if someone is really using fast switching, which in particular
prevents existing setups from being affected.
Patch [7/7] introduces the schedutil governor. There are a few changes in it
from the previous version.
First off, I've attempted to address some points made during the recent
discussion on the next frequency selection formula
(http://marc.info/?t=145688568600003&r=1&w=4). It essentially uses the
formula from http://marc.info/?l=linux-acpi&m=145756618321500&w=4 (bottom
of the message body), but with the modification that if the utilization
is frequency-invariant, it will use max_freq instead of the current frequency.
It uses the mechanism suggested by Peter to recognize whether or not the
utilization is frequency invariant
(http://marc.info/?l=linux-kernel&m=145760739700716&w=4).
Second, because of the above, the schedutil governor goes into kernel/sched/
(again). Namely, I don't want arch_scale_freq_invariant() to be visible to
all cpufreq governors that won't need it.
Now, since we seem to want to build upon this series (see Mike's recent
patchset: http://marc.info/?l=linux-kernel&m=145793318016832&w=4), I need
you to tell me what to change before it is good enough to be queued up for
4.7 (assuming that my 4.6 material is merged, that is).
Thanks,
Rafael
From: Rafael J. Wysocki <[email protected]>
Replace the single helper for adding and removing cpufreq utilization
update hooks, cpufreq_set_update_util_data(), with a pair of helpers,
cpufreq_add_update_util_hook() and cpufreq_remove_update_util_hook(),
and modify the users of cpufreq_set_update_util_data() accordingly.
With the new helpers, the code using them doesn't need to worry
about the internals of struct update_util_data and in particular
it doesn't need to worry about populating the func field in it
properly upfront.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
New patch.
---
drivers/cpufreq/cpufreq_governor.c | 76 ++++++++++++++++++-------------------
drivers/cpufreq/intel_pstate.c | 8 +--
include/linux/sched.h | 5 +-
kernel/sched/cpufreq.c | 48 ++++++++++++++++++-----
4 files changed, 83 insertions(+), 54 deletions(-)
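As an illustration of the new interface (the my_gov_* names and the
per-CPU structure are made up for this sketch; the helper signatures are
the ones introduced below):

struct my_gov_cpu {
        struct update_util_data update_util;
        u64 last_update;        /* example of per-CPU governor state */
};

static DEFINE_PER_CPU(struct my_gov_cpu, my_gov_cpu_data);

static void my_gov_update(struct update_util_data *data, u64 time,
                          unsigned long util, unsigned long max)
{
        /* Runs in an RCU-sched read-side critical section; must not sleep. */
        struct my_gov_cpu *gc = container_of(data, struct my_gov_cpu,
                                             update_util);

        gc->last_update = time;
}

static void my_gov_start(int cpu)
{
        cpufreq_add_update_util_hook(cpu,
                                     &per_cpu(my_gov_cpu_data, cpu).update_util,
                                     my_gov_update);
}

static void my_gov_stop(int cpu)
{
        cpufreq_remove_update_util_hook(cpu);
        synchronize_sched();    /* wait for in-flight callbacks to finish */
}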
Index: linux-pm/include/linux/sched.h
===================================================================
--- linux-pm.orig/include/linux/sched.h
+++ linux-pm/include/linux/sched.h
@@ -3213,7 +3213,10 @@ struct update_util_data {
u64 time, unsigned long util, unsigned long max);
};
-void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);
+void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
+ void (*func)(struct update_util_data *data, u64 time,
+ unsigned long util, unsigned long max));
+void cpufreq_remove_update_util_hook(int cpu);
#endif /* CONFIG_CPU_FREQ */
#endif
Index: linux-pm/kernel/sched/cpufreq.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq.c
+++ linux-pm/kernel/sched/cpufreq.c
@@ -14,24 +14,50 @@
DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
/**
- * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer.
+ * cpufreq_add_update_util_hook - Populate the CPU's update_util_data pointer.
* @cpu: The CPU to set the pointer for.
* @data: New pointer value.
+ * @func: Callback function to set for the CPU.
*
- * Set and publish the update_util_data pointer for the given CPU. That pointer
- * points to a struct update_util_data object containing a callback function
- * to call from cpufreq_update_util(). That function will be called from an RCU
- * read-side critical section, so it must not sleep.
+ * Set and publish the update_util_data pointer for the given CPU.
*
- * Callers must use RCU-sched callbacks to free any memory that might be
- * accessed via the old update_util_data pointer or invoke synchronize_sched()
- * right after this function to avoid use-after-free.
+ * The update_util_data pointer of @cpu is set to @data and the callback
+ * function pointer in the target struct update_util_data is set to @func.
+ * That function will be called by cpufreq_update_util() from RCU-sched
+ * read-side critical sections, so it must not sleep. @data will always be
+ * passed to it as the first argument which allows the function to get to the
+ * target update_util_data structure and its container.
+ *
+ * The update_util_data pointer of @cpu must be NULL when this function is
+ * called or it will WARN() and return with no effect.
*/
-void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
+void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
+ void (*func)(struct update_util_data *data, u64 time,
+ unsigned long util, unsigned long max))
{
- if (WARN_ON(data && !data->func))
+ if (WARN_ON(!data || !func))
return;
+ if (WARN_ON(per_cpu(cpufreq_update_util_data, cpu)))
+ return;
+
+ data->func = func;
rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
}
-EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
+EXPORT_SYMBOL_GPL(cpufreq_add_update_util_hook);
+
+/**
+ * cpufreq_remove_update_util_hook - Clear the CPU's update_util_data pointer.
+ * @cpu: The CPU to clear the pointer for.
+ *
+ * Clear the update_util_data pointer for the given CPU.
+ *
+ * Callers must use RCU-sched callbacks to free any memory that might be
+ * accessed via the old update_util_data pointer or invoke synchronize_sched()
+ * right after this function to avoid use-after-free.
+ */
+void cpufreq_remove_update_util_hook(int cpu)
+{
+ rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), NULL);
+}
+EXPORT_SYMBOL_GPL(cpufreq_remove_update_util_hook);
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -258,43 +258,6 @@ unsigned int dbs_update(struct cpufreq_p
}
EXPORT_SYMBOL_GPL(dbs_update);
-static void gov_set_update_util(struct policy_dbs_info *policy_dbs,
- unsigned int delay_us)
-{
- struct cpufreq_policy *policy = policy_dbs->policy;
- int cpu;
-
- gov_update_sample_delay(policy_dbs, delay_us);
- policy_dbs->last_sample_time = 0;
-
- for_each_cpu(cpu, policy->cpus) {
- struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu);
-
- cpufreq_set_update_util_data(cpu, &cdbs->update_util);
- }
-}
-
-static inline void gov_clear_update_util(struct cpufreq_policy *policy)
-{
- int i;
-
- for_each_cpu(i, policy->cpus)
- cpufreq_set_update_util_data(i, NULL);
-
- synchronize_sched();
-}
-
-static void gov_cancel_work(struct cpufreq_policy *policy)
-{
- struct policy_dbs_info *policy_dbs = policy->governor_data;
-
- gov_clear_update_util(policy_dbs->policy);
- irq_work_sync(&policy_dbs->irq_work);
- cancel_work_sync(&policy_dbs->work);
- atomic_set(&policy_dbs->work_count, 0);
- policy_dbs->work_in_progress = false;
-}
-
static void dbs_work_handler(struct work_struct *work)
{
struct policy_dbs_info *policy_dbs;
@@ -382,6 +345,44 @@ static void dbs_update_util_handler(stru
irq_work_queue(&policy_dbs->irq_work);
}
+static void gov_set_update_util(struct policy_dbs_info *policy_dbs,
+ unsigned int delay_us)
+{
+ struct cpufreq_policy *policy = policy_dbs->policy;
+ int cpu;
+
+ gov_update_sample_delay(policy_dbs, delay_us);
+ policy_dbs->last_sample_time = 0;
+
+ for_each_cpu(cpu, policy->cpus) {
+ struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu);
+
+ cpufreq_add_update_util_hook(cpu, &cdbs->update_util,
+ dbs_update_util_handler);
+ }
+}
+
+static inline void gov_clear_update_util(struct cpufreq_policy *policy)
+{
+ int i;
+
+ for_each_cpu(i, policy->cpus)
+ cpufreq_remove_update_util_hook(i);
+
+ synchronize_sched();
+}
+
+static void gov_cancel_work(struct cpufreq_policy *policy)
+{
+ struct policy_dbs_info *policy_dbs = policy->governor_data;
+
+ gov_clear_update_util(policy_dbs->policy);
+ irq_work_sync(&policy_dbs->irq_work);
+ cancel_work_sync(&policy_dbs->work);
+ atomic_set(&policy_dbs->work_count, 0);
+ policy_dbs->work_in_progress = false;
+}
+
static struct policy_dbs_info *alloc_policy_dbs_info(struct cpufreq_policy *policy,
struct dbs_governor *gov)
{
@@ -404,7 +405,6 @@ static struct policy_dbs_info *alloc_pol
struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);
j_cdbs->policy_dbs = policy_dbs;
- j_cdbs->update_util.func = dbs_update_util_handler;
}
return policy_dbs;
}
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -1089,8 +1089,8 @@ static int intel_pstate_init_cpu(unsigne
intel_pstate_busy_pid_reset(cpu);
intel_pstate_sample(cpu, 0);
- cpu->update_util.func = intel_pstate_update_util;
- cpufreq_set_update_util_data(cpunum, &cpu->update_util);
+ cpufreq_add_update_util_hook(cpunum, &cpu->update_util,
+ intel_pstate_update_util);
pr_debug("intel_pstate: controlling: cpu %d\n", cpunum);
@@ -1174,7 +1174,7 @@ static void intel_pstate_stop_cpu(struct
pr_debug("intel_pstate: CPU %d exiting\n", cpu_num);
- cpufreq_set_update_util_data(cpu_num, NULL);
+ cpufreq_remove_update_util_hook(cpu_num);
synchronize_sched();
if (hwp_active)
@@ -1442,7 +1442,7 @@ out:
get_online_cpus();
for_each_online_cpu(cpu) {
if (all_cpu_data[cpu]) {
- cpufreq_set_update_util_data(cpu, NULL);
+ cpufreq_remove_update_util_hook(cpu);
synchronize_sched();
kfree(all_cpu_data[cpu]);
}
From: Rafael J. Wysocki <[email protected]>
In addition to fields representing governor tunables, struct dbs_data
contains some fields needed for the management of objects of that
type. As it turns out, that part of struct dbs_data may be shared
with (future) governors that won't use the common code used by
"ondemand" and "conservative", so move it to a separate struct type
and modify the code using struct dbs_data to follow.
Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---
No changes from the previous version.
---
drivers/cpufreq/cpufreq_conservative.c | 25 +++++----
drivers/cpufreq/cpufreq_governor.c | 90 ++++++++++++++++++++-------------
drivers/cpufreq/cpufreq_governor.h | 35 +++++++-----
drivers/cpufreq/cpufreq_ondemand.c | 29 ++++++----
4 files changed, 107 insertions(+), 72 deletions(-)
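For reference, a hypothetical governor that does not use the rest of the
dbs code could embed the shared part like this (my_tunables and
rate_limit_us are made-up names; the show() signature follows the
governor_attr change in this patch):

struct my_tunables {
        struct gov_attr_set attr_set;
        unsigned int rate_limit_us;
};

static inline struct my_tunables *to_my_tunables(struct gov_attr_set *attr_set)
{
        return container_of(attr_set, struct my_tunables, attr_set);
}

static ssize_t show_rate_limit_us(struct gov_attr_set *attr_set, char *buf)
{
        return sprintf(buf, "%u\n", to_my_tunables(attr_set)->rate_limit_us);
}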
Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -41,6 +41,13 @@
/* Ondemand Sampling types */
enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};
+struct gov_attr_set {
+ struct kobject kobj;
+ struct list_head policy_list;
+ struct mutex update_lock;
+ int usage_count;
+};
+
/*
* Abbreviations:
* dbs: used as a shortform for demand based switching It helps to keep variable
@@ -52,7 +59,7 @@ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};
/* Governor demand based switching data (per-policy or global). */
struct dbs_data {
- int usage_count;
+ struct gov_attr_set attr_set;
void *tuners;
unsigned int min_sampling_rate;
unsigned int ignore_nice_load;
@@ -60,37 +67,35 @@ struct dbs_data {
unsigned int sampling_down_factor;
unsigned int up_threshold;
unsigned int io_is_busy;
-
- struct kobject kobj;
- struct list_head policy_dbs_list;
- /*
- * Protect concurrent updates to governor tunables from sysfs,
- * policy_dbs_list and usage_count.
- */
- struct mutex mutex;
};
+static inline struct dbs_data *to_dbs_data(struct gov_attr_set *attr_set)
+{
+ return container_of(attr_set, struct dbs_data, attr_set);
+}
+
/* Governor's specific attributes */
-struct dbs_data;
struct governor_attr {
struct attribute attr;
- ssize_t (*show)(struct dbs_data *dbs_data, char *buf);
- ssize_t (*store)(struct dbs_data *dbs_data, const char *buf,
+ ssize_t (*show)(struct gov_attr_set *attr_set, char *buf);
+ ssize_t (*store)(struct gov_attr_set *attr_set, const char *buf,
size_t count);
};
#define gov_show_one(_gov, file_name) \
static ssize_t show_##file_name \
-(struct dbs_data *dbs_data, char *buf) \
+(struct gov_attr_set *attr_set, char *buf) \
{ \
+ struct dbs_data *dbs_data = to_dbs_data(attr_set); \
struct _gov##_dbs_tuners *tuners = dbs_data->tuners; \
return sprintf(buf, "%u\n", tuners->file_name); \
}
#define gov_show_one_common(file_name) \
static ssize_t show_##file_name \
-(struct dbs_data *dbs_data, char *buf) \
+(struct gov_attr_set *attr_set, char *buf) \
{ \
+ struct dbs_data *dbs_data = to_dbs_data(attr_set); \
return sprintf(buf, "%u\n", dbs_data->file_name); \
}
@@ -184,7 +189,7 @@ void od_register_powersave_bias_handler(
(struct cpufreq_policy *, unsigned int, unsigned int),
unsigned int powersave_bias);
void od_unregister_powersave_bias_handler(void);
-ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf,
+ssize_t store_sampling_rate(struct gov_attr_set *attr_set, const char *buf,
size_t count);
void gov_update_cpu_data(struct dbs_data *dbs_data);
#endif /* _CPUFREQ_GOVERNOR_H */
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -43,9 +43,10 @@ static DEFINE_MUTEX(gov_dbs_data_mutex);
* This must be called with dbs_data->mutex held, otherwise traversing
* policy_dbs_list isn't safe.
*/
-ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf,
+ssize_t store_sampling_rate(struct gov_attr_set *attr_set, const char *buf,
size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct policy_dbs_info *policy_dbs;
unsigned int rate;
int ret;
@@ -59,7 +60,7 @@ ssize_t store_sampling_rate(struct dbs_d
* We are operating under dbs_data->mutex and so the list and its
* entries can't be freed concurrently.
*/
- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) {
+ list_for_each_entry(policy_dbs, &attr_set->policy_list, list) {
mutex_lock(&policy_dbs->timer_mutex);
/*
* On 32-bit architectures this may race with the
@@ -96,7 +97,7 @@ void gov_update_cpu_data(struct dbs_data
{
struct policy_dbs_info *policy_dbs;
- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) {
+ list_for_each_entry(policy_dbs, &dbs_data->attr_set.policy_list, list) {
unsigned int j;
for_each_cpu(j, policy_dbs->policy->cpus) {
@@ -111,9 +112,9 @@ void gov_update_cpu_data(struct dbs_data
}
EXPORT_SYMBOL_GPL(gov_update_cpu_data);
-static inline struct dbs_data *to_dbs_data(struct kobject *kobj)
+static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj)
{
- return container_of(kobj, struct dbs_data, kobj);
+ return container_of(kobj, struct gov_attr_set, kobj);
}
static inline struct governor_attr *to_gov_attr(struct attribute *attr)
@@ -124,25 +125,24 @@ static inline struct governor_attr *to_g
static ssize_t governor_show(struct kobject *kobj, struct attribute *attr,
char *buf)
{
- struct dbs_data *dbs_data = to_dbs_data(kobj);
struct governor_attr *gattr = to_gov_attr(attr);
- return gattr->show(dbs_data, buf);
+ return gattr->show(to_gov_attr_set(kobj), buf);
}
static ssize_t governor_store(struct kobject *kobj, struct attribute *attr,
const char *buf, size_t count)
{
- struct dbs_data *dbs_data = to_dbs_data(kobj);
+ struct gov_attr_set *attr_set = to_gov_attr_set(kobj);
struct governor_attr *gattr = to_gov_attr(attr);
int ret = -EBUSY;
- mutex_lock(&dbs_data->mutex);
+ mutex_lock(&attr_set->update_lock);
- if (dbs_data->usage_count)
- ret = gattr->store(dbs_data, buf, count);
+ if (attr_set->usage_count)
+ ret = gattr->store(attr_set, buf, count);
- mutex_unlock(&dbs_data->mutex);
+ mutex_unlock(&attr_set->update_lock);
return ret;
}
@@ -425,6 +425,41 @@ static void free_policy_dbs_info(struct
gov->free(policy_dbs);
}
+static void gov_attr_set_init(struct gov_attr_set *attr_set,
+ struct list_head *list_node)
+{
+ INIT_LIST_HEAD(&attr_set->policy_list);
+ mutex_init(&attr_set->update_lock);
+ attr_set->usage_count = 1;
+ list_add(list_node, &attr_set->policy_list);
+}
+
+static void gov_attr_set_get(struct gov_attr_set *attr_set,
+ struct list_head *list_node)
+{
+ mutex_lock(&attr_set->update_lock);
+ attr_set->usage_count++;
+ list_add(list_node, &attr_set->policy_list);
+ mutex_unlock(&attr_set->update_lock);
+}
+
+static unsigned int gov_attr_set_put(struct gov_attr_set *attr_set,
+ struct list_head *list_node)
+{
+ unsigned int count;
+
+ mutex_lock(&attr_set->update_lock);
+ list_del(list_node);
+ count = --attr_set->usage_count;
+ mutex_unlock(&attr_set->update_lock);
+ if (count)
+ return count;
+
+ kobject_put(&attr_set->kobj);
+ mutex_destroy(&attr_set->update_lock);
+ return 0;
+}
+
static int cpufreq_governor_init(struct cpufreq_policy *policy)
{
struct dbs_governor *gov = dbs_governor_of(policy);
@@ -453,10 +488,7 @@ static int cpufreq_governor_init(struct
policy_dbs->dbs_data = dbs_data;
policy->governor_data = policy_dbs;
- mutex_lock(&dbs_data->mutex);
- dbs_data->usage_count++;
- list_add(&policy_dbs->list, &dbs_data->policy_dbs_list);
- mutex_unlock(&dbs_data->mutex);
+ gov_attr_set_get(&dbs_data->attr_set, &policy_dbs->list);
goto out;
}
@@ -466,8 +498,7 @@ static int cpufreq_governor_init(struct
goto free_policy_dbs_info;
}
- INIT_LIST_HEAD(&dbs_data->policy_dbs_list);
- mutex_init(&dbs_data->mutex);
+ gov_attr_set_init(&dbs_data->attr_set, &policy_dbs->list);
ret = gov->init(dbs_data, !policy->governor->initialized);
if (ret)
@@ -487,14 +518,11 @@ static int cpufreq_governor_init(struct
if (!have_governor_per_policy())
gov->gdbs_data = dbs_data;
- policy->governor_data = policy_dbs;
-
policy_dbs->dbs_data = dbs_data;
- dbs_data->usage_count = 1;
- list_add(&policy_dbs->list, &dbs_data->policy_dbs_list);
+ policy->governor_data = policy_dbs;
gov->kobj_type.sysfs_ops = &governor_sysfs_ops;
- ret = kobject_init_and_add(&dbs_data->kobj, &gov->kobj_type,
+ ret = kobject_init_and_add(&dbs_data->attr_set.kobj, &gov->kobj_type,
get_governor_parent_kobj(policy),
"%s", gov->gov.name);
if (!ret)
@@ -523,29 +551,21 @@ static int cpufreq_governor_exit(struct
struct dbs_governor *gov = dbs_governor_of(policy);
struct policy_dbs_info *policy_dbs = policy->governor_data;
struct dbs_data *dbs_data = policy_dbs->dbs_data;
- int count;
+ unsigned int count;
/* Protect gov->gdbs_data against concurrent updates. */
mutex_lock(&gov_dbs_data_mutex);
- mutex_lock(&dbs_data->mutex);
- list_del(&policy_dbs->list);
- count = --dbs_data->usage_count;
- mutex_unlock(&dbs_data->mutex);
+ count = gov_attr_set_put(&dbs_data->attr_set, &policy_dbs->list);
- if (!count) {
- kobject_put(&dbs_data->kobj);
-
- policy->governor_data = NULL;
+ policy->governor_data = NULL;
+ if (!count) {
if (!have_governor_per_policy())
gov->gdbs_data = NULL;
gov->exit(dbs_data, policy->governor->initialized == 1);
- mutex_destroy(&dbs_data->mutex);
kfree(dbs_data);
- } else {
- policy->governor_data = NULL;
}
free_policy_dbs_info(policy_dbs, gov);
Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
+++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
@@ -207,9 +207,10 @@ static unsigned int od_dbs_timer(struct
/************************** sysfs interface ************************/
static struct dbs_governor od_dbs_gov;
-static ssize_t store_io_is_busy(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_io_is_busy(struct gov_attr_set *attr_set, const char *buf,
+ size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;
@@ -224,9 +225,10 @@ static ssize_t store_io_is_busy(struct d
return count;
}
-static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_up_threshold(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
@@ -240,9 +242,10 @@ static ssize_t store_up_threshold(struct
return count;
}
-static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data,
- const char *buf, size_t count)
+static ssize_t store_sampling_down_factor(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct policy_dbs_info *policy_dbs;
unsigned int input;
int ret;
@@ -254,7 +257,7 @@ static ssize_t store_sampling_down_facto
dbs_data->sampling_down_factor = input;
/* Reset down sampling multiplier in case it was active */
- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) {
+ list_for_each_entry(policy_dbs, &attr_set->policy_list, list) {
/*
* Doing this without locking might lead to using different
* rate_mult values in od_update() and od_dbs_timer().
@@ -267,9 +270,10 @@ static ssize_t store_sampling_down_facto
return count;
}
-static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data,
- const char *buf, size_t count)
+static ssize_t store_ignore_nice_load(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;
@@ -291,9 +295,10 @@ static ssize_t store_ignore_nice_load(st
return count;
}
-static ssize_t store_powersave_bias(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_powersave_bias(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct od_dbs_tuners *od_tuners = dbs_data->tuners;
struct policy_dbs_info *policy_dbs;
unsigned int input;
@@ -308,7 +313,7 @@ static ssize_t store_powersave_bias(stru
od_tuners->powersave_bias = input;
- list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list)
+ list_for_each_entry(policy_dbs, &attr_set->policy_list, list)
ondemand_powersave_bias_init(policy_dbs->policy);
return count;
Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c
+++ linux-pm/drivers/cpufreq/cpufreq_conservative.c
@@ -129,9 +129,10 @@ static struct notifier_block cs_cpufreq_
/************************** sysfs interface ************************/
static struct dbs_governor cs_dbs_gov;
-static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data,
- const char *buf, size_t count)
+static ssize_t store_sampling_down_factor(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
@@ -143,9 +144,10 @@ static ssize_t store_sampling_down_facto
return count;
}
-static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_up_threshold(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input;
int ret;
@@ -158,9 +160,10 @@ static ssize_t store_up_threshold(struct
return count;
}
-static ssize_t store_down_threshold(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_down_threshold(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input;
int ret;
@@ -175,9 +178,10 @@ static ssize_t store_down_threshold(stru
return count;
}
-static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data,
- const char *buf, size_t count)
+static ssize_t store_ignore_nice_load(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
unsigned int input;
int ret;
@@ -199,9 +203,10 @@ static ssize_t store_ignore_nice_load(st
return count;
}
-static ssize_t store_freq_step(struct dbs_data *dbs_data, const char *buf,
- size_t count)
+static ssize_t store_freq_step(struct gov_attr_set *attr_set, const char *buf,
+ size_t count)
{
+ struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input;
int ret;
From: Rafael J. Wysocki <[email protected]>
Move definitions and function headers related to struct gov_attr_set
to include/linux/cpufreq.h so they can be used by (future) governors
located outside of drivers/cpufreq/.
No functional changes.
Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---
This one was present in v2, no changes since then.
---
drivers/cpufreq/cpufreq_governor.h | 21 ---------------------
include/linux/cpufreq.h | 23 +++++++++++++++++++++++
2 files changed, 23 insertions(+), 21 deletions(-)
Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -41,19 +41,6 @@
/* Ondemand Sampling types */
enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};
-struct gov_attr_set {
- struct kobject kobj;
- struct list_head policy_list;
- struct mutex update_lock;
- int usage_count;
-};
-
-extern const struct sysfs_ops governor_sysfs_ops;
-
-void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node);
-void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node);
-unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node);
-
/*
* Abbreviations:
* dbs: used as a shortform for demand based switching It helps to keep variable
@@ -80,14 +67,6 @@ static inline struct dbs_data *to_dbs_da
return container_of(attr_set, struct dbs_data, attr_set);
}
-/* Governor's specific attributes */
-struct governor_attr {
- struct attribute attr;
- ssize_t (*show)(struct gov_attr_set *attr_set, char *buf);
- ssize_t (*store)(struct gov_attr_set *attr_set, const char *buf,
- size_t count);
-};
-
#define gov_show_one(_gov, file_name) \
static ssize_t show_##file_name \
(struct gov_attr_set *attr_set, char *buf) \
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -462,6 +462,29 @@ void cpufreq_unregister_governor(struct
struct cpufreq_governor *cpufreq_default_governor(void);
struct cpufreq_governor *cpufreq_fallback_governor(void);
+/* Governor attribute set */
+struct gov_attr_set {
+ struct kobject kobj;
+ struct list_head policy_list;
+ struct mutex update_lock;
+ int usage_count;
+};
+
+/* sysfs ops for cpufreq governors */
+extern const struct sysfs_ops governor_sysfs_ops;
+
+void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node);
+void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node);
+unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node);
+
+/* Governor sysfs attribute */
+struct governor_attr {
+ struct attribute attr;
+ ssize_t (*show)(struct gov_attr_set *attr_set, char *buf);
+ ssize_t (*store)(struct gov_attr_set *attr_set, const char *buf,
+ size_t count);
+};
+
/*********************************************************************
* FREQUENCY TABLE HELPERS *
*********************************************************************/
From: Rafael J. Wysocki <[email protected]>
Add a new cpufreq scaling governor, called "schedutil", that uses
scheduler-provided CPU utilization information as input for making
its decisions.
Doing that is possible after commit 34e2c555f3e1 (cpufreq: Add
mechanism for registering utilization update callbacks) that
introduced cpufreq_update_util() called by the scheduler on
utilization changes (from CFS) and RT/DL task status updates.
In particular, CPU frequency scaling decisions may be based on
the utilization data passed to cpufreq_update_util() by CFS.
The new governor is relatively simple.
The frequency selection formula used by it depends on whether or not
the utilization is frequency-invariant. In the frequency-invariant
case the new CPU frequency is given by
next_freq = 1.25 * max_freq * util / max
where util and max are the last two arguments of cpufreq_update_util().
In turn, if util is not frequency-invariant, the maximum frequency in
the above formula is replaced with the current frequency of the CPU:
next_freq = 1.25 * curr_freq * util / max
The coefficient 1.25 corresponds to the frequency tipping point at
(util / max) = 0.8.
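For illustration only (the numbers are not from the patch): on a
frequency-invariant system with max_freq = 2000 MHz and util / max = 0.6,
the formula yields next_freq = 1.25 * 2000 MHz * 0.6 = 1500 MHz, and the
request reaches max_freq itself once util / max hits 0.8.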
All of the computations are carried out in the utilization update
handlers provided by the new governor. One of those handlers is
used for cpufreq policies shared between multiple CPUs and the other
one is for policies with one CPU only (and therefore it doesn't need
any extra synchronization).
The governor supports fast frequency switching if that is supported
by the cpufreq driver in use and possible for the given policy.
In the fast switching case, all operations of the governor take
place in its utilization update handlers. If fast switching cannot
be used, the frequency switch operations are carried out with the
help of a work item which only calls __cpufreq_driver_target()
(under a mutex) to trigger a frequency update (to a value already
computed beforehand in one of the utilization update handlers).
Currently, the governor treats all of the RT and DL tasks as
"unknown utilization" and sets the frequency to the allowed
maximum when updated from the RT or DL sched classes. That
heavy-handed approach should be replaced with something more
subtle and specifically targeted at RT and DL tasks.
The governor shares some tunables management code with the
"ondemand" and "conservative" governors and uses some common
definitions from cpufreq_governor.h, but apart from that it
is stand-alone.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
The reason I decided to use policy->cur as the current frequency
representation is that it happens to hold the value in question in
both the "fast switch" and the "work item" cases quite naturally.
The (rather theoretical) concern about it is that policy->cur
may be updated by the core asynchronously if it thinks that it
got out of sync with the "real" setting (as reported by the
driver's ->get routine). I don't think it will turn out to be
a real problem in practice, though.
Changes from v3:
- The "next frequency" formula based on
http://marc.info/?l=linux-acpi&m=145756618321500&w=4 and
http://marc.info/?l=linux-kernel&m=145760739700716&w=4
- The governor goes into kernel/sched/ (again).
Changes from v2:
- The governor goes into drivers/cpufreq/.
- The "next frequency" formula has an additional 1.1 factor to allow
more util/max values to map onto the top-most frequency in case the
distance between that and the previous one is disproportionately small.
- sugov_update_commit() traces CPU frequency even if the new one is
the same as the previous one (otherwise, if the system is 100% loaded
for long enough, powertop starts to report that all CPUs are 100% idle).
---
drivers/cpufreq/Kconfig | 26 +
kernel/sched/Makefile | 1
kernel/sched/cpufreq_schedutil.c | 531 +++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 8
4 files changed, 566 insertions(+)
Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -107,6 +107,16 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
Be aware that not all cpufreq drivers support the conservative
governor. If unsure have a look at the help section of the
driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+ bool "schedutil"
+ select CPU_FREQ_GOV_SCHEDUTIL
+ select CPU_FREQ_GOV_PERFORMANCE
+ help
+ Use the 'schedutil' CPUFreq governor by default. If unsure,
+ have a look at the help section of that governor. The fallback
+ governor will be 'performance'.
+
endchoice
config CPU_FREQ_GOV_PERFORMANCE
@@ -188,6 +198,22 @@ config CPU_FREQ_GOV_CONSERVATIVE
If in doubt, say N.
+config CPU_FREQ_GOV_SCHEDUTIL
+ tristate "'schedutil' cpufreq policy governor"
+ depends on CPU_FREQ
+ select CPU_FREQ_GOV_ATTR_SET
+ select IRQ_WORK
+ help
+ The frequency selection formula used by this governor is analogous
+ to the one used by 'ondemand', but instead of computing CPU load
+ as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
+ utilization data provided by the scheduler as input.
+
+ To compile this driver as a module, choose M here: the
+ module will be called cpufreq_schedutil.
+
+ If in doubt, say N.
+
comment "CPU frequency scaling drivers"
config CPUFREQ_DT
Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- /dev/null
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -0,0 +1,531 @@
+/*
+ * CPUFreq governor based on scheduler-provided CPU utilization data.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <trace/events/power.h>
+
+#include "sched.h"
+
+struct sugov_tunables {
+ struct gov_attr_set attr_set;
+ unsigned int rate_limit_us;
+};
+
+struct sugov_policy {
+ struct cpufreq_policy *policy;
+
+ struct sugov_tunables *tunables;
+ struct list_head tunables_hook;
+
+ raw_spinlock_t update_lock; /* For shared policies */
+ u64 last_freq_update_time;
+ s64 freq_update_delay_ns;
+ unsigned int next_freq;
+
+ /* The next fields are only needed if fast switch cannot be used. */
+ struct irq_work irq_work;
+ struct work_struct work;
+ struct mutex work_lock;
+ bool work_in_progress;
+
+ bool need_freq_update;
+};
+
+struct sugov_cpu {
+ struct update_util_data update_util;
+ struct sugov_policy *sg_policy;
+
+ /* The fields below are only needed when sharing a policy. */
+ unsigned long util;
+ unsigned long max;
+ u64 last_update;
+};
+
+static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
+
+/************************ Governor internals ***********************/
+
+static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
+{
+ u64 delta_ns;
+
+ if (sg_policy->work_in_progress)
+ return false;
+
+ if (unlikely(sg_policy->need_freq_update)) {
+ sg_policy->need_freq_update = false;
+ return true;
+ }
+
+ delta_ns = time - sg_policy->last_freq_update_time;
+ return (s64)delta_ns >= sg_policy->freq_update_delay_ns;
+}
+
+static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
+ unsigned int next_freq)
+{
+ struct cpufreq_policy *policy = sg_policy->policy;
+
+ if (next_freq > policy->max)
+ next_freq = policy->max;
+ else if (next_freq < policy->min)
+ next_freq = policy->min;
+
+ sg_policy->last_freq_update_time = time;
+ if (sg_policy->next_freq == next_freq) {
+ if (policy->fast_switch_enabled)
+ trace_cpu_frequency(policy->cur, smp_processor_id());
+
+ return;
+ }
+
+ sg_policy->next_freq = next_freq;
+ if (policy->fast_switch_enabled) {
+ unsigned int freq;
+
+ freq = cpufreq_driver_fast_switch(policy, next_freq);
+ if (freq == CPUFREQ_ENTRY_INVALID)
+ return;
+
+ policy->cur = freq;
+ trace_cpu_frequency(freq, smp_processor_id());
+ } else {
+ sg_policy->work_in_progress = true;
+ irq_work_queue(&sg_policy->irq_work);
+ }
+}
+
+/**
+ * get_next_freq - Compute a new frequency for a given cpufreq policy.
+ * @policy: cpufreq policy object to compute the new frequency for.
+ * @util: Current CPU utilization.
+ * @max: CPU capacity.
+ *
+ * If the utilization is frequency-invariant, choose the new frequency to be
+ * proportional to it, that is
+ *
+ * next_freq = C * max_freq * util / max
+ *
+ * Otherwise, approximate the would-be frequency-invariant utilization by
+ * util_raw * (curr_freq / max_freq) which leads to
+ *
+ * next_freq = C * curr_freq * util_raw / max
+ *
+ * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8.
+ */
+static unsigned int get_next_freq(struct cpufreq_policy *policy,
+ unsigned long util, unsigned long max)
+{
+ unsigned int freq = arch_scale_freq_invariant() ?
+ policy->cpuinfo.max_freq : policy->cur;
+
+ return (freq + (freq >> 2)) * util / max;
+}
+
+static void sugov_update_single(struct update_util_data *hook, u64 time,
+ unsigned long util, unsigned long max)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ struct cpufreq_policy *policy = sg_policy->policy;
+ unsigned int next_f;
+
+ if (!sugov_should_update_freq(sg_policy, time))
+ return;
+
+ next_f = util <= max ?
+ get_next_freq(policy, util, max) : policy->cpuinfo.max_freq;
+ sugov_update_commit(sg_policy, time, next_f);
+}
+
+static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
+ unsigned long util, unsigned long max)
+{
+ struct cpufreq_policy *policy = sg_policy->policy;
+ unsigned int max_f = policy->cpuinfo.max_freq;
+ u64 last_freq_update_time = sg_policy->last_freq_update_time;
+ unsigned int j;
+
+ if (util > max)
+ return max_f;
+
+ for_each_cpu(j, policy->cpus) {
+ struct sugov_cpu *j_sg_cpu;
+ unsigned long j_util, j_max;
+ u64 delta_ns;
+
+ if (j == smp_processor_id())
+ continue;
+
+ j_sg_cpu = &per_cpu(sugov_cpu, j);
+ /*
+ * If the CPU utilization was last updated before the previous
+ * frequency update and the time elapsed between the last update
+ * of the CPU utilization and the last frequency update is long
+ * enough, don't take the CPU into account as it probably is
+ * idle now.
+ */
+ delta_ns = last_freq_update_time - j_sg_cpu->last_update;
+ if ((s64)delta_ns > NSEC_PER_SEC / HZ)
+ continue;
+
+ j_util = j_sg_cpu->util;
+ j_max = j_sg_cpu->max;
+ if (j_util > j_max)
+ return max_f;
+
+ if (j_util * max > j_max * util) {
+ util = j_util;
+ max = j_max;
+ }
+ }
+
+ return get_next_freq(policy, util, max);
+}
+
+static void sugov_update_shared(struct update_util_data *hook, u64 time,
+ unsigned long util, unsigned long max)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ unsigned int next_f;
+
+ raw_spin_lock(&sg_policy->update_lock);
+
+ sg_cpu->util = util;
+ sg_cpu->max = max;
+ sg_cpu->last_update = time;
+
+ if (sugov_should_update_freq(sg_policy, time)) {
+ next_f = sugov_next_freq_shared(sg_policy, util, max);
+ sugov_update_commit(sg_policy, time, next_f);
+ }
+
+ raw_spin_unlock(&sg_policy->update_lock);
+}
+
+static void sugov_work(struct work_struct *work)
+{
+ struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
+
+ mutex_lock(&sg_policy->work_lock);
+ __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
+ CPUFREQ_RELATION_L);
+ mutex_unlock(&sg_policy->work_lock);
+
+ sg_policy->work_in_progress = false;
+}
+
+static void sugov_irq_work(struct irq_work *irq_work)
+{
+ struct sugov_policy *sg_policy;
+
+ sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
+ schedule_work(&sg_policy->work);
+}
+
+/************************** sysfs interface ************************/
+
+static struct sugov_tunables *global_tunables;
+static DEFINE_MUTEX(global_tunables_lock);
+
+static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set)
+{
+ return container_of(attr_set, struct sugov_tunables, attr_set);
+}
+
+static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+
+ return sprintf(buf, "%u\n", tunables->rate_limit_us);
+}
+
+static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf,
+ size_t count)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+ struct sugov_policy *sg_policy;
+ unsigned int rate_limit_us;
+ int ret;
+
+ ret = sscanf(buf, "%u", &rate_limit_us);
+ if (ret != 1)
+ return -EINVAL;
+
+ tunables->rate_limit_us = rate_limit_us;
+
+ list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
+ sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
+
+ return count;
+}
+
+static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
+
+static struct attribute *sugov_attributes[] = {
+ &rate_limit_us.attr,
+ NULL
+};
+
+static struct kobj_type sugov_tunables_ktype = {
+ .default_attrs = sugov_attributes,
+ .sysfs_ops = &governor_sysfs_ops,
+};
+
+/********************** cpufreq governor interface *********************/
+
+static struct cpufreq_governor schedutil_gov;
+
+static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy;
+
+ sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
+ if (!sg_policy)
+ return NULL;
+
+ sg_policy->policy = policy;
+ init_irq_work(&sg_policy->irq_work, sugov_irq_work);
+ INIT_WORK(&sg_policy->work, sugov_work);
+ mutex_init(&sg_policy->work_lock);
+ raw_spin_lock_init(&sg_policy->update_lock);
+ return sg_policy;
+}
+
+static void sugov_policy_free(struct sugov_policy *sg_policy)
+{
+ mutex_destroy(&sg_policy->work_lock);
+ kfree(sg_policy);
+}
+
+static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy)
+{
+ struct sugov_tunables *tunables;
+
+ tunables = kzalloc(sizeof(*tunables), GFP_KERNEL);
+ if (tunables)
+ gov_attr_set_init(&tunables->attr_set, &sg_policy->tunables_hook);
+
+ return tunables;
+}
+
+static void sugov_tunables_free(struct sugov_tunables *tunables)
+{
+ if (!have_governor_per_policy())
+ global_tunables = NULL;
+
+ kfree(tunables);
+}
+
+static int sugov_init(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy;
+ struct sugov_tunables *tunables;
+ unsigned int lat;
+ int ret = 0;
+
+ /* State should be equivalent to EXIT */
+ if (policy->governor_data)
+ return -EBUSY;
+
+ sg_policy = sugov_policy_alloc(policy);
+ if (!sg_policy)
+ return -ENOMEM;
+
+ mutex_lock(&global_tunables_lock);
+
+ if (global_tunables) {
+ if (WARN_ON(have_governor_per_policy())) {
+ ret = -EINVAL;
+ goto free_sg_policy;
+ }
+ policy->governor_data = sg_policy;
+ sg_policy->tunables = global_tunables;
+
+ gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook);
+ goto out;
+ }
+
+ tunables = sugov_tunables_alloc(sg_policy);
+ if (!tunables) {
+ ret = -ENOMEM;
+ goto free_sg_policy;
+ }
+
+ tunables->rate_limit_us = LATENCY_MULTIPLIER;
+ lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
+ if (lat)
+ tunables->rate_limit_us *= lat;
+
+ if (!have_governor_per_policy())
+ global_tunables = tunables;
+
+ policy->governor_data = sg_policy;
+ sg_policy->tunables = tunables;
+
+ ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
+ get_governor_parent_kobj(policy), "%s",
+ schedutil_gov.name);
+ if (!ret)
+ goto out;
+
+ /* Failure, so roll back. */
+ policy->governor_data = NULL;
+ sugov_tunables_free(tunables);
+
+ free_sg_policy:
+ pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
+ sugov_policy_free(sg_policy);
+
+ out:
+ mutex_unlock(&global_tunables_lock);
+ return ret;
+}
+
+static int sugov_exit(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ struct sugov_tunables *tunables = sg_policy->tunables;
+ unsigned int count;
+
+ mutex_lock(&global_tunables_lock);
+
+ count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook);
+ policy->governor_data = NULL;
+ if (!count)
+ sugov_tunables_free(tunables);
+
+ mutex_unlock(&global_tunables_lock);
+
+ sugov_policy_free(sg_policy);
+ return 0;
+}
+
+static int sugov_start(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ unsigned int cpu;
+
+ cpufreq_enable_fast_switch(policy);
+
+ sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+ sg_policy->last_freq_update_time = 0;
+ sg_policy->next_freq = UINT_MAX;
+ sg_policy->work_in_progress = false;
+ sg_policy->need_freq_update = false;
+
+ for_each_cpu(cpu, policy->cpus) {
+ struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
+
+ sg_cpu->sg_policy = sg_policy;
+ if (policy_is_shared(policy)) {
+ sg_cpu->util = ULONG_MAX;
+ sg_cpu->max = 0;
+ sg_cpu->last_update = 0;
+ cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
+ sugov_update_shared);
+ } else {
+ cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
+ sugov_update_single);
+ }
+ }
+ return 0;
+}
+
+static int sugov_stop(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ unsigned int cpu;
+
+ for_each_cpu(cpu, policy->cpus)
+ cpufreq_remove_update_util_hook(cpu);
+
+ synchronize_sched();
+
+ irq_work_sync(&sg_policy->irq_work);
+ cancel_work_sync(&sg_policy->work);
+ return 0;
+}
+
+static int sugov_limits(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+
+ if (!policy->fast_switch_enabled) {
+ mutex_lock(&sg_policy->work_lock);
+
+ if (policy->max < policy->cur)
+ __cpufreq_driver_target(policy, policy->max,
+ CPUFREQ_RELATION_H);
+ else if (policy->min > policy->cur)
+ __cpufreq_driver_target(policy, policy->min,
+ CPUFREQ_RELATION_L);
+
+ mutex_unlock(&sg_policy->work_lock);
+ }
+
+ sg_policy->need_freq_update = true;
+ return 0;
+}
+
+int sugov_governor(struct cpufreq_policy *policy, unsigned int event)
+{
+ if (event == CPUFREQ_GOV_POLICY_INIT) {
+ return sugov_init(policy);
+ } else if (policy->governor_data) {
+ switch (event) {
+ case CPUFREQ_GOV_POLICY_EXIT:
+ return sugov_exit(policy);
+ case CPUFREQ_GOV_START:
+ return sugov_start(policy);
+ case CPUFREQ_GOV_STOP:
+ return sugov_stop(policy);
+ case CPUFREQ_GOV_LIMITS:
+ return sugov_limits(policy);
+ }
+ }
+ return -EINVAL;
+}
+
+static struct cpufreq_governor schedutil_gov = {
+ .name = "schedutil",
+ .governor = sugov_governor,
+ .owner = THIS_MODULE,
+};
+
+static int __init sugov_module_init(void)
+{
+ return cpufreq_register_governor(&schedutil_gov);
+}
+
+static void __exit sugov_module_exit(void)
+{
+ cpufreq_unregister_governor(&schedutil_gov);
+}
+
+MODULE_AUTHOR("Rafael J. Wysocki <[email protected]>");
+MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
+MODULE_LICENSE("GPL");
+
+#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+struct cpufreq_governor *cpufreq_default_governor(void)
+{
+ return &schedutil_gov;
+}
+
+fs_initcall(sugov_module_init);
+#else
+module_init(sugov_module_init);
+#endif
+module_exit(sugov_module_exit);
Index: linux-pm/kernel/sched/Makefile
===================================================================
--- linux-pm.orig/kernel/sched/Makefile
+++ linux-pm/kernel/sched/Makefile
@@ -20,3 +20,4 @@ obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
Index: linux-pm/kernel/sched/sched.h
===================================================================
--- linux-pm.orig/kernel/sched/sched.h
+++ linux-pm/kernel/sched/sched.h
@@ -1786,3 +1786,11 @@ static inline void cpufreq_trigger_updat
static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) {}
static inline void cpufreq_trigger_update(u64 time) {}
#endif /* CONFIG_CPU_FREQ */
+
+#ifdef arch_scale_freq_capacity
+#ifndef arch_scale_freq_invariant
+#define arch_scale_freq_invariant() (true)
+#endif
+#else /* arch_scale_freq_capacity */
+#define arch_scale_freq_invariant() (false)
+#endif
From: Rafael J. Wysocki <[email protected]>
Move abstract code related to struct gov_attr_set to a separate (new)
file so it can be shared with (future) governors that won't share any
more code with "ondemand" and "conservative".
No intentional functional changes.
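For reference, a minimal sketch of the intended usage from a governor's
point of view (this mirrors what the schedutil patch later in the series
does; "tunables" and "pol" below are placeholder structures embedding a
struct gov_attr_set and a struct list_head, respectively):

	/* First policy: allocate the tunables and initialize the set. */
	gov_attr_set_init(&tunables->attr_set, &pol->attr_node);

	/* Another policy sharing the same tunables: take a reference. */
	gov_attr_set_get(&tunables->attr_set, &pol->attr_node);

	/* Exit path: drop the reference and free on the last put. */
	if (!gov_attr_set_put(&tunables->attr_set, &pol->attr_node))
		kfree(tunables);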
Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---
No changes from the previous version.
---
drivers/cpufreq/Kconfig | 4 +
drivers/cpufreq/Makefile | 1
drivers/cpufreq/cpufreq_governor.c | 82 ---------------------------
drivers/cpufreq/cpufreq_governor.h | 6 ++
drivers/cpufreq/cpufreq_governor_attr_set.c | 84 ++++++++++++++++++++++++++++
5 files changed, 95 insertions(+), 82 deletions(-)
Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -18,7 +18,11 @@ config CPU_FREQ
if CPU_FREQ
+config CPU_FREQ_GOV_ATTR_SET
+ bool
+
config CPU_FREQ_GOV_COMMON
+ select CPU_FREQ_GOV_ATTR_SET
select IRQ_WORK
bool
Index: linux-pm/drivers/cpufreq/Makefile
===================================================================
--- linux-pm.orig/drivers/cpufreq/Makefile
+++ linux-pm/drivers/cpufreq/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE) +=
obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND) += cpufreq_ondemand.o
obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE) += cpufreq_conservative.o
obj-$(CONFIG_CPU_FREQ_GOV_COMMON) += cpufreq_governor.o
+obj-$(CONFIG_CPU_FREQ_GOV_ATTR_SET) += cpufreq_governor_attr_set.o
obj-$(CONFIG_CPUFREQ_DT) += cpufreq-dt.o
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -112,53 +112,6 @@ void gov_update_cpu_data(struct dbs_data
}
EXPORT_SYMBOL_GPL(gov_update_cpu_data);
-static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj)
-{
- return container_of(kobj, struct gov_attr_set, kobj);
-}
-
-static inline struct governor_attr *to_gov_attr(struct attribute *attr)
-{
- return container_of(attr, struct governor_attr, attr);
-}
-
-static ssize_t governor_show(struct kobject *kobj, struct attribute *attr,
- char *buf)
-{
- struct governor_attr *gattr = to_gov_attr(attr);
-
- return gattr->show(to_gov_attr_set(kobj), buf);
-}
-
-static ssize_t governor_store(struct kobject *kobj, struct attribute *attr,
- const char *buf, size_t count)
-{
- struct gov_attr_set *attr_set = to_gov_attr_set(kobj);
- struct governor_attr *gattr = to_gov_attr(attr);
- int ret = -EBUSY;
-
- mutex_lock(&attr_set->update_lock);
-
- if (attr_set->usage_count)
- ret = gattr->store(attr_set, buf, count);
-
- mutex_unlock(&attr_set->update_lock);
-
- return ret;
-}
-
-/*
- * Sysfs Ops for accessing governor attributes.
- *
- * All show/store invocations for governor specific sysfs attributes, will first
- * call the below show/store callbacks and the attribute specific callback will
- * be called from within it.
- */
-static const struct sysfs_ops governor_sysfs_ops = {
- .show = governor_show,
- .store = governor_store,
-};
-
unsigned int dbs_update(struct cpufreq_policy *policy)
{
struct policy_dbs_info *policy_dbs = policy->governor_data;
@@ -425,41 +378,6 @@ static void free_policy_dbs_info(struct
gov->free(policy_dbs);
}
-static void gov_attr_set_init(struct gov_attr_set *attr_set,
- struct list_head *list_node)
-{
- INIT_LIST_HEAD(&attr_set->policy_list);
- mutex_init(&attr_set->update_lock);
- attr_set->usage_count = 1;
- list_add(list_node, &attr_set->policy_list);
-}
-
-static void gov_attr_set_get(struct gov_attr_set *attr_set,
- struct list_head *list_node)
-{
- mutex_lock(&attr_set->update_lock);
- attr_set->usage_count++;
- list_add(list_node, &attr_set->policy_list);
- mutex_unlock(&attr_set->update_lock);
-}
-
-static unsigned int gov_attr_set_put(struct gov_attr_set *attr_set,
- struct list_head *list_node)
-{
- unsigned int count;
-
- mutex_lock(&attr_set->update_lock);
- list_del(list_node);
- count = --attr_set->usage_count;
- mutex_unlock(&attr_set->update_lock);
- if (count)
- return count;
-
- kobject_put(&attr_set->kobj);
- mutex_destroy(&attr_set->update_lock);
- return 0;
-}
-
static int cpufreq_governor_init(struct cpufreq_policy *policy)
{
struct dbs_governor *gov = dbs_governor_of(policy);
Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -48,6 +48,12 @@ struct gov_attr_set {
int usage_count;
};
+extern const struct sysfs_ops governor_sysfs_ops;
+
+void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node);
+void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node);
+unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node);
+
/*
* Abbreviations:
* dbs: used as a shortform for demand based switching It helps to keep variable
Index: linux-pm/drivers/cpufreq/cpufreq_governor_attr_set.c
===================================================================
--- /dev/null
+++ linux-pm/drivers/cpufreq/cpufreq_governor_attr_set.c
@@ -0,0 +1,84 @@
+/*
+ * Abstract code for CPUFreq governor tunable sysfs attributes.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "cpufreq_governor.h"
+
+static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj)
+{
+ return container_of(kobj, struct gov_attr_set, kobj);
+}
+
+static inline struct governor_attr *to_gov_attr(struct attribute *attr)
+{
+ return container_of(attr, struct governor_attr, attr);
+}
+
+static ssize_t governor_show(struct kobject *kobj, struct attribute *attr,
+ char *buf)
+{
+ struct governor_attr *gattr = to_gov_attr(attr);
+
+ return gattr->show(to_gov_attr_set(kobj), buf);
+}
+
+static ssize_t governor_store(struct kobject *kobj, struct attribute *attr,
+ const char *buf, size_t count)
+{
+ struct gov_attr_set *attr_set = to_gov_attr_set(kobj);
+ struct governor_attr *gattr = to_gov_attr(attr);
+ int ret;
+
+ mutex_lock(&attr_set->update_lock);
+ ret = attr_set->usage_count ? gattr->store(attr_set, buf, count) : -EBUSY;
+ mutex_unlock(&attr_set->update_lock);
+ return ret;
+}
+
+const struct sysfs_ops governor_sysfs_ops = {
+ .show = governor_show,
+ .store = governor_store,
+};
+EXPORT_SYMBOL_GPL(governor_sysfs_ops);
+
+void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node)
+{
+ INIT_LIST_HEAD(&attr_set->policy_list);
+ mutex_init(&attr_set->update_lock);
+ attr_set->usage_count = 1;
+ list_add(list_node, &attr_set->policy_list);
+}
+EXPORT_SYMBOL_GPL(gov_attr_set_init);
+
+void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node)
+{
+ mutex_lock(&attr_set->update_lock);
+ attr_set->usage_count++;
+ list_add(list_node, &attr_set->policy_list);
+ mutex_unlock(&attr_set->update_lock);
+}
+EXPORT_SYMBOL_GPL(gov_attr_set_get);
+
+unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node)
+{
+ unsigned int count;
+
+ mutex_lock(&attr_set->update_lock);
+ list_del(list_node);
+ count = --attr_set->usage_count;
+ mutex_unlock(&attr_set->update_lock);
+ if (count)
+ return count;
+
+ kobject_put(&attr_set->kobj);
+ mutex_destroy(&attr_set->update_lock);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(gov_attr_set_put);
Could you please start a new thread for each posting? I only
accidentally saw this.
On Wed, Mar 16, 2016 at 03:52:28PM +0100, Rafael J. Wysocki wrote:
> +/**
> + * cpufreq_enable_fast_switch - Enable fast frequency switching for policy.
> + * @policy: cpufreq policy to enable fast frequency switching for.
> + *
> + * Try to enable fast frequency switching for @policy.
> + *
> + * The attempt will fail if there is at least one transition notifier registered
> + * at this point, as fast frequency switching is quite fundamentally at odds
> + * with transition notifiers. Thus if successful, it will make registration of
> + * transition notifiers fail going forward.
> + *
> + * Call under policy->rwsem.
Nobody reads a comment..
> + */
> +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy)
> +{
lockdep_assert_held(&policy->rwsem);
While everybody complains when there's a big nasty splat in their dmesg
;-)
> + mutex_lock(&cpufreq_fast_switch_lock);
> + if (policy->fast_switch_possible && cpufreq_fast_switch_count >= 0) {
> + cpufreq_fast_switch_count++;
> + policy->fast_switch_enabled = true;
> + } else {
> + pr_warn("cpufreq: CPU%u: Fast frequency switching not enabled\n",
> + policy->cpu);
> + }
> + mutex_unlock(&cpufreq_fast_switch_lock);
> +}
On Wed, Mar 16, 2016 at 03:52:28PM +0100, Rafael J. Wysocki wrote:
> +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy)
> +{
> + mutex_lock(&cpufreq_fast_switch_lock);
> + if (policy->fast_switch_possible && cpufreq_fast_switch_count >= 0) {
> + cpufreq_fast_switch_count++;
> + policy->fast_switch_enabled = true;
> + } else {
> + pr_warn("cpufreq: CPU%u: Fast frequency switching not enabled\n",
> + policy->cpu);
This happens because there's transition notifiers, right? Would it make
sense to iterate the notifier here and print the notifier function
symbol for each? That way we've got a clue as to where to start looking
when this happens.
> + }
> + mutex_unlock(&cpufreq_fast_switch_lock);
> +}
> @@ -1653,8 +1703,18 @@ int cpufreq_register_notifier(struct not
>
> switch (list) {
> case CPUFREQ_TRANSITION_NOTIFIER:
> + mutex_lock(&cpufreq_fast_switch_lock);
> +
> + if (cpufreq_fast_switch_count > 0) {
> + mutex_unlock(&cpufreq_fast_switch_lock);
So while theoretically (it has a return code)
cpufreq_register_notifier() could fail, it never actually did. Now we
do. Do we want to add a WARN here?
> + return -EPERM;
> + }
> ret = srcu_notifier_chain_register(
> &cpufreq_transition_notifier_list, nb);
> + if (!ret)
> + cpufreq_fast_switch_count--;
> +
> + mutex_unlock(&cpufreq_fast_switch_lock);
> break;
> case CPUFREQ_POLICY_NOTIFIER:
> ret = blocking_notifier_chain_register(
On Wed, Mar 16, 2016 at 4:27 PM, Peter Zijlstra <[email protected]> wrote:
>
>
> Could you please start a new thread for each posting? I only
> accidentally saw this.
I will in the future.
On Wed, Mar 16, 2016 at 4:43 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Mar 16, 2016 at 03:52:28PM +0100, Rafael J. Wysocki wrote:
>> +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy)
>> +{
>> + mutex_lock(&cpufreq_fast_switch_lock);
>> + if (policy->fast_switch_possible && cpufreq_fast_switch_count >= 0) {
>> + cpufreq_fast_switch_count++;
>> + policy->fast_switch_enabled = true;
>> + } else {
>> + pr_warn("cpufreq: CPU%u: Fast frequency switching not enabled\n",
>> + policy->cpu);
>
> This happens because there's transition notifiers, right? Would it make
> sense to iterate the notifier here and print the notifier function
> symbol for each? That way we've got a clue as to where to start looking
> when this happens.
OK
>> + }
>> + mutex_unlock(&cpufreq_fast_switch_lock);
>> +}
>
>> @@ -1653,8 +1703,18 @@ int cpufreq_register_notifier(struct not
>>
>> switch (list) {
>> case CPUFREQ_TRANSITION_NOTIFIER:
>> + mutex_lock(&cpufreq_fast_switch_lock);
>> +
>> + if (cpufreq_fast_switch_count > 0) {
>> + mutex_unlock(&cpufreq_fast_switch_lock);
>
> So while theoretically (it has a return code)
> cpufreq_register_notifier() could fail, it never actually did. Now we
> do. Do we want to add a WARN here?
Like if (WARN_ON(cpufreq_fast_switch_count > 0)) {
That can be done. :-)
On Wed, Mar 16, 2016 at 4:35 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Mar 16, 2016 at 03:52:28PM +0100, Rafael J. Wysocki wrote:
>> +/**
>> + * cpufreq_enable_fast_switch - Enable fast frequency switching for policy.
>> + * @policy: cpufreq policy to enable fast frequency switching for.
>> + *
>> + * Try to enable fast frequency switching for @policy.
>> + *
>> + * The attempt will fail if there is at least one transition notifier registered
>> + * at this point, as fast frequency switching is quite fundamentally at odds
>> + * with transition notifiers. Thus if successful, it will make registration of
>> + * transition notifiers fail going forward.
>> + *
>> + * Call under policy->rwsem.
>
> Nobody reads a comment..
>
>> + */
>> +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy)
>> +{
>
> lockdep_assert_held(&policy->rwsem);
>
> While everybody complains when there's a big nasty splat in their dmesg
> ;-)
OK
On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote:
> +static unsigned int get_next_freq(struct cpufreq_policy *policy,
> + unsigned long util, unsigned long max)
> +{
> + unsigned int freq = arch_scale_freq_invariant() ?
> + policy->cpuinfo.max_freq : policy->cur;
> +
> + return (freq + (freq >> 2)) * util / max;
> +}
> +
> +static void sugov_update_single(struct update_util_data *hook, u64 time,
> + unsigned long util, unsigned long max)
> +{
> + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
> + struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> + struct cpufreq_policy *policy = sg_policy->policy;
> + unsigned int next_f;
> +
> + if (!sugov_should_update_freq(sg_policy, time))
> + return;
> +
> + next_f = util <= max ?
> + get_next_freq(policy, util, max) : policy->cpuinfo.max_freq;
I'm not sure that is correct, would not something like this be more
accurate?
if (util > max)
util = max;
next_f = get_next_freq(policy, util, max);
After all, if we clip util we will still only increment to the next freq
with our multiplication factor.
Hmm, or was this meant to deal with the DL/RT stuff?
Would then not something like:
/* ULONG_MAX is used to force max_freq for Real-Time policies */
if (util == ULONG_MAX) {
next_f = policy->cpuinfo.max_freq;
} else {
if (util > max)
util = max;
next_f = get_next_freq(policy, util, max);
}
Be clearer?
> + sugov_update_commit(sg_policy, time, next_f);
> +}
On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote:
> + if ((s64)delta_ns > NSEC_PER_SEC / HZ)
That's TICK_NSEC
On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote:
> +static void sugov_work(struct work_struct *work)
> +{
> + struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
> +
> + mutex_lock(&sg_policy->work_lock);
> + __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
> + CPUFREQ_RELATION_L);
> + mutex_unlock(&sg_policy->work_lock);
> +
Be aware that the below store can creep up and become visible before the
unlock. AFAICT that doesn't really matter, but still.
> + sg_policy->work_in_progress = false;
> +}
> +
> +static void sugov_irq_work(struct irq_work *irq_work)
> +{
> + struct sugov_policy *sg_policy;
> +
> + sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
> + schedule_work(&sg_policy->work);
> +}
If you care what cpu the work runs on, you should schedule_work_on(),
regular schedule_work() can end up on any random cpu (although typically
it does not).
In particular schedule_work() -> queue_work() -> queue_work_on(.cpu =
WORK_CPU_UNBOUND) -> __queue_work() if (req_cpu == UNBOUND) cpu =
wq_select_unbound_cpu(), which has a Round-Robin 'feature' to detect
just such dependencies.
On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote:
> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
> + unsigned int next_freq)
> +{
> + struct cpufreq_policy *policy = sg_policy->policy;
> +
> + if (next_freq > policy->max)
> + next_freq = policy->max;
> + else if (next_freq < policy->min)
> + next_freq = policy->min;
I'm still very much undecided on these policy min/max thresholds. I
don't particularly like them.
On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote:
> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
> + unsigned int next_freq)
> +{
> + struct cpufreq_policy *policy = sg_policy->policy;
> +
> + if (next_freq > policy->max)
> + next_freq = policy->max;
> + else if (next_freq < policy->min)
> + next_freq = policy->min;
> +
> + sg_policy->last_freq_update_time = time;
> + if (sg_policy->next_freq == next_freq) {
> + if (policy->fast_switch_enabled)
> + trace_cpu_frequency(policy->cur, smp_processor_id());
> +
> + return;
> + }
> +
> + sg_policy->next_freq = next_freq;
> + if (policy->fast_switch_enabled) {
> + unsigned int freq;
> +
> + freq = cpufreq_driver_fast_switch(policy, next_freq);
So you're assuming a RELATION_L for ->fast_switch() ?
> + if (freq == CPUFREQ_ENTRY_INVALID)
> + return;
> +
> + policy->cur = freq;
> + trace_cpu_frequency(freq, smp_processor_id());
> + } else {
> + sg_policy->work_in_progress = true;
> + irq_work_queue(&sg_policy->irq_work);
> + }
> +}
> +static void sugov_work(struct work_struct *work)
> +{
> + struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
> +
> + mutex_lock(&sg_policy->work_lock);
> + __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
> + CPUFREQ_RELATION_L);
As per here, which I assume matches semantics on that point.
> + mutex_unlock(&sg_policy->work_lock);
> +
> + sg_policy->work_in_progress = false;
> +}
On Wednesday, March 16, 2016 06:36:46 PM Peter Zijlstra wrote:
> On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote:
> > + if ((s64)delta_ns > NSEC_PER_SEC / HZ)
>
> That's TICK_NSEC
OK (I didn't know we had a separate symbol for that)
On Wednesday, March 16, 2016 06:52:11 PM Peter Zijlstra wrote:
> On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote:
> > +static void sugov_work(struct work_struct *work)
> > +{
> > + struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
> > +
> > + mutex_lock(&sg_policy->work_lock);
> > + __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
> > + CPUFREQ_RELATION_L);
> > + mutex_unlock(&sg_policy->work_lock);
> > +
>
> Be aware that the below store can creep up and become visible before the
> unlock. AFAICT that doesn't really matter, but still.
It doesn't matter. :-)
Had it mattered, I would have used memory barriers.
> > + sg_policy->work_in_progress = false;
> > +}
> > +
> > +static void sugov_irq_work(struct irq_work *irq_work)
> > +{
> > + struct sugov_policy *sg_policy;
> > +
> > + sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
> > + schedule_work(&sg_policy->work);
> > +}
>
> If you care what cpu the work runs on, you should schedule_work_on(),
> regular schedule_work() can end up on any random cpu (although typically
> it does not).
I know, but I don't care too much.
"ondemand" and "conservative" use schedule_work() for the same thing, so
drivers need to cope with that if they need things to run on a particular
CPU.
That said I guess things would be a bit more efficient if the work was
scheduled on the same CPU that had queued up the irq_work. It also wouldn't
be too difficult to implement, so I'll make that change.
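For illustration (not in the posted patch), since the irq_work handler runs
on the CPU that queued it, pinning the work there could be as simple as:

	static void sugov_irq_work(struct irq_work *irq_work)
	{
		struct sugov_policy *sg_policy;

		sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
		schedule_work_on(smp_processor_id(), &sg_policy->work);
	}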
On Wednesday, March 16, 2016 07:14:20 PM Peter Zijlstra wrote:
> On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote:
> > +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
> > + unsigned int next_freq)
> > +{
> > + struct cpufreq_policy *policy = sg_policy->policy;
> > +
> > + if (next_freq > policy->max)
> > + next_freq = policy->max;
> > + else if (next_freq < policy->min)
> > + next_freq = policy->min;
> > +
> > + sg_policy->last_freq_update_time = time;
> > + if (sg_policy->next_freq == next_freq) {
> > + if (policy->fast_switch_enabled)
> > + trace_cpu_frequency(policy->cur, smp_processor_id());
> > +
> > + return;
> > + }
> > +
> > + sg_policy->next_freq = next_freq;
> > + if (policy->fast_switch_enabled) {
> > + unsigned int freq;
> > +
> > + freq = cpufreq_driver_fast_switch(policy, next_freq);
>
> So you're assuming a RELATION_L for ->fast_switch() ?
Yes, I am.
> > + if (freq == CPUFREQ_ENTRY_INVALID)
> > + return;
> > +
> > + policy->cur = freq;
> > + trace_cpu_frequency(freq, smp_processor_id());
> > + } else {
> > + sg_policy->work_in_progress = true;
> > + irq_work_queue(&sg_policy->irq_work);
> > + }
> > +}
>
>
> > +static void sugov_work(struct work_struct *work)
> > +{
> > + struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
> > +
> > + mutex_lock(&sg_policy->work_lock);
> > + __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
> > + CPUFREQ_RELATION_L);
>
> As per here, which I assume matches semantics on that point.
Correct.
On Wednesday, March 16, 2016 06:35:41 PM Peter Zijlstra wrote:
> On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote:
>
> > +static unsigned int get_next_freq(struct cpufreq_policy *policy,
> > + unsigned long util, unsigned long max)
> > +{
> > + unsigned int freq = arch_scale_freq_invariant() ?
> > + policy->cpuinfo.max_freq : policy->cur;
> > +
> > + return (freq + (freq >> 2)) * util / max;
> > +}
> > +
> > +static void sugov_update_single(struct update_util_data *hook, u64 time,
> > + unsigned long util, unsigned long max)
> > +{
> > + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
> > + struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> > + struct cpufreq_policy *policy = sg_policy->policy;
> > + unsigned int next_f;
> > +
> > + if (!sugov_should_update_freq(sg_policy, time))
> > + return;
> > +
> > + next_f = util <= max ?
> > + get_next_freq(policy, util, max) : policy->cpuinfo.max_freq;
>
> I'm not sure that is correct, would not something like this be more
> accurate?
>
> if (util > max)
> util = max;
> next_f = get_next_freq(policy, util, max);
>
> After all, if we clip util we will still only increment to the next freq
> with our multiplication factor.
>
> Hmm, or was this meant to deal with the DL/RT stuff?
Yes, it was.
> Would then not something like:
>
> /* ULONG_MAX is used to force max_freq for Real-Time policies */
> if (util == ULONG_MAX) {
> next_f = policy->cpuinfo.max_freq;
> } else {
> if (util > max)
That cannot happen given the way CFS deals with max before passing it
to cpufreq_update_util().
> util = max;
> next_f = get_next_freq(policy, util, max);
> }
>
> Be clearer?
>
> > + sugov_update_commit(sg_policy, time, next_f);
> > +}
So essentially I can replace the util > max check with the util == ULONG_MAX one
(here and in some other places) if that helps to understand the code, but
functionally that won't change anything.
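In code, that replacement in sugov_update_single() would read roughly:

	next_f = util == ULONG_MAX ? policy->cpuinfo.max_freq :
			get_next_freq(policy, util, max);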
On Wednesday, March 16, 2016 06:53:41 PM Peter Zijlstra wrote:
> On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote:
> > +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
> > + unsigned int next_freq)
> > +{
> > + struct cpufreq_policy *policy = sg_policy->policy;
> > +
> > + if (next_freq > policy->max)
> > + next_freq = policy->max;
> > + else if (next_freq < policy->min)
> > + next_freq = policy->min;
>
> I'm still very much undecided on these policy min/max thresholds. I
> don't particularly like them.
These are for consistency mostly.
It actually occurs to me that __cpufreq_driver_target() does that already
anyway, so they can be moved into the "fast switch" branch. Which means
that the code needs to be rearranged a bit here.
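A rough sketch of that rearrangement (just to illustrate the idea, not the
final code): the clamping would only be applied on the fast-switch path,
for instance in a small hypothetical helper like the one below, while the
work-item path keeps relying on __cpufreq_driver_target() to honor
policy->min/max:

	static void sugov_fast_switch(struct sugov_policy *sg_policy,
				      unsigned int next_freq)
	{
		struct cpufreq_policy *policy = sg_policy->policy;
		unsigned int freq;

		/* Clamp here only; the work-item path is clamped by the core. */
		if (next_freq > policy->max)
			next_freq = policy->max;
		else if (next_freq < policy->min)
			next_freq = policy->min;

		freq = cpufreq_driver_fast_switch(policy, next_freq);
		if (freq == CPUFREQ_ENTRY_INVALID)
			return;

		policy->cur = freq;
		trace_cpu_frequency(freq, smp_processor_id());
	}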
On Wed, Mar 16, 2016 at 10:38:14PM +0100, Rafael J. Wysocki wrote:
> > If you care what cpu the work runs on, you should schedule_work_on(),
> > regular schedule_work() can end up on any random cpu (although typically
> > it does not).
>
> I know, but I don't care too much.
>
> "ondemand" and "conservative" use schedule_work() for the same thing, so
> drivers need to cope with that if they need things to run on a particular
> CPU.
Or are just plain buggy -- like a lot of code that uses schedule_work()
for per-cpu thingies; that is, it's a fairly common bug and only recently
did we add that RR thing.
On Wed, Mar 16, 2016 at 10:38:55PM +0100, Rafael J. Wysocki wrote:
> On Wednesday, March 16, 2016 07:14:20 PM Peter Zijlstra wrote:
> > On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote:
> > > +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
> > > + unsigned int next_freq)
> > > +{
> > > + struct cpufreq_policy *policy = sg_policy->policy;
> > > +
> > > + if (next_freq > policy->max)
> > > + next_freq = policy->max;
> > > + else if (next_freq < policy->min)
> > > + next_freq = policy->min;
> > > +
> > > + sg_policy->last_freq_update_time = time;
> > > + if (sg_policy->next_freq == next_freq) {
> > > + if (policy->fast_switch_enabled)
> > > + trace_cpu_frequency(policy->cur, smp_processor_id());
> > > +
> > > + return;
> > > + }
> > > +
> > > + sg_policy->next_freq = next_freq;
> > > + if (policy->fast_switch_enabled) {
> > > + unsigned int freq;
> > > +
> > > + freq = cpufreq_driver_fast_switch(policy, next_freq);
> >
> > So you're assuming a RELATION_L for ->fast_switch() ?
>
> Yes, I am.
Should we document that fact somewhere? Or alternatively, if you already
did, I simply missed it.
On Wednesday, March 16, 2016 11:40:54 PM Peter Zijlstra wrote:
> On Wed, Mar 16, 2016 at 10:38:55PM +0100, Rafael J. Wysocki wrote:
> > On Wednesday, March 16, 2016 07:14:20 PM Peter Zijlstra wrote:
> > > On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote:
> > > > +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
> > > > + unsigned int next_freq)
> > > > +{
> > > > + struct cpufreq_policy *policy = sg_policy->policy;
> > > > +
> > > > + if (next_freq > policy->max)
> > > > + next_freq = policy->max;
> > > > + else if (next_freq < policy->min)
> > > > + next_freq = policy->min;
> > > > +
> > > > + sg_policy->last_freq_update_time = time;
> > > > + if (sg_policy->next_freq == next_freq) {
> > > > + if (policy->fast_switch_enabled)
> > > > + trace_cpu_frequency(policy->cur, smp_processor_id());
> > > > +
> > > > + return;
> > > > + }
> > > > +
> > > > + sg_policy->next_freq = next_freq;
> > > > + if (policy->fast_switch_enabled) {
> > > > + unsigned int freq;
> > > > +
> > > > + freq = cpufreq_driver_fast_switch(policy, next_freq);
> > >
> > > So you're assuming a RELATION_L for ->fast_switch() ?
> >
> > Yes, I am.
>
> Should we document that fact somewhere? Or alternatively, if you already
> did, I simply missed it.
I thought I did, but clearly that's not the case (I think I wrote about that
in a changelog comment somewhere).
I'll document it in the kerneldoc for cpufreq_driver_fast_switch() (patch [6/7]).
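For reference, the note in question might read along these lines (the
wording is a guess, not the final kerneldoc):

	/**
	 * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch.
	 * @policy: cpufreq policy to switch the frequency for.
	 * @target_freq: New frequency to set.
	 *
	 * The target frequency is interpreted with CPUFREQ_RELATION_L semantics,
	 * so the driver should switch to the lowest supported frequency at or
	 * above @target_freq and return the frequency actually set (or
	 * CPUFREQ_ENTRY_INVALID on errors).
	 */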
From: Rafael J. Wysocki <[email protected]>
Subject: [PATCH] cpufreq: Support for fast frequency switching
Modify the ACPI cpufreq driver to provide a method for switching
CPU frequencies from interrupt context and update the cpufreq core
to support that method if available.
Introduce a new cpufreq driver callback, ->fast_switch, to be
invoked for frequency switching from interrupt context by (future)
governors supporting that feature via (new) helper function
cpufreq_driver_fast_switch().
Add two new policy flags: fast_switch_possible, to be set by the
cpufreq driver if fast frequency switching can be used for the
given policy, and fast_switch_enabled, to be set by the governor
if it is going to use fast frequency switching for the given
policy. Also add a helper for setting the latter.
Since fast frequency switching is inherently incompatible with
cpufreq transition notifiers, make it possible to set the
fast_switch_enabled flag only if there are no transition notifiers
already registered, and make the registration of new transition
notifiers fail if fast_switch_enabled is set for at least one
policy.
Implement the ->fast_switch callback in the ACPI cpufreq driver
and make it set fast_switch_possible during policy initialization
as appropriate.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
Addressing comments from Peter.
Changes from v4:
- If cpufreq_enable_fast_switch() is about to fail, it will print the list
of currently registered transition notifiers.
- Added lockdep_assert_held(&policy->rwsem) to cpufreq_enable_fast_switch().
- Added WARN_ON() to the (cpufreq_fast_switch_count > 0) check in
cpufreq_register_notifier().
- Modified the kerneldoc comment of cpufreq_driver_fast_switch() to
mention the RELATION_L expectation regarding the ->fast_switch callback.
Changes from v3:
- New fast_switch_enabled field in struct cpufreq_policy to help
avoid affecting existing setups by setting the fast_switch_possible
flag in the driver.
- __cpufreq_get() skips the policy->cur check if fast_switch_enabled is set.
Changes from v2:
- The driver ->fast_switch callback and cpufreq_driver_fast_switch()
don't need the relation argument as they will always do RELATION_L now.
- New mechanism to make fast switch and cpufreq notifiers mutually
exclusive.
- cpufreq_driver_fast_switch() doesn't do anything in addition to
invoking the driver callback and returns its return value.
---
drivers/cpufreq/acpi-cpufreq.c | 41 +++++++++++++
drivers/cpufreq/cpufreq.c | 127 ++++++++++++++++++++++++++++++++++++++---
include/linux/cpufreq.h | 9 ++
3 files changed, 168 insertions(+), 9 deletions(-)
Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
+++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
@@ -458,6 +458,43 @@ static int acpi_cpufreq_target(struct cp
return result;
}
+unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq)
+{
+ struct acpi_cpufreq_data *data = policy->driver_data;
+ struct acpi_processor_performance *perf;
+ struct cpufreq_frequency_table *entry;
+ unsigned int next_perf_state, next_freq, freq;
+
+ /*
+ * Find the closest frequency above target_freq.
+ *
+ * The table is sorted in the reverse order with respect to the
+ * frequency and all of the entries are valid (see the initialization).
+ */
+ entry = data->freq_table;
+ do {
+ entry++;
+ freq = entry->frequency;
+ } while (freq >= target_freq && freq != CPUFREQ_TABLE_END);
+ entry--;
+ next_freq = entry->frequency;
+ next_perf_state = entry->driver_data;
+
+ perf = to_perf_data(data);
+ if (perf->state == next_perf_state) {
+ if (unlikely(data->resume))
+ data->resume = 0;
+ else
+ return next_freq;
+ }
+
+ data->cpu_freq_write(&perf->control_register,
+ perf->states[next_perf_state].control);
+ perf->state = next_perf_state;
+ return next_freq;
+}
+
static unsigned long
acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu)
{
@@ -740,6 +777,9 @@ static int acpi_cpufreq_cpu_init(struct
goto err_unreg;
}
+ policy->fast_switch_possible = !acpi_pstate_strict &&
+ !(policy_is_shared(policy) && policy->shared_type != CPUFREQ_SHARED_TYPE_ANY);
+
data->freq_table = kzalloc(sizeof(*data->freq_table) *
(perf->state_count+1), GFP_KERNEL);
if (!data->freq_table) {
@@ -874,6 +914,7 @@ static struct freq_attr *acpi_cpufreq_at
static struct cpufreq_driver acpi_cpufreq_driver = {
.verify = cpufreq_generic_frequency_table_verify,
.target_index = acpi_cpufreq_target,
+ .fast_switch = acpi_cpufreq_fast_switch,
.bios_limit = acpi_processor_get_bios_limit,
.init = acpi_cpufreq_cpu_init,
.exit = acpi_cpufreq_cpu_exit,
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -102,6 +102,10 @@ struct cpufreq_policy {
*/
struct rw_semaphore rwsem;
+ /* Fast switch flags */
+ bool fast_switch_possible; /* Set by the driver. */
+ bool fast_switch_enabled;
+
/* Synchronization for frequency transitions */
bool transition_ongoing; /* Tracks transition status */
spinlock_t transition_lock;
@@ -156,6 +160,7 @@ int cpufreq_get_policy(struct cpufreq_po
int cpufreq_update_policy(unsigned int cpu);
bool have_governor_per_policy(void);
struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
+void cpufreq_enable_fast_switch(struct cpufreq_policy *policy);
#else
static inline unsigned int cpufreq_get(unsigned int cpu)
{
@@ -236,6 +241,8 @@ struct cpufreq_driver {
unsigned int relation); /* Deprecated */
int (*target_index)(struct cpufreq_policy *policy,
unsigned int index);
+ unsigned int (*fast_switch)(struct cpufreq_policy *policy,
+ unsigned int target_freq);
/*
* Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION
* unset.
@@ -464,6 +471,8 @@ struct cpufreq_governor {
};
/* Pass a target to the cpufreq driver */
+unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq);
int cpufreq_driver_target(struct cpufreq_policy *policy,
unsigned int target_freq,
unsigned int relation);
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -428,6 +428,54 @@ void cpufreq_freq_transition_end(struct
}
EXPORT_SYMBOL_GPL(cpufreq_freq_transition_end);
+/*
+ * Fast frequency switching status count. Positive means "enabled", negative
+ * means "disabled" and 0 means "not decided yet".
+ */
+static int cpufreq_fast_switch_count;
+static DEFINE_MUTEX(cpufreq_fast_switch_lock);
+
+static void cpufreq_list_transition_notifiers(void)
+{
+ struct notifier_block *nb;
+
+ pr_info("cpufreq: Registered transition notifiers:\n");
+
+ mutex_lock(&cpufreq_transition_notifier_list.mutex);
+
+ for (nb = cpufreq_transition_notifier_list.head; nb; nb = nb->next)
+ pr_info("cpufreq: %pF\n", nb->notifier_call);
+
+ mutex_unlock(&cpufreq_transition_notifier_list.mutex);
+}
+
+/**
+ * cpufreq_enable_fast_switch - Enable fast frequency switching for policy.
+ * @policy: cpufreq policy to enable fast frequency switching for.
+ *
+ * Try to enable fast frequency switching for @policy.
+ *
+ * The attempt will fail if there is at least one transition notifier registered
+ * at this point, as fast frequency switching is quite fundamentally at odds
+ * with transition notifiers. Thus if successful, it will make registration of
+ * transition notifiers fail going forward.
+ */
+void cpufreq_enable_fast_switch(struct cpufreq_policy *policy)
+{
+ lockdep_assert_held(&policy->rwsem);
+
+ mutex_lock(&cpufreq_fast_switch_lock);
+ if (policy->fast_switch_possible && cpufreq_fast_switch_count >= 0) {
+ cpufreq_fast_switch_count++;
+ policy->fast_switch_enabled = true;
+ } else {
+ pr_warn("cpufreq: CPU%u: Fast freqnency switching not enabled\n",
+ policy->cpu);
+ cpufreq_list_transition_notifiers();
+ }
+ mutex_unlock(&cpufreq_fast_switch_lock);
+}
+EXPORT_SYMBOL_GPL(cpufreq_enable_fast_switch);
/*********************************************************************
* SYSFS INTERFACE *
@@ -1083,6 +1131,24 @@ static void cpufreq_policy_free(struct c
kfree(policy);
}
+static void cpufreq_driver_exit_policy(struct cpufreq_policy *policy)
+{
+ if (policy->fast_switch_enabled) {
+ mutex_lock(&cpufreq_fast_switch_lock);
+
+ policy->fast_switch_enabled = false;
+ if (!WARN_ON(cpufreq_fast_switch_count <= 0))
+ cpufreq_fast_switch_count--;
+
+ mutex_unlock(&cpufreq_fast_switch_lock);
+ }
+
+ if (cpufreq_driver->exit) {
+ cpufreq_driver->exit(policy);
+ policy->freq_table = NULL;
+ }
+}
+
static int cpufreq_online(unsigned int cpu)
{
struct cpufreq_policy *policy;
@@ -1236,8 +1302,7 @@ static int cpufreq_online(unsigned int c
out_exit_policy:
up_write(&policy->rwsem);
- if (cpufreq_driver->exit)
- cpufreq_driver->exit(policy);
+ cpufreq_driver_exit_policy(policy);
out_free_policy:
cpufreq_policy_free(policy, !new_policy);
return ret;
@@ -1334,10 +1399,7 @@ static void cpufreq_offline(unsigned int
* since this is a core component, and is essential for the
* subsequent light-weight ->init() to succeed.
*/
- if (cpufreq_driver->exit) {
- cpufreq_driver->exit(policy);
- policy->freq_table = NULL;
- }
+ cpufreq_driver_exit_policy(policy);
unlock:
up_write(&policy->rwsem);
@@ -1444,8 +1506,12 @@ static unsigned int __cpufreq_get(struct
ret_freq = cpufreq_driver->get(policy->cpu);
- /* Updating inactive policies is invalid, so avoid doing that. */
- if (unlikely(policy_is_inactive(policy)))
+ /*
+ * Updating inactive policies is invalid, so avoid doing that. Also
+ * if fast frequency switching is used with the given policy, the check
+ * against policy->cur is pointless, so skip it in that case too.
+ */
+ if (unlikely(policy_is_inactive(policy)) || policy->fast_switch_enabled)
return ret_freq;
if (ret_freq && policy->cur &&
@@ -1457,7 +1523,6 @@ static unsigned int __cpufreq_get(struct
schedule_work(&policy->update);
}
}
-
return ret_freq;
}
@@ -1653,8 +1718,18 @@ int cpufreq_register_notifier(struct not
switch (list) {
case CPUFREQ_TRANSITION_NOTIFIER:
+ mutex_lock(&cpufreq_fast_switch_lock);
+
+ if (WARN_ON(cpufreq_fast_switch_count > 0)) {
+ mutex_unlock(&cpufreq_fast_switch_lock);
+ return -EPERM;
+ }
ret = srcu_notifier_chain_register(
&cpufreq_transition_notifier_list, nb);
+ if (!ret)
+ cpufreq_fast_switch_count--;
+
+ mutex_unlock(&cpufreq_fast_switch_lock);
break;
case CPUFREQ_POLICY_NOTIFIER:
ret = blocking_notifier_chain_register(
@@ -1687,8 +1762,14 @@ int cpufreq_unregister_notifier(struct n
switch (list) {
case CPUFREQ_TRANSITION_NOTIFIER:
+ mutex_lock(&cpufreq_fast_switch_lock);
+
ret = srcu_notifier_chain_unregister(
&cpufreq_transition_notifier_list, nb);
+ if (!ret && !WARN_ON(cpufreq_fast_switch_count >= 0))
+ cpufreq_fast_switch_count++;
+
+ mutex_unlock(&cpufreq_fast_switch_lock);
break;
case CPUFREQ_POLICY_NOTIFIER:
ret = blocking_notifier_chain_unregister(
@@ -1707,6 +1788,34 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie
* GOVERNORS *
*********************************************************************/
+/**
+ * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch.
+ * @policy: cpufreq policy to switch the frequency for.
+ * @target_freq: New frequency to set (may be approximate).
+ *
+ * Carry out a fast frequency switch from interrupt context.
+ *
+ * The driver's ->fast_switch() callback invoked by this function is expected to
+ * select the minimum available frequency greater than or equal to @target_freq
+ * (CPUFREQ_RELATION_L).
+ *
+ * This function must not be called if policy->fast_switch_enabled is unset.
+ *
+ * Governors calling this function must guarantee that it will never be invoked
+ * twice in parallel for the same policy and that it will never be called in
+ * parallel with either ->target() or ->target_index() for the same policy.
+ *
+ * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch()
+ * callback to indicate an error condition, the hardware configuration must be
+ * preserved.
+ */
+unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq)
+{
+ return cpufreq_driver->fast_switch(policy, target_freq);
+}
+EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch);
+
/* Must set freqs->new to intermediate frequency */
static int __target_intermediate(struct cpufreq_policy *policy,
struct cpufreq_freqs *freqs, int index)
From: Rafael J. Wysocki <[email protected]>
Add a new cpufreq scaling governor, called "schedutil", that uses
scheduler-provided CPU utilization information as input for making
its decisions.
Doing that is possible after commit 34e2c555f3e1 (cpufreq: Add
mechanism for registering utilization update callbacks) that
introduced cpufreq_update_util() called by the scheduler on
utilization changes (from CFS) and RT/DL task status updates.
In particular, CPU frequency scaling decisions may be based on
the utilization data passed to cpufreq_update_util() by CFS.
The new governor is relatively simple.
The frequency selection formula used by it depends on whether or not
the utilization is frequency-invariant. In the frequency-invariant
case the new CPU frequency is given by
next_freq = 1.25 * max_freq * util / max
where util and max are the last two arguments of cpufreq_update_util().
In turn, if util is not frequency-invariant, the maximum frequency in
the above formula is replaced with the current frequency of the CPU:
next_freq = 1.25 * curr_freq * util / max
The coefficient 1.25 corresponds to the frequency tipping point at
(util / max) = 0.8.
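As a worked example of the formula above (illustration only): with
frequency-invariant utilization, util = 512 and max = 1024 on a CPU whose
max_freq is 2000000 kHz give next_freq = 1.25 * 2000000 * 512 / 1024 =
1250000 kHz, and at util / max = 0.8 the result is exactly max_freq; the
value is subsequently clamped to the policy's min/max limits.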
All of the computations are carried out in the utilization update
handlers provided by the new governor. One of those handlers is
used for cpufreq policies shared between multiple CPUs and the other
one is for policies with one CPU only (and therefore it doesn't need
to use any extra synchronization means).
The governor supports fast frequency switching if that is supported
by the cpufreq driver in use and possible for the given policy.
In the fast switching case, all operations of the governor take
place in its utilization update handlers. If fast switching cannot
be used, the frequency switch operations are carried out with the
help of a work item which only calls __cpufreq_driver_target()
(under a mutex) to trigger a frequency update (to a value already
computed beforehand in one of the utilization update handlers).
Currently, the governor treats all of the RT and DL tasks as
"unknown utilization" and sets the frequency to the allowed
maximum when updated from the RT or DL sched classes. That
heavy-handed approach should be replaced with something more
subtle and specifically targeted at RT and DL tasks.
The governor shares some tunables management code with the
"ondemand" and "conservative" governors and uses some common
definitions from cpufreq_governor.h, but apart from that it
is stand-alone.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
Addressing comments from Peter.
Changes from v4:
- Use TICK_NSEC in sugov_next_freq_shared().
- Use schedule_work_on() to schedule work items and replace
work_in_progress with work_cpu (which is used both for scheduling
work items and as a "work in progress" marker).
- Rearrange sugov_update_commit() to only check policy->min/max if
fast switching is enabled.
- Replace util > max checks with util == ULONG_MAX checks to make
it clear that they are about a special case (RT/DL).
Changes from v3:
- The "next frequency" formula based on
http://marc.info/?l=linux-acpi&m=145756618321500&w=4 and
http://marc.info/?l=linux-kernel&m=145760739700716&w=4
- The governor goes into kernel/sched/ (again).
Changes from v2:
- The governor goes into drivers/cpufreq/.
- The "next frequency" formula has an additional 1.1 factor to allow
more util/max values to map onto the top-most frequency in case the
distance between that and the previous one is disproportionately small.
- sugov_update_commit() traces CPU frequency even if the new one is
the same as the previous one (otherwise, if the system is 100% loaded
for long enough, powertop starts to report that all CPUs are 100% idle).
---
drivers/cpufreq/Kconfig | 26 +
kernel/sched/Makefile | 1
kernel/sched/cpufreq_schedutil.c | 527 +++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 8
4 files changed, 562 insertions(+)
Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -107,6 +107,16 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
Be aware that not all cpufreq drivers support the conservative
governor. If unsure have a look at the help section of the
driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+ bool "schedutil"
+ select CPU_FREQ_GOV_SCHEDUTIL
+ select CPU_FREQ_GOV_PERFORMANCE
+ help
+ Use the 'schedutil' CPUFreq governor by default. If unsure,
+ have a look at the help section of that governor. The fallback
+ governor will be 'performance'.
+
endchoice
config CPU_FREQ_GOV_PERFORMANCE
@@ -188,6 +198,22 @@ config CPU_FREQ_GOV_CONSERVATIVE
If in doubt, say N.
+config CPU_FREQ_GOV_SCHEDUTIL
+ tristate "'schedutil' cpufreq policy governor"
+ depends on CPU_FREQ
+ select CPU_FREQ_GOV_ATTR_SET
+ select IRQ_WORK
+ help
+ The frequency selection formula used by this governor is analogous
+ to the one used by 'ondemand', but instead of computing CPU load
+ as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
+ utilization data provided by the scheduler as input.
+
+ To compile this driver as a module, choose M here: the
+ module will be called cpufreq_schedutil.
+
+ If in doubt, say N.
+
comment "CPU frequency scaling drivers"
config CPUFREQ_DT
Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- /dev/null
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -0,0 +1,527 @@
+/*
+ * CPUFreq governor based on scheduler-provided CPU utilization data.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <trace/events/power.h>
+
+#include "sched.h"
+
+struct sugov_tunables {
+ struct gov_attr_set attr_set;
+ unsigned int rate_limit_us;
+};
+
+struct sugov_policy {
+ struct cpufreq_policy *policy;
+
+ struct sugov_tunables *tunables;
+ struct list_head tunables_hook;
+
+ raw_spinlock_t update_lock; /* For shared policies */
+ u64 last_freq_update_time;
+ s64 freq_update_delay_ns;
+ unsigned int next_freq;
+
+ /* The next fields are only needed if fast switch cannot be used. */
+ struct irq_work irq_work;
+ struct work_struct work;
+ struct mutex work_lock;
+ unsigned int work_cpu;
+
+ bool need_freq_update;
+};
+
+struct sugov_cpu {
+ struct update_util_data update_util;
+ struct sugov_policy *sg_policy;
+
+ /* The fields below are only needed when sharing a policy. */
+ unsigned long util;
+ unsigned long max;
+ u64 last_update;
+};
+
+static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
+
+/************************ Governor internals ***********************/
+
+static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
+{
+ u64 delta_ns;
+
+ if (sg_policy->work_cpu != UINT_MAX)
+ return false;
+
+ if (unlikely(sg_policy->need_freq_update)) {
+ sg_policy->need_freq_update = false;
+ return true;
+ }
+
+ delta_ns = time - sg_policy->last_freq_update_time;
+ return (s64)delta_ns >= sg_policy->freq_update_delay_ns;
+}
+
+static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
+ unsigned int next_freq)
+{
+ struct cpufreq_policy *policy = sg_policy->policy;
+
+ sg_policy->last_freq_update_time = time;
+
+ if (policy->fast_switch_enabled) {
+ if (next_freq > policy->max)
+ next_freq = policy->max;
+ else if (next_freq < policy->min)
+ next_freq = policy->min;
+
+ if (sg_policy->next_freq == next_freq) {
+ trace_cpu_frequency(policy->cur, smp_processor_id());
+ return;
+ }
+ sg_policy->next_freq = next_freq;
+ next_freq = cpufreq_driver_fast_switch(policy, next_freq);
+ if (next_freq == CPUFREQ_ENTRY_INVALID)
+ return;
+
+ policy->cur = next_freq;
+ trace_cpu_frequency(next_freq, smp_processor_id());
+ } else if (sg_policy->next_freq != next_freq) {
+ sg_policy->work_cpu = smp_processor_id();
+ irq_work_queue(&sg_policy->irq_work);
+ }
+}
+
+/**
+ * get_next_freq - Compute a new frequency for a given cpufreq policy.
+ * @policy: cpufreq policy object to compute the new frequency for.
+ * @util: Current CPU utilization.
+ * @max: CPU capacity.
+ *
+ * If the utilization is frequency-invariant, choose the new frequency to be
+ * proportional to it, that is
+ *
+ * next_freq = C * max_freq * util / max
+ *
+ * Otherwise, approximate the would-be frequency-invariant utilization by
+ * util_raw * (curr_freq / max_freq) which leads to
+ *
+ * next_freq = C * curr_freq * util_raw / max
+ *
+ * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8.
+ */
+static unsigned int get_next_freq(struct cpufreq_policy *policy,
+ unsigned long util, unsigned long max)
+{
+ unsigned int freq = arch_scale_freq_invariant() ?
+ policy->cpuinfo.max_freq : policy->cur;
+
+ return (freq + (freq >> 2)) * util / max;
+}
+
+static void sugov_update_single(struct update_util_data *hook, u64 time,
+ unsigned long util, unsigned long max)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ struct cpufreq_policy *policy = sg_policy->policy;
+ unsigned int next_f;
+
+ if (!sugov_should_update_freq(sg_policy, time))
+ return;
+
+ next_f = util == ULONG_MAX ? policy->cpuinfo.max_freq :
+ get_next_freq(policy, util, max);
+ sugov_update_commit(sg_policy, time, next_f);
+}
+
+static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
+ unsigned long util, unsigned long max)
+{
+ struct cpufreq_policy *policy = sg_policy->policy;
+ unsigned int max_f = policy->cpuinfo.max_freq;
+ u64 last_freq_update_time = sg_policy->last_freq_update_time;
+ unsigned int j;
+
+ if (util == ULONG_MAX)
+ return max_f;
+
+ for_each_cpu(j, policy->cpus) {
+ struct sugov_cpu *j_sg_cpu;
+ unsigned long j_util, j_max;
+ u64 delta_ns;
+
+ if (j == smp_processor_id())
+ continue;
+
+ j_sg_cpu = &per_cpu(sugov_cpu, j);
+ /*
+ * If the CPU utilization was last updated before the previous
+ * frequency update and the time elapsed between the last update
+ * of the CPU utilization and the last frequency update is long
+ * enough, don't take the CPU into account as it probably is
+ * idle now.
+ */
+ delta_ns = last_freq_update_time - j_sg_cpu->last_update;
+ if ((s64)delta_ns > TICK_NSEC)
+ continue;
+
+ j_util = j_sg_cpu->util;
+ if (j_util == ULONG_MAX)
+ return max_f;
+
+ j_max = j_sg_cpu->max;
+ if (j_util * max > j_max * util) {
+ util = j_util;
+ max = j_max;
+ }
+ }
+
+ return get_next_freq(policy, util, max);
+}
+
+static void sugov_update_shared(struct update_util_data *hook, u64 time,
+ unsigned long util, unsigned long max)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ unsigned int next_f;
+
+ raw_spin_lock(&sg_policy->update_lock);
+
+ sg_cpu->util = util;
+ sg_cpu->max = max;
+ sg_cpu->last_update = time;
+
+ if (sugov_should_update_freq(sg_policy, time)) {
+ next_f = sugov_next_freq_shared(sg_policy, util, max);
+ sugov_update_commit(sg_policy, time, next_f);
+ }
+
+ raw_spin_unlock(&sg_policy->update_lock);
+}
+
+static void sugov_work(struct work_struct *work)
+{
+ struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
+
+ mutex_lock(&sg_policy->work_lock);
+ __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
+ CPUFREQ_RELATION_L);
+ mutex_unlock(&sg_policy->work_lock);
+
+ sg_policy->work_cpu = UINT_MAX;
+}
+
+static void sugov_irq_work(struct irq_work *irq_work)
+{
+ struct sugov_policy *sg_policy;
+
+ sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
+ schedule_work_on(sg_policy->work_cpu, &sg_policy->work);
+}
+
+/************************** sysfs interface ************************/
+
+static struct sugov_tunables *global_tunables;
+static DEFINE_MUTEX(global_tunables_lock);
+
+static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set)
+{
+ return container_of(attr_set, struct sugov_tunables, attr_set);
+}
+
+static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+
+ return sprintf(buf, "%u\n", tunables->rate_limit_us);
+}
+
+static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf,
+ size_t count)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+ struct sugov_policy *sg_policy;
+ unsigned int rate_limit_us;
+ int ret;
+
+ ret = sscanf(buf, "%u", &rate_limit_us);
+ if (ret != 1)
+ return -EINVAL;
+
+ tunables->rate_limit_us = rate_limit_us;
+
+ list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
+ sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
+
+ return count;
+}
+
+static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
+
+static struct attribute *sugov_attributes[] = {
+ &rate_limit_us.attr,
+ NULL
+};
+
+static struct kobj_type sugov_tunables_ktype = {
+ .default_attrs = sugov_attributes,
+ .sysfs_ops = &governor_sysfs_ops,
+};
+
+/********************** cpufreq governor interface *********************/
+
+static struct cpufreq_governor schedutil_gov;
+
+static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy;
+
+ sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
+ if (!sg_policy)
+ return NULL;
+
+ sg_policy->policy = policy;
+ init_irq_work(&sg_policy->irq_work, sugov_irq_work);
+ INIT_WORK(&sg_policy->work, sugov_work);
+ mutex_init(&sg_policy->work_lock);
+ raw_spin_lock_init(&sg_policy->update_lock);
+ return sg_policy;
+}
+
+static void sugov_policy_free(struct sugov_policy *sg_policy)
+{
+ mutex_destroy(&sg_policy->work_lock);
+ kfree(sg_policy);
+}
+
+static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy)
+{
+ struct sugov_tunables *tunables;
+
+ tunables = kzalloc(sizeof(*tunables), GFP_KERNEL);
+ if (tunables)
+ gov_attr_set_init(&tunables->attr_set, &sg_policy->tunables_hook);
+
+ return tunables;
+}
+
+static void sugov_tunables_free(struct sugov_tunables *tunables)
+{
+ if (!have_governor_per_policy())
+ global_tunables = NULL;
+
+ kfree(tunables);
+}
+
+static int sugov_init(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy;
+ struct sugov_tunables *tunables;
+ unsigned int lat;
+ int ret = 0;
+
+ /* State should be equivalent to EXIT */
+ if (policy->governor_data)
+ return -EBUSY;
+
+ sg_policy = sugov_policy_alloc(policy);
+ if (!sg_policy)
+ return -ENOMEM;
+
+ mutex_lock(&global_tunables_lock);
+
+ if (global_tunables) {
+ if (WARN_ON(have_governor_per_policy())) {
+ ret = -EINVAL;
+ goto free_sg_policy;
+ }
+ policy->governor_data = sg_policy;
+ sg_policy->tunables = global_tunables;
+
+ gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook);
+ goto out;
+ }
+
+ tunables = sugov_tunables_alloc(sg_policy);
+ if (!tunables) {
+ ret = -ENOMEM;
+ goto free_sg_policy;
+ }
+
+ tunables->rate_limit_us = LATENCY_MULTIPLIER;
+ lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
+ if (lat)
+ tunables->rate_limit_us *= lat;
+
+ if (!have_governor_per_policy())
+ global_tunables = tunables;
+
+ policy->governor_data = sg_policy;
+ sg_policy->tunables = tunables;
+
+ ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
+ get_governor_parent_kobj(policy), "%s",
+ schedutil_gov.name);
+ if (!ret)
+ goto out;
+
+ /* Failure, so roll back. */
+ policy->governor_data = NULL;
+ sugov_tunables_free(tunables);
+
+ free_sg_policy:
+ pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
+ sugov_policy_free(sg_policy);
+
+ out:
+ mutex_unlock(&global_tunables_lock);
+ return ret;
+}
+
+static int sugov_exit(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ struct sugov_tunables *tunables = sg_policy->tunables;
+ unsigned int count;
+
+ mutex_lock(&global_tunables_lock);
+
+ count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook);
+ policy->governor_data = NULL;
+ if (!count)
+ sugov_tunables_free(tunables);
+
+ mutex_unlock(&global_tunables_lock);
+
+ sugov_policy_free(sg_policy);
+ return 0;
+}
+
+static int sugov_start(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ unsigned int cpu;
+
+ cpufreq_enable_fast_switch(policy);
+
+ sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+ sg_policy->last_freq_update_time = 0;
+ sg_policy->next_freq = UINT_MAX;
+ sg_policy->work_cpu = UINT_MAX;
+ sg_policy->need_freq_update = false;
+
+ for_each_cpu(cpu, policy->cpus) {
+ struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
+
+ sg_cpu->sg_policy = sg_policy;
+ if (policy_is_shared(policy)) {
+ sg_cpu->util = ULONG_MAX;
+ sg_cpu->max = 0;
+ sg_cpu->last_update = 0;
+ cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
+ sugov_update_shared);
+ } else {
+ cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
+ sugov_update_single);
+ }
+ }
+ return 0;
+}
+
+static int sugov_stop(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ unsigned int cpu;
+
+ for_each_cpu(cpu, policy->cpus)
+ cpufreq_remove_update_util_hook(cpu);
+
+ synchronize_sched();
+
+ irq_work_sync(&sg_policy->irq_work);
+ cancel_work_sync(&sg_policy->work);
+ return 0;
+}
+
+static int sugov_limits(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+
+ if (!policy->fast_switch_enabled) {
+ mutex_lock(&sg_policy->work_lock);
+
+ if (policy->max < policy->cur)
+ __cpufreq_driver_target(policy, policy->max,
+ CPUFREQ_RELATION_H);
+ else if (policy->min > policy->cur)
+ __cpufreq_driver_target(policy, policy->min,
+ CPUFREQ_RELATION_L);
+
+ mutex_unlock(&sg_policy->work_lock);
+ }
+
+ sg_policy->need_freq_update = true;
+ return 0;
+}
+
+int sugov_governor(struct cpufreq_policy *policy, unsigned int event)
+{
+ if (event == CPUFREQ_GOV_POLICY_INIT) {
+ return sugov_init(policy);
+ } else if (policy->governor_data) {
+ switch (event) {
+ case CPUFREQ_GOV_POLICY_EXIT:
+ return sugov_exit(policy);
+ case CPUFREQ_GOV_START:
+ return sugov_start(policy);
+ case CPUFREQ_GOV_STOP:
+ return sugov_stop(policy);
+ case CPUFREQ_GOV_LIMITS:
+ return sugov_limits(policy);
+ }
+ }
+ return -EINVAL;
+}
+
+static struct cpufreq_governor schedutil_gov = {
+ .name = "schedutil",
+ .governor = sugov_governor,
+ .owner = THIS_MODULE,
+};
+
+static int __init sugov_module_init(void)
+{
+ return cpufreq_register_governor(&schedutil_gov);
+}
+
+static void __exit sugov_module_exit(void)
+{
+ cpufreq_unregister_governor(&schedutil_gov);
+}
+
+MODULE_AUTHOR("Rafael J. Wysocki <[email protected]>");
+MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
+MODULE_LICENSE("GPL");
+
+#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+struct cpufreq_governor *cpufreq_default_governor(void)
+{
+ return &schedutil_gov;
+}
+
+fs_initcall(sugov_module_init);
+#else
+module_init(sugov_module_init);
+#endif
+module_exit(sugov_module_exit);
Index: linux-pm/kernel/sched/Makefile
===================================================================
--- linux-pm.orig/kernel/sched/Makefile
+++ linux-pm/kernel/sched/Makefile
@@ -20,3 +20,4 @@ obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
Index: linux-pm/kernel/sched/sched.h
===================================================================
--- linux-pm.orig/kernel/sched/sched.h
+++ linux-pm/kernel/sched/sched.h
@@ -1786,3 +1786,11 @@ static inline void cpufreq_trigger_updat
static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) {}
static inline void cpufreq_trigger_update(u64 time) {}
#endif /* CONFIG_CPU_FREQ */
+
+#ifdef arch_scale_freq_capacity
+#ifndef arch_scale_freq_invariant
+#define arch_scale_freq_invariant() (true)
+#endif
+#else /* arch_scale_freq_capacity */
+#define arch_scale_freq_invariant() (false)
+#endif
Hi Rafael,
On 17/03/16 01:01, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
[...]
> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
> + unsigned int next_freq)
> +{
> + struct cpufreq_policy *policy = sg_policy->policy;
> +
> + sg_policy->last_freq_update_time = time;
> +
> + if (policy->fast_switch_enabled) {
> + if (next_freq > policy->max)
> + next_freq = policy->max;
> + else if (next_freq < policy->min)
> + next_freq = policy->min;
> +
> + if (sg_policy->next_freq == next_freq) {
> + trace_cpu_frequency(policy->cur, smp_processor_id());
> + return;
> + }
> + sg_policy->next_freq = next_freq;
> + next_freq = cpufreq_driver_fast_switch(policy, next_freq);
> + if (next_freq == CPUFREQ_ENTRY_INVALID)
> + return;
> +
> + policy->cur = next_freq;
> + trace_cpu_frequency(next_freq, smp_processor_id());
> + } else if (sg_policy->next_freq != next_freq) {
> + sg_policy->work_cpu = smp_processor_id();
+ sg_policy->next_freq = next_freq;
> + irq_work_queue(&sg_policy->irq_work);
> + }
> +}
Or we remain at max_f :-).
Best,
- Juri
Hi,
On 17/03/16 00:51, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
> Subject: [PATCH] cpufreq: Support for fast frequency switching
>
[...]
> +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy)
> +{
> + lockdep_assert_held(&policy->rwsem);
> +
> + mutex_lock(&cpufreq_fast_switch_lock);
> + if (policy->fast_switch_possible && cpufreq_fast_switch_count >= 0) {
> + cpufreq_fast_switch_count++;
> + policy->fast_switch_enabled = true;
> + } else {
> + pr_warn("cpufreq: CPU%u: Fast freqnency switching not enabled\n",
Ultra-minor nit: s/freqnency/frequency/
Also, is this really a warning or just a debug message? (everything
seems to work fine on Juno even if this is printed :-)).
Best,
- Juri
On Thu, Mar 17, 2016 at 01:01:45AM +0100, Rafael J. Wysocki wrote:
> + } else if (sg_policy->next_freq != next_freq) {
> + sg_policy->work_cpu = smp_processor_id();
> + irq_work_queue(&sg_policy->irq_work);
> + }
> +}
> +static void sugov_irq_work(struct irq_work *irq_work)
> +{
> + struct sugov_policy *sg_policy;
> +
> + sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
> + schedule_work_on(sg_policy->work_cpu, &sg_policy->work);
> +}
Not sure I see the point of ->work_cpu, irq_work_queue() does guarantee
the same CPU, so the above is identical to:
schedule_work_on(smp_processor_id(), &sg_policy->work);
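For reference, the simplified callback being suggested here would look like
this (a sketch only; the v6 patch below does this and restores the
work_in_progress flag):

        static void sugov_irq_work(struct irq_work *irq_work)
        {
                struct sugov_policy *sg_policy;

                sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
                /* irq_work callbacks run on the CPU that queued the irq_work. */
                schedule_work_on(smp_processor_id(), &sg_policy->work);
        }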
On Thu, Mar 17, 2016 at 11:35:07AM +0000, Juri Lelli wrote:
> > + pr_warn("cpufreq: CPU%u: Fast freqnency switching not enabled\n",
>
> Ultra-minor nit: s/freqnency/frequency/
>
> Also, is this really a warning or just a debug message? (everything
> seems to work fine on Juno even if this is printed :-)).
I would consider it a warn; this _should_ not happen. If your platform
supports fast_switch, then you really rather want to use it.
On 17/03/16 12:40, Peter Zijlstra wrote:
> On Thu, Mar 17, 2016 at 11:35:07AM +0000, Juri Lelli wrote:
>
> > > + pr_warn("cpufreq: CPU%u: Fast freqnency switching not enabled\n",
> >
> > Ultra-minor nit: s/freqnency/frequency/
> >
> > Also, is this really a warning or just a debug message? (everything
> > seems to work fine on Juno even if this is printed :-)).
>
> I would consider it a warn; this _should_ not happen. If your platform
> supports fast_switch, then you really rather want to use it.
>
Mmm, right. So, something seems not correct here, as I get this warning
when I select schedutil on Juno (that doesn't support fast_switch).
On Thu, Mar 17, 2016 at 12:48 PM, Juri Lelli <[email protected]> wrote:
> On 17/03/16 12:40, Peter Zijlstra wrote:
>> On Thu, Mar 17, 2016 at 11:35:07AM +0000, Juri Lelli wrote:
>>
>> > > + pr_warn("cpufreq: CPU%u: Fast freqnency switching not enabled\n",
>> >
>> > Ultra-minor nit: s/freqnency/frequency/
>> >
>> > Also, is this really a warning or just a debug message? (everything
>> > seems to work fine on Juno even if this is printed :-)).
>>
>> I would consider it a warn; this _should_ not happen. If your platform
>> supports fast_switch, then you really rather want to use it.
>>
>
> Mmm, right. So, something seems not correct here, as I get this warning
> when I select schedutil on Juno (that doesn't support fast_switch).
There is a mistake here. The message should not be printed if
policy->fast_switch_possible is not set.
Will fix.
On Thu, Mar 17, 2016 at 12:30 PM, Juri Lelli <[email protected]> wrote:
> Hi Rafael,
>
> On 17/03/16 01:01, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <[email protected]>
>
> [...]
>
>> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>> + unsigned int next_freq)
>> +{
>> + struct cpufreq_policy *policy = sg_policy->policy;
>> +
>> + sg_policy->last_freq_update_time = time;
>> +
>> + if (policy->fast_switch_enabled) {
>> + if (next_freq > policy->max)
>> + next_freq = policy->max;
>> + else if (next_freq < policy->min)
>> + next_freq = policy->min;
>> +
>> + if (sg_policy->next_freq == next_freq) {
>> + trace_cpu_frequency(policy->cur, smp_processor_id());
>> + return;
>> + }
>> + sg_policy->next_freq = next_freq;
>> + next_freq = cpufreq_driver_fast_switch(policy, next_freq);
>> + if (next_freq == CPUFREQ_ENTRY_INVALID)
>> + return;
>> +
>> + policy->cur = next_freq;
>> + trace_cpu_frequency(next_freq, smp_processor_id());
>> + } else if (sg_policy->next_freq != next_freq) {
>> + sg_policy->work_cpu = smp_processor_id();
>
> + sg_policy->next_freq = next_freq;
>
Doh.
>> + irq_work_queue(&sg_policy->irq_work);
>> + }
>> +}
>
> Or we remain at max_f :-).
Sure, thanks!
Will fix.
On Thu, Mar 17, 2016 at 12:36 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Mar 17, 2016 at 01:01:45AM +0100, Rafael J. Wysocki wrote:
>> + } else if (sg_policy->next_freq != next_freq) {
>> + sg_policy->work_cpu = smp_processor_id();
>> + irq_work_queue(&sg_policy->irq_work);
>> + }
>> +}
>
>> +static void sugov_irq_work(struct irq_work *irq_work)
>> +{
>> + struct sugov_policy *sg_policy;
>> +
>> + sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
>> + schedule_work_on(sg_policy->work_cpu, &sg_policy->work);
>> +}
>
> Not sure I see the point of ->work_cpu, irq_work_queue() does guarantee
> the same CPU, so the above is identical to:
>
> schedule_work_on(smp_processor_id(), &sq_policy->work);
OK
I'll do that and restore work_in_progress, then.
From: Rafael J. Wysocki <[email protected]>
Modify the ACPI cpufreq driver to provide a method for switching
CPU frequencies from interrupt context and update the cpufreq core
to support that method if available.
Introduce a new cpufreq driver callback, ->fast_switch, to be
invoked for frequency switching from interrupt context by (future)
governors supporting that feature via (new) helper function
cpufreq_driver_fast_switch().
Add two new policy flags, fast_switch_possible, to be set by the
cpufreq driver if fast frequency switching can be used for the
given policy and fast_switch_enabled, to be set by the governor
if it is going to use fast frequency switching for the given
policy. Also add a helper for setting the latter.
Since fast frequency switching is inherently incompatible with
cpufreq transition notifiers, make it possible to set the
fast_switch_enabled flag only if there are no transition notifiers
already registered and make the registration of new transition
notifiers fail if fast_switch_enabled is set for at least one
policy.
Implement the ->fast_switch callback in the ACPI cpufreq driver
and make it set fast_switch_possible during policy initialization
as appropriate.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
Addressing comments, fixes.
Changes from v5:
- cpufreq_enable_fast_switch() fixed to avoid printing a confusing message
if fast_switch_possible is not set for the policy.
- Fixed a typo in that message.
- Removed the WARN_ON() from the (cpufreq_fast_switch_count > 0) check in
cpufreq_register_notifier(), because it triggered false-positive warnings
from the cpufreq_stats module (cpufreq_stats don't work with the fast
switching, because it is based on notifiers).
Changes from v4:
- If cpufreq_enable_fast_switch() is about to fail, it will print the list
of currently registered transition notifiers.
- Added lockdep_assert_held(&policy->rwsem) to cpufreq_enable_fast_switch().
- Added WARN_ON() to the (cpufreq_fast_switch_count > 0) check in
cpufreq_register_notifier().
- Modified the kerneldoc comment of cpufreq_driver_fast_switch() to
mention the RELATION_L expectation regarding the ->fast_switch callback.
Changes from v3:
- New fast_switch_enabled field in struct cpufreq_policy to help
avoid affecting existing setups by setting the fast_switch_possible
flag in the driver.
- __cpufreq_get() skips the policy->cur check if fast_switch_enabled is set.
Changes from v2:
- The driver ->fast_switch callback and cpufreq_driver_fast_switch()
don't need the relation argument as they will always do RELATION_L now.
- New mechanism to make fast switch and cpufreq notifiers mutually
exclusive.
- cpufreq_driver_fast_switch() doesn't do anything in addition to
invoking the driver callback and returns its return value.
---
drivers/cpufreq/acpi-cpufreq.c | 41 ++++++++++++
drivers/cpufreq/cpufreq.c | 130 ++++++++++++++++++++++++++++++++++++++---
include/linux/cpufreq.h | 9 ++
3 files changed, 171 insertions(+), 9 deletions(-)
Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
+++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
@@ -458,6 +458,43 @@ static int acpi_cpufreq_target(struct cp
return result;
}
+unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq)
+{
+ struct acpi_cpufreq_data *data = policy->driver_data;
+ struct acpi_processor_performance *perf;
+ struct cpufreq_frequency_table *entry;
+ unsigned int next_perf_state, next_freq, freq;
+
+ /*
+ * Find the closest frequency above target_freq.
+ *
+ * The table is sorted in the reverse order with respect to the
+ * frequency and all of the entries are valid (see the initialization).
+ */
+ entry = data->freq_table;
+ do {
+ entry++;
+ freq = entry->frequency;
+ } while (freq >= target_freq && freq != CPUFREQ_TABLE_END);
+ entry--;
+ next_freq = entry->frequency;
+ next_perf_state = entry->driver_data;
+
+ perf = to_perf_data(data);
+ if (perf->state == next_perf_state) {
+ if (unlikely(data->resume))
+ data->resume = 0;
+ else
+ return next_freq;
+ }
+
+ data->cpu_freq_write(&perf->control_register,
+ perf->states[next_perf_state].control);
+ perf->state = next_perf_state;
+ return next_freq;
+}
+
static unsigned long
acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu)
{
@@ -740,6 +777,9 @@ static int acpi_cpufreq_cpu_init(struct
goto err_unreg;
}
+ policy->fast_switch_possible = !acpi_pstate_strict &&
+ !(policy_is_shared(policy) && policy->shared_type != CPUFREQ_SHARED_TYPE_ANY);
+
data->freq_table = kzalloc(sizeof(*data->freq_table) *
(perf->state_count+1), GFP_KERNEL);
if (!data->freq_table) {
@@ -874,6 +914,7 @@ static struct freq_attr *acpi_cpufreq_at
static struct cpufreq_driver acpi_cpufreq_driver = {
.verify = cpufreq_generic_frequency_table_verify,
.target_index = acpi_cpufreq_target,
+ .fast_switch = acpi_cpufreq_fast_switch,
.bios_limit = acpi_processor_get_bios_limit,
.init = acpi_cpufreq_cpu_init,
.exit = acpi_cpufreq_cpu_exit,
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -102,6 +102,10 @@ struct cpufreq_policy {
*/
struct rw_semaphore rwsem;
+ /* Fast switch flags */
+ bool fast_switch_possible; /* Set by the driver. */
+ bool fast_switch_enabled;
+
/* Synchronization for frequency transitions */
bool transition_ongoing; /* Tracks transition status */
spinlock_t transition_lock;
@@ -156,6 +160,7 @@ int cpufreq_get_policy(struct cpufreq_po
int cpufreq_update_policy(unsigned int cpu);
bool have_governor_per_policy(void);
struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
+void cpufreq_enable_fast_switch(struct cpufreq_policy *policy);
#else
static inline unsigned int cpufreq_get(unsigned int cpu)
{
@@ -236,6 +241,8 @@ struct cpufreq_driver {
unsigned int relation); /* Deprecated */
int (*target_index)(struct cpufreq_policy *policy,
unsigned int index);
+ unsigned int (*fast_switch)(struct cpufreq_policy *policy,
+ unsigned int target_freq);
/*
* Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION
* unset.
@@ -464,6 +471,8 @@ struct cpufreq_governor {
};
/* Pass a target to the cpufreq driver */
+unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq);
int cpufreq_driver_target(struct cpufreq_policy *policy,
unsigned int target_freq,
unsigned int relation);
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -428,6 +428,57 @@ void cpufreq_freq_transition_end(struct
}
EXPORT_SYMBOL_GPL(cpufreq_freq_transition_end);
+/*
+ * Fast frequency switching status count. Positive means "enabled", negative
+ * means "disabled" and 0 means "not decided yet".
+ */
+static int cpufreq_fast_switch_count;
+static DEFINE_MUTEX(cpufreq_fast_switch_lock);
+
+static void cpufreq_list_transition_notifiers(void)
+{
+ struct notifier_block *nb;
+
+ pr_info("cpufreq: Registered transition notifiers:\n");
+
+ mutex_lock(&cpufreq_transition_notifier_list.mutex);
+
+ for (nb = cpufreq_transition_notifier_list.head; nb; nb = nb->next)
+ pr_info("cpufreq: %pF\n", nb->notifier_call);
+
+ mutex_unlock(&cpufreq_transition_notifier_list.mutex);
+}
+
+/**
+ * cpufreq_enable_fast_switch - Enable fast frequency switching for policy.
+ * @policy: cpufreq policy to enable fast frequency switching for.
+ *
+ * Try to enable fast frequency switching for @policy.
+ *
+ * The attempt will fail if there is at least one transition notifier registered
+ * at this point, as fast frequency switching is quite fundamentally at odds
+ * with transition notifiers. Thus if successful, it will make registration of
+ * transition notifiers fail going forward.
+ */
+void cpufreq_enable_fast_switch(struct cpufreq_policy *policy)
+{
+ lockdep_assert_held(&policy->rwsem);
+
+ if (!policy->fast_switch_possible)
+ return;
+
+ mutex_lock(&cpufreq_fast_switch_lock);
+ if (cpufreq_fast_switch_count >= 0) {
+ cpufreq_fast_switch_count++;
+ policy->fast_switch_enabled = true;
+ } else {
+ pr_warn("cpufreq: CPU%u: Fast frequency switching not enabled\n",
+ policy->cpu);
+ cpufreq_list_transition_notifiers();
+ }
+ mutex_unlock(&cpufreq_fast_switch_lock);
+}
+EXPORT_SYMBOL_GPL(cpufreq_enable_fast_switch);
/*********************************************************************
* SYSFS INTERFACE *
@@ -1083,6 +1134,24 @@ static void cpufreq_policy_free(struct c
kfree(policy);
}
+static void cpufreq_driver_exit_policy(struct cpufreq_policy *policy)
+{
+ if (policy->fast_switch_enabled) {
+ mutex_lock(&cpufreq_fast_switch_lock);
+
+ policy->fast_switch_enabled = false;
+ if (!WARN_ON(cpufreq_fast_switch_count <= 0))
+ cpufreq_fast_switch_count--;
+
+ mutex_unlock(&cpufreq_fast_switch_lock);
+ }
+
+ if (cpufreq_driver->exit) {
+ cpufreq_driver->exit(policy);
+ policy->freq_table = NULL;
+ }
+}
+
static int cpufreq_online(unsigned int cpu)
{
struct cpufreq_policy *policy;
@@ -1236,8 +1305,7 @@ static int cpufreq_online(unsigned int c
out_exit_policy:
up_write(&policy->rwsem);
- if (cpufreq_driver->exit)
- cpufreq_driver->exit(policy);
+ cpufreq_driver_exit_policy(policy);
out_free_policy:
cpufreq_policy_free(policy, !new_policy);
return ret;
@@ -1334,10 +1402,7 @@ static void cpufreq_offline(unsigned int
* since this is a core component, and is essential for the
* subsequent light-weight ->init() to succeed.
*/
- if (cpufreq_driver->exit) {
- cpufreq_driver->exit(policy);
- policy->freq_table = NULL;
- }
+ cpufreq_driver_exit_policy(policy);
unlock:
up_write(&policy->rwsem);
@@ -1444,8 +1509,12 @@ static unsigned int __cpufreq_get(struct
ret_freq = cpufreq_driver->get(policy->cpu);
- /* Updating inactive policies is invalid, so avoid doing that. */
- if (unlikely(policy_is_inactive(policy)))
+ /*
+ * Updating inactive policies is invalid, so avoid doing that. Also
+ * if fast frequency switching is used with the given policy, the check
+ * against policy->cur is pointless, so skip it in that case too.
+ */
+ if (unlikely(policy_is_inactive(policy)) || policy->fast_switch_enabled)
return ret_freq;
if (ret_freq && policy->cur &&
@@ -1457,7 +1526,6 @@ static unsigned int __cpufreq_get(struct
schedule_work(&policy->update);
}
}
-
return ret_freq;
}
@@ -1653,8 +1721,18 @@ int cpufreq_register_notifier(struct not
switch (list) {
case CPUFREQ_TRANSITION_NOTIFIER:
+ mutex_lock(&cpufreq_fast_switch_lock);
+
+ if (cpufreq_fast_switch_count > 0) {
+ mutex_unlock(&cpufreq_fast_switch_lock);
+ return -EBUSY;
+ }
ret = srcu_notifier_chain_register(
&cpufreq_transition_notifier_list, nb);
+ if (!ret)
+ cpufreq_fast_switch_count--;
+
+ mutex_unlock(&cpufreq_fast_switch_lock);
break;
case CPUFREQ_POLICY_NOTIFIER:
ret = blocking_notifier_chain_register(
@@ -1687,8 +1765,14 @@ int cpufreq_unregister_notifier(struct n
switch (list) {
case CPUFREQ_TRANSITION_NOTIFIER:
+ mutex_lock(&cpufreq_fast_switch_lock);
+
ret = srcu_notifier_chain_unregister(
&cpufreq_transition_notifier_list, nb);
+ if (!ret && !WARN_ON(cpufreq_fast_switch_count >= 0))
+ cpufreq_fast_switch_count++;
+
+ mutex_unlock(&cpufreq_fast_switch_lock);
break;
case CPUFREQ_POLICY_NOTIFIER:
ret = blocking_notifier_chain_unregister(
@@ -1707,6 +1791,34 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie
* GOVERNORS *
*********************************************************************/
+/**
+ * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch.
+ * @policy: cpufreq policy to switch the frequency for.
+ * @target_freq: New frequency to set (may be approximate).
+ *
+ * Carry out a fast frequency switch from interrupt context.
+ *
+ * The driver's ->fast_switch() callback invoked by this function is expected to
+ * select the minimum available frequency greater than or equal to @target_freq
+ * (CPUFREQ_RELATION_L).
+ *
+ * This function must not be called if policy->fast_switch_enabled is unset.
+ *
+ * Governors calling this function must guarantee that it will never be invoked
+ * twice in parallel for the same policy and that it will never be called in
+ * parallel with either ->target() or ->target_index() for the same policy.
+ *
+ * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch()
+ * callback to indicate an error condition, the hardware configuration must be
+ * preserved.
+ */
+unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq)
+{
+ return cpufreq_driver->fast_switch(policy, target_freq);
+}
+EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch);
+
/* Must set freqs->new to intermediate frequency */
static int __target_intermediate(struct cpufreq_policy *policy,
struct cpufreq_freqs *freqs, int index)
From: Rafael J. Wysocki <[email protected]>
Subject: [PATCH] cpufreq: schedutil: New governor based on scheduler utilization data
Add a new cpufreq scaling governor, called "schedutil", that uses
scheduler-provided CPU utilization information as input for making
its decisions.
Doing that is possible after commit 34e2c555f3e1 (cpufreq: Add
mechanism for registering utilization update callbacks) that
introduced cpufreq_update_util() called by the scheduler on
utilization changes (from CFS) and RT/DL task status updates.
In particular, CPU frequency scaling decisions may be based on
the utilization data passed to cpufreq_update_util() by CFS.
The new governor is relatively simple.
The frequency selection formula used by it depends on whether or not
the utilization is frequency-invariant. In the frequency-invariant
case the new CPU frequency is given by
next_freq = 1.25 * max_freq * util / max
where util and max are the last two arguments of cpufreq_update_util().
In turn, if util is not frequency-invariant, the maximum frequency in
the above formula is replaced with the current frequency of the CPU:
next_freq = 1.25 * curr_freq * util / max
The coefficient 1.25 corresponds to the frequency tipping point at
(util / max) = 0.8.
All of the computations are carried out in the utilization update
handlers provided by the new governor. One of those handlers is
used for cpufreq policies shared between multiple CPUs and the other
one is for policies with one CPU only (and therefore it doesn't need
to use any extra synchronization means).
The governor supports fast frequency switching if that is supported
by the cpufreq driver in use and possible for the given policy.
In the fast switching case, all operations of the governor take
place in its utilization update handlers. If fast switching cannot
be used, the frequency switch operations are carried out with the
help of a work item which only calls __cpufreq_driver_target()
(under a mutex) to trigger a frequency update (to a value already
computed beforehand in one of the utilization update handlers).
Currently, the governor treats all of the RT and DL tasks as
"unknown utilization" and sets the frequency to the allowed
maximum when updated from the RT or DL sched classes. That
heavy-handed approach should be replaced with something more
subtle and specifically targeted at RT and DL tasks.
The governor shares some tunables management code with the
"ondemand" and "conservative" governors and uses some common
definitions from cpufreq_governor.h, but apart from that it
is stand-alone.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
Addressing comments from Peter and Juri, fixes.
Changes from v5:
- Fixed sugov_update_commit() to set sg_policy->next_freq properly
in the "work item" branch.
- Used smp_processor_id() in sugov_irq_work() and restored work_in_progress.
Changes from v4:
- Use TICK_NSEC in sugov_next_freq_shared().
- Use schedule_work_on() to schedule work items and replace
work_in_progress with work_cpu (which is used both for scheduling
work items and as a "work in progress" marker).
- Rearrange sugov_update_commit() to only check policy->min/max if
fast switching is enabled.
- Replace util > max checks with util == ULONG_MAX checks to make
it clear that they are about a special case (RT/DL).
Changes from v3:
- The "next frequency" formula based on
http://marc.info/?l=linux-acpi&m=145756618321500&w=4 and
http://marc.info/?l=linux-kernel&m=145760739700716&w=4
- The governor goes into kernel/sched/ (again).
Changes from v2:
- The governor goes into drivers/cpufreq/.
- The "next frequency" formula has an additional 1.1 factor to allow
more util/max values to map onto the top-most frequency in case the
distance between that and the previous one is disproportionately small.
- sugov_update_commit() traces CPU frequency even if the new one is
the same as the previous one (otherwise, if the system is 100% loaded
for long enough, powertop starts to report that all CPUs are 100% idle).
---
drivers/cpufreq/Kconfig | 26 +
kernel/sched/Makefile | 1
kernel/sched/cpufreq_schedutil.c | 528 +++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 8
4 files changed, 563 insertions(+)
Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -107,6 +107,16 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
Be aware that not all cpufreq drivers support the conservative
governor. If unsure have a look at the help section of the
driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+ bool "schedutil"
+ select CPU_FREQ_GOV_SCHEDUTIL
+ select CPU_FREQ_GOV_PERFORMANCE
+ help
+ Use the 'schedutil' CPUFreq governor by default. If unsure,
+ have a look at the help section of that governor. The fallback
+ governor will be 'performance'.
+
endchoice
config CPU_FREQ_GOV_PERFORMANCE
@@ -188,6 +198,22 @@ config CPU_FREQ_GOV_CONSERVATIVE
If in doubt, say N.
+config CPU_FREQ_GOV_SCHEDUTIL
+ tristate "'schedutil' cpufreq policy governor"
+ depends on CPU_FREQ
+ select CPU_FREQ_GOV_ATTR_SET
+ select IRQ_WORK
+ help
+ The frequency selection formula used by this governor is analogous
+ to the one used by 'ondemand', but instead of computing CPU load
+ as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
+ utilization data provided by the scheduler as input.
+
+ To compile this driver as a module, choose M here: the
+ module will be called cpufreq_schedutil.
+
+ If in doubt, say N.
+
comment "CPU frequency scaling drivers"
config CPUFREQ_DT
Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- /dev/null
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -0,0 +1,528 @@
+/*
+ * CPUFreq governor based on scheduler-provided CPU utilization data.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <trace/events/power.h>
+
+#include "sched.h"
+
+struct sugov_tunables {
+ struct gov_attr_set attr_set;
+ unsigned int rate_limit_us;
+};
+
+struct sugov_policy {
+ struct cpufreq_policy *policy;
+
+ struct sugov_tunables *tunables;
+ struct list_head tunables_hook;
+
+ raw_spinlock_t update_lock; /* For shared policies */
+ u64 last_freq_update_time;
+ s64 freq_update_delay_ns;
+ unsigned int next_freq;
+
+ /* The next fields are only needed if fast switch cannot be used. */
+ struct irq_work irq_work;
+ struct work_struct work;
+ struct mutex work_lock;
+ bool work_in_progress;
+
+ bool need_freq_update;
+};
+
+struct sugov_cpu {
+ struct update_util_data update_util;
+ struct sugov_policy *sg_policy;
+
+ /* The fields below are only needed when sharing a policy. */
+ unsigned long util;
+ unsigned long max;
+ u64 last_update;
+};
+
+static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
+
+/************************ Governor internals ***********************/
+
+static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
+{
+ u64 delta_ns;
+
+ if (sg_policy->work_in_progress)
+ return false;
+
+ if (unlikely(sg_policy->need_freq_update)) {
+ sg_policy->need_freq_update = false;
+ return true;
+ }
+
+ delta_ns = time - sg_policy->last_freq_update_time;
+ return (s64)delta_ns >= sg_policy->freq_update_delay_ns;
+}
+
+static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
+ unsigned int next_freq)
+{
+ struct cpufreq_policy *policy = sg_policy->policy;
+
+ sg_policy->last_freq_update_time = time;
+
+ if (policy->fast_switch_enabled) {
+ if (next_freq > policy->max)
+ next_freq = policy->max;
+ else if (next_freq < policy->min)
+ next_freq = policy->min;
+
+ if (sg_policy->next_freq == next_freq) {
+ trace_cpu_frequency(policy->cur, smp_processor_id());
+ return;
+ }
+ sg_policy->next_freq = next_freq;
+ next_freq = cpufreq_driver_fast_switch(policy, next_freq);
+ if (next_freq == CPUFREQ_ENTRY_INVALID)
+ return;
+
+ policy->cur = next_freq;
+ trace_cpu_frequency(next_freq, smp_processor_id());
+ } else if (sg_policy->next_freq != next_freq) {
+ sg_policy->next_freq = next_freq;
+ sg_policy->work_in_progress = true;
+ irq_work_queue(&sg_policy->irq_work);
+ }
+}
+
+/**
+ * get_next_freq - Compute a new frequency for a given cpufreq policy.
+ * @policy: cpufreq policy object to compute the new frequency for.
+ * @util: Current CPU utilization.
+ * @max: CPU capacity.
+ *
+ * If the utilization is frequency-invariant, choose the new frequency to be
+ * proportional to it, that is
+ *
+ * next_freq = C * max_freq * util / max
+ *
+ * Otherwise, approximate the would-be frequency-invariant utilization by
+ * util_raw * (curr_freq / max_freq) which leads to
+ *
+ * next_freq = C * curr_freq * util_raw / max
+ *
+ * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8.
+ */
+static unsigned int get_next_freq(struct cpufreq_policy *policy,
+ unsigned long util, unsigned long max)
+{
+ unsigned int freq = arch_scale_freq_invariant() ?
+ policy->cpuinfo.max_freq : policy->cur;
+
+ return (freq + (freq >> 2)) * util / max;
+}
+
+static void sugov_update_single(struct update_util_data *hook, u64 time,
+ unsigned long util, unsigned long max)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ struct cpufreq_policy *policy = sg_policy->policy;
+ unsigned int next_f;
+
+ if (!sugov_should_update_freq(sg_policy, time))
+ return;
+
+ next_f = util == ULONG_MAX ? policy->cpuinfo.max_freq :
+ get_next_freq(policy, util, max);
+ sugov_update_commit(sg_policy, time, next_f);
+}
+
+static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
+ unsigned long util, unsigned long max)
+{
+ struct cpufreq_policy *policy = sg_policy->policy;
+ unsigned int max_f = policy->cpuinfo.max_freq;
+ u64 last_freq_update_time = sg_policy->last_freq_update_time;
+ unsigned int j;
+
+ if (util == ULONG_MAX)
+ return max_f;
+
+ for_each_cpu(j, policy->cpus) {
+ struct sugov_cpu *j_sg_cpu;
+ unsigned long j_util, j_max;
+ u64 delta_ns;
+
+ if (j == smp_processor_id())
+ continue;
+
+ j_sg_cpu = &per_cpu(sugov_cpu, j);
+ /*
+ * If the CPU utilization was last updated before the previous
+ * frequency update and the time elapsed between the last update
+ * of the CPU utilization and the last frequency update is long
+ * enough, don't take the CPU into account as it probably is
+ * idle now.
+ */
+ delta_ns = last_freq_update_time - j_sg_cpu->last_update;
+ if ((s64)delta_ns > TICK_NSEC)
+ continue;
+
+ j_util = j_sg_cpu->util;
+ if (j_util == ULONG_MAX)
+ return max_f;
+
+ j_max = j_sg_cpu->max;
+ if (j_util * max > j_max * util) {
+ util = j_util;
+ max = j_max;
+ }
+ }
+
+ return get_next_freq(policy, util, max);
+}
+
+static void sugov_update_shared(struct update_util_data *hook, u64 time,
+ unsigned long util, unsigned long max)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ unsigned int next_f;
+
+ raw_spin_lock(&sg_policy->update_lock);
+
+ sg_cpu->util = util;
+ sg_cpu->max = max;
+ sg_cpu->last_update = time;
+
+ if (sugov_should_update_freq(sg_policy, time)) {
+ next_f = sugov_next_freq_shared(sg_policy, util, max);
+ sugov_update_commit(sg_policy, time, next_f);
+ }
+
+ raw_spin_unlock(&sg_policy->update_lock);
+}
+
+static void sugov_work(struct work_struct *work)
+{
+ struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
+
+ mutex_lock(&sg_policy->work_lock);
+ __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
+ CPUFREQ_RELATION_L);
+ mutex_unlock(&sg_policy->work_lock);
+
+ sg_policy->work_in_progress = false;
+}
+
+static void sugov_irq_work(struct irq_work *irq_work)
+{
+ struct sugov_policy *sg_policy;
+
+ sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
+ schedule_work_on(smp_processor_id(), &sg_policy->work);
+}
+
+/************************** sysfs interface ************************/
+
+static struct sugov_tunables *global_tunables;
+static DEFINE_MUTEX(global_tunables_lock);
+
+static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set)
+{
+ return container_of(attr_set, struct sugov_tunables, attr_set);
+}
+
+static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+
+ return sprintf(buf, "%u\n", tunables->rate_limit_us);
+}
+
+static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf,
+ size_t count)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+ struct sugov_policy *sg_policy;
+ unsigned int rate_limit_us;
+ int ret;
+
+ ret = sscanf(buf, "%u", &rate_limit_us);
+ if (ret != 1)
+ return -EINVAL;
+
+ tunables->rate_limit_us = rate_limit_us;
+
+ list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
+ sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
+
+ return count;
+}
+
+static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
+
+static struct attribute *sugov_attributes[] = {
+ &rate_limit_us.attr,
+ NULL
+};
+
+static struct kobj_type sugov_tunables_ktype = {
+ .default_attrs = sugov_attributes,
+ .sysfs_ops = &governor_sysfs_ops,
+};
+
+/********************** cpufreq governor interface *********************/
+
+static struct cpufreq_governor schedutil_gov;
+
+static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy;
+
+ sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
+ if (!sg_policy)
+ return NULL;
+
+ sg_policy->policy = policy;
+ init_irq_work(&sg_policy->irq_work, sugov_irq_work);
+ INIT_WORK(&sg_policy->work, sugov_work);
+ mutex_init(&sg_policy->work_lock);
+ raw_spin_lock_init(&sg_policy->update_lock);
+ return sg_policy;
+}
+
+static void sugov_policy_free(struct sugov_policy *sg_policy)
+{
+ mutex_destroy(&sg_policy->work_lock);
+ kfree(sg_policy);
+}
+
+static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy)
+{
+ struct sugov_tunables *tunables;
+
+ tunables = kzalloc(sizeof(*tunables), GFP_KERNEL);
+ if (tunables)
+ gov_attr_set_init(&tunables->attr_set, &sg_policy->tunables_hook);
+
+ return tunables;
+}
+
+static void sugov_tunables_free(struct sugov_tunables *tunables)
+{
+ if (!have_governor_per_policy())
+ global_tunables = NULL;
+
+ kfree(tunables);
+}
+
+static int sugov_init(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy;
+ struct sugov_tunables *tunables;
+ unsigned int lat;
+ int ret = 0;
+
+ /* State should be equivalent to EXIT */
+ if (policy->governor_data)
+ return -EBUSY;
+
+ sg_policy = sugov_policy_alloc(policy);
+ if (!sg_policy)
+ return -ENOMEM;
+
+ mutex_lock(&global_tunables_lock);
+
+ if (global_tunables) {
+ if (WARN_ON(have_governor_per_policy())) {
+ ret = -EINVAL;
+ goto free_sg_policy;
+ }
+ policy->governor_data = sg_policy;
+ sg_policy->tunables = global_tunables;
+
+ gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook);
+ goto out;
+ }
+
+ tunables = sugov_tunables_alloc(sg_policy);
+ if (!tunables) {
+ ret = -ENOMEM;
+ goto free_sg_policy;
+ }
+
+ tunables->rate_limit_us = LATENCY_MULTIPLIER;
+ lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
+ if (lat)
+ tunables->rate_limit_us *= lat;
+
+ if (!have_governor_per_policy())
+ global_tunables = tunables;
+
+ policy->governor_data = sg_policy;
+ sg_policy->tunables = tunables;
+
+ ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
+ get_governor_parent_kobj(policy), "%s",
+ schedutil_gov.name);
+ if (!ret)
+ goto out;
+
+ /* Failure, so roll back. */
+ policy->governor_data = NULL;
+ sugov_tunables_free(tunables);
+
+ free_sg_policy:
+ pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
+ sugov_policy_free(sg_policy);
+
+ out:
+ mutex_unlock(&global_tunables_lock);
+ return ret;
+}
+
+static int sugov_exit(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ struct sugov_tunables *tunables = sg_policy->tunables;
+ unsigned int count;
+
+ mutex_lock(&global_tunables_lock);
+
+ count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook);
+ policy->governor_data = NULL;
+ if (!count)
+ sugov_tunables_free(tunables);
+
+ mutex_unlock(&global_tunables_lock);
+
+ sugov_policy_free(sg_policy);
+ return 0;
+}
+
+static int sugov_start(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ unsigned int cpu;
+
+ cpufreq_enable_fast_switch(policy);
+
+ sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+ sg_policy->last_freq_update_time = 0;
+ sg_policy->next_freq = UINT_MAX;
+ sg_policy->work_in_progress = false;
+ sg_policy->need_freq_update = false;
+
+ for_each_cpu(cpu, policy->cpus) {
+ struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
+
+ sg_cpu->sg_policy = sg_policy;
+ if (policy_is_shared(policy)) {
+ sg_cpu->util = ULONG_MAX;
+ sg_cpu->max = 0;
+ sg_cpu->last_update = 0;
+ cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
+ sugov_update_shared);
+ } else {
+ cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
+ sugov_update_single);
+ }
+ }
+ return 0;
+}
+
+static int sugov_stop(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+ unsigned int cpu;
+
+ for_each_cpu(cpu, policy->cpus)
+ cpufreq_remove_update_util_hook(cpu);
+
+ synchronize_sched();
+
+ irq_work_sync(&sg_policy->irq_work);
+ cancel_work_sync(&sg_policy->work);
+ return 0;
+}
+
+static int sugov_limits(struct cpufreq_policy *policy)
+{
+ struct sugov_policy *sg_policy = policy->governor_data;
+
+ if (!policy->fast_switch_enabled) {
+ mutex_lock(&sg_policy->work_lock);
+
+ if (policy->max < policy->cur)
+ __cpufreq_driver_target(policy, policy->max,
+ CPUFREQ_RELATION_H);
+ else if (policy->min > policy->cur)
+ __cpufreq_driver_target(policy, policy->min,
+ CPUFREQ_RELATION_L);
+
+ mutex_unlock(&sg_policy->work_lock);
+ }
+
+ sg_policy->need_freq_update = true;
+ return 0;
+}
+
+int sugov_governor(struct cpufreq_policy *policy, unsigned int event)
+{
+ if (event == CPUFREQ_GOV_POLICY_INIT) {
+ return sugov_init(policy);
+ } else if (policy->governor_data) {
+ switch (event) {
+ case CPUFREQ_GOV_POLICY_EXIT:
+ return sugov_exit(policy);
+ case CPUFREQ_GOV_START:
+ return sugov_start(policy);
+ case CPUFREQ_GOV_STOP:
+ return sugov_stop(policy);
+ case CPUFREQ_GOV_LIMITS:
+ return sugov_limits(policy);
+ }
+ }
+ return -EINVAL;
+}
+
+static struct cpufreq_governor schedutil_gov = {
+ .name = "schedutil",
+ .governor = sugov_governor,
+ .owner = THIS_MODULE,
+};
+
+static int __init sugov_module_init(void)
+{
+ return cpufreq_register_governor(&schedutil_gov);
+}
+
+static void __exit sugov_module_exit(void)
+{
+ cpufreq_unregister_governor(&schedutil_gov);
+}
+
+MODULE_AUTHOR("Rafael J. Wysocki <[email protected]>");
+MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
+MODULE_LICENSE("GPL");
+
+#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+struct cpufreq_governor *cpufreq_default_governor(void)
+{
+ return &schedutil_gov;
+}
+
+fs_initcall(sugov_module_init);
+#else
+module_init(sugov_module_init);
+#endif
+module_exit(sugov_module_exit);
Index: linux-pm/kernel/sched/Makefile
===================================================================
--- linux-pm.orig/kernel/sched/Makefile
+++ linux-pm/kernel/sched/Makefile
@@ -20,3 +20,4 @@ obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
Index: linux-pm/kernel/sched/sched.h
===================================================================
--- linux-pm.orig/kernel/sched/sched.h
+++ linux-pm/kernel/sched/sched.h
@@ -1786,3 +1786,11 @@ static inline void cpufreq_trigger_updat
static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) {}
static inline void cpufreq_trigger_update(u64 time) {}
#endif /* CONFIG_CPU_FREQ */
+
+#ifdef arch_scale_freq_capacity
+#ifndef arch_scale_freq_invariant
+#define arch_scale_freq_invariant() (true)
+#endif
+#else /* arch_scale_freq_capacity */
+#define arch_scale_freq_invariant() (false)
+#endif
Hi Rafael, all,
I have (yet another) consideration regarding the definition of the
margin for the frequency selection.
On 17-Mar 17:01, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
> Subject: [PATCH] cpufreq: schedutil: New governor based on scheduler utilization data
>
> Add a new cpufreq scaling governor, called "schedutil", that uses
> scheduler-provided CPU utilization information as input for making
> its decisions.
>
> Doing that is possible after commit 34e2c555f3e1 (cpufreq: Add
> mechanism for registering utilization update callbacks) that
> introduced cpufreq_update_util() called by the scheduler on
> utilization changes (from CFS) and RT/DL task status updates.
> In particular, CPU frequency scaling decisions may be based on
> the utilization data passed to cpufreq_update_util() by CFS.
>
> The new governor is relatively simple.
>
> The frequency selection formula used by it depends on whether or not
> the utilization is frequency-invariant. In the frequency-invariant
> case the new CPU frequency is given by
>
> next_freq = 1.25 * max_freq * util / max
>
> where util and max are the last two arguments of cpufreq_update_util().
> In turn, if util is not frequency-invariant, the maximum frequency in
> the above formula is replaced with the current frequency of the CPU:
>
> next_freq = 1.25 * curr_freq * util / max
>
> The coefficient 1.25 corresponds to the frequency tipping point at
> (util / max) = 0.8.
In both these formulas the OPP jump is driven by a margin which is
effectively proportional to the capacity of the current OPP.
For example, if we consider a simple system with this set of OPPs:
[200, 400, 600, 800, 1000] MHz
and we apply the formula for the frequency-invariant case, we get:
   util/max   min_opp   min_util   margin
     1.0        1000      0.80       20%
     0.8         800      0.64       16%
     0.6         600      0.48       12%
     0.4         400      0.32        8%
     0.2         200      0.16        4%
Where:
- min_opp: the minimum OPP which can satisfy a (util/max) capacity
  request
- min_util: the minimum utilization value which effectively triggers
  a switch to the next higher OPP
- margin: the effective capacity margin available while remaining at
  min_opp
This means that when running at the lowest OPP we can build up to 16%
utilization (i.e. 4% less than the capacity of that OPP) before
jumping to the next OPP. But, for example, when running at the 800MHz
OPP we need to build up just 4% of utilization beyond what the 600MHz
OPP could serve (i.e. reach 0.64, which is 16% less than the capacity
of the 800MHz OPP) to jump up again.
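To make the numbers above easy to reproduce, here is a quick
userspace sketch (plain C, not kernel code); the OPP list and the
1.25 coefficient are taken from the example above, everything else
is just for illustration:

#include <stdio.h>

int main(void)
{
	const unsigned int opps[] = { 200, 400, 600, 800, 1000 };	/* MHz */
	const unsigned int max_freq = 1000;				/* MHz */
	unsigned int i;

	printf("util/max  min_opp  min_util  margin\n");
	for (i = 0; i < sizeof(opps) / sizeof(opps[0]); i++) {
		double cap = (double)opps[i] / max_freq;
		/* jump above opps[i] once 1.25 * max_freq * util / max > opps[i] */
		double min_util = cap / 1.25;	/* = 0.8 * cap */
		double margin = cap - min_util;	/* grows with the OPP */

		printf("  %.1f      %4u     %.2f      %2.0f%%\n",
		       cap, opps[i], min_util, margin * 100);
	}
	return 0;
}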
This is a really simple example, with OPPs that are evenly spaced.
However, the question is: does it really make sense to have different
effective margins depending on the starting OPP?
AFAIU, this solution biases the frequency selection towards higher
OPPs. The higher the utilization of a CPU, the more likely we are to
run at an OPP above the minimum one required.
The advantage is a reduced time to reach the highest OPP, which can
be beneficial for performance-oriented workloads. The disadvantage is
instead a quite likely reduction of residency in the mid-range OPPs.
We should also consider that, at least in its current implementation,
PELT "builds up" more slowly when running at lower OPPs, which further
amplifies this imbalance in OPP residencies.
IMO, biasing the selection of one OPP over another is something that
sounds more like a "policy" than a "mechanism". Since the goal here
should be to provide just a mechanism, perhaps a different approach
can be evaluated.
Have we ever considered using a "constant margin" for each OPP?
The value of such a margin can still be defined as a (configurable)
percentage of the max (or min) OPP. But once defined, the same margin
would be used to decide when to switch to the next OPP.
In the previous example, considering a 5% margin wrt the max capacity,
these are the new margins:
   util/max   min_opp   min_util   margin
     1.0        1000      0.95        5%
     0.8         800      0.75        5%
     0.6         600      0.55        5%
     0.4         400      0.35        5%
     0.2         200      0.15        5%
That means that, whether running at the lowest OPP or at a mid-range
one, we always need to build up the same amount of utilization before
switching to the next one.
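The same toy sketch, adapted for the constant-margin case (again
purely illustrative, with the 5% figure taken from the example above):

#include <stdio.h>

int main(void)
{
	const unsigned int opps[] = { 200, 400, 600, 800, 1000 };	/* MHz */
	const double margin = 0.05;	/* 5% of the max capacity */
	unsigned int i;

	printf("util/max  min_opp  min_util  margin\n");
	for (i = 0; i < sizeof(opps) / sizeof(opps[0]); i++) {
		double cap = opps[i] / 1000.0;

		/* the switch threshold is now a fixed distance below cap */
		printf("  %.1f      %4u     %.2f       5%%\n",
		       cap, opps[i], cap - margin);
	}
	return 0;
}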
How does this translate into residency times? That is still affected
by PELT's behavior when running at different OPPs, but IMO it should
improve the fairness of OPP selection a bit.
Moreover, from an implementation standpoint, what is now a couple of
multiplications and a comparison can potentially be reduced to a
single comparison, e.g.
   next_freq = util > (curr_cap - margin)
                    ? curr_freq + 1
                    : curr_freq
where margin is pre-computed, for example as 51 (i.e. 5% of 1024),
and (curr_cap - margin) can likewise be pre-computed and cached at
each OPP change.
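For completeness, this is roughly the shape such a check could take.
To be clear, the structure and helper below are hypothetical and not
part of the patch above; they only spell out the single-comparison
idea, with stepping down and the recomputation of the cached
threshold on an OPP change left out, as in the pseudocode:

/* Hypothetical sketch only; field and function names are made up. */
struct cm_state {
	unsigned long cached_thresh;	/* curr_cap - margin, capacity units */
	unsigned int cur_idx;		/* index into an ascending freq table */
};

static unsigned int cm_next_freq(struct cm_state *s,
				 const unsigned int *freq_table,
				 unsigned int nr_freqs,
				 unsigned long util)
{
	/* A single comparison against the pre-computed threshold. */
	if (util > s->cached_thresh && s->cur_idx + 1 < nr_freqs)
		s->cur_idx++;	/* step up one OPP */

	return freq_table[s->cur_idx];
}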
--
#include <best/regards.h>
Patrick Bellasi