Running a well-known power/performance benchmark shows that the current
ondemand governor is not power efficient. Even when the workload is at only
10%~20% of full capacity, the CPU still spends much of its time at the highest
frequency, although in this situation the lowest frequency can often meet the
user's requirement. When I ran this benchmark on a turbo-mode-enabled machine
and compared the results of the different governors, the results of the
ondemand and performance governors were the closest. There is not much power
saving between the ondemand and performance governors. If we ignore that small
power saving, the performance governor is even better than the ondemand
governor, at least in terms of performance.
One potential reason why the ondemand governor is not power efficient is that
it decides the next target frequency from the instantaneous requirement during
one sampling interval (10ms, or possibly a little longer because of the
deferrable timer in tickless idle). The instantaneous requirement responds
quickly to workload changes, but it does not usually reflect the workload's
real CPU usage requirement over a slightly longer period, and it can cause
frequent switching between the highest and lowest frequencies.
This patchset adds a sampling window to the per-CPU ondemand thread. Each
sampling window holds at most 150 records; it slides forward every sampling
interval and is used to track the workload requirement during the latest
sampling window timeframe.
The average workload during the latest sampling window is used to decide the
next target frequency. The sampling window is intended to reflect the
workload's CPU usage requirement more truthfully.
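As a rough user-space illustration of the sliding window described above (a
sketch only, not the code in this patchset): the names sw_sample, sw_window,
sw_evict_oldest and sw_add are hypothetical, and the sketch tracks plain busy
time rather than the load-times-frequency value the governor records.

```c
#include <assert.h>
#include <stddef.h>

#define SW_MAX_RECORDS 150 /* mirrors the 150-record limit described above */

/* One sampling interval: busy time and elapsed wall time, in microseconds. */
struct sw_sample {
    unsigned int busy_us;
    unsigned int wall_us;
};

/* Ring buffer of the samples that currently fall inside the window. */
struct sw_window {
    struct sw_sample rec[SW_MAX_RECORDS];
    size_t start;             /* index of the oldest sample */
    size_t count;             /* number of valid samples */
    unsigned long total_busy; /* running sum of busy_us */
    unsigned long total_wall; /* running sum of wall_us */
};

static void sw_evict_oldest(struct sw_window *w)
{
    w->total_busy -= w->rec[w->start].busy_us;
    w->total_wall -= w->rec[w->start].wall_us;
    w->start = (w->start + 1) % SW_MAX_RECORDS;
    w->count--;
}

/*
 * Record one sample, slide the window forward, and return the average
 * load (0..100) over the samples that remain inside window_us.
 */
static unsigned int sw_add(struct sw_window *w, unsigned int busy_us,
                           unsigned int wall_us, unsigned int window_us)
{
    size_t pos;

    assert(wall_us > 0 && busy_us <= wall_us);

    if (w->count == SW_MAX_RECORDS) /* buffer full: drop the oldest */
        sw_evict_oldest(w);

    pos = (w->start + w->count) % SW_MAX_RECORDS;
    w->rec[pos].busy_us = busy_us;
    w->rec[pos].wall_us = wall_us;
    w->count++;
    w->total_busy += busy_us;
    w->total_wall += wall_us;

    /* Drop old samples while the rest still spans more than the window. */
    while (w->count > 1 &&
           w->total_wall - w->rec[w->start].wall_us > window_us)
        sw_evict_oldest(w);

    return (unsigned int)(w->total_busy * 100 / w->total_wall);
}
```

With a 30ms window and 10ms samples, one fully busy sample followed by idle
samples makes the average decay gradually instead of dropping to zero at once,
which is the smoothing effect the sampling window is meant to provide.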
The sampling window size can be set by the user; the default (and maximum)
sampling window is one second. When it is set to the default sampling rate,
the governor falls back to the original behaviour.
The sampling window size can also be changed dynamically according to how busy
the system currently is: the more idle, the smaller the sampling window; the
busier, the larger the sampling window. Shrinking the window improves the
response speed, while enlarging it keeps the CPU at high speed when busy and
also avoids the inefficient bouncing between the highest and lowest
frequencies seen in the original ondemand.
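The scaling rule above can be sketched in plain C as follows. This is an
illustration under assumptions, not the patch itself: dynamic_window_us and
its parameter names are hypothetical, and the guard for a zero or
inconsistent wall time is an addition of this sketch.

```c
/*
 * Scale the configured maximum window by the busy percentage observed in
 * the current window, never going below one sampling interval.
 * All times are in microseconds.
 */
static unsigned int dynamic_window_us(unsigned int max_window_us,
                                      unsigned int sampling_rate_us,
                                      unsigned int wall_us,
                                      unsigned int idle_us)
{
    unsigned int busy_pct;
    unsigned int window_us;

    /* No usable history yet: keep the configured window (sketch choice). */
    if (wall_us == 0 || idle_us > wall_us)
        return max_window_us;

    busy_pct = (unsigned int)((unsigned long long)(wall_us - idle_us) *
                              100 / wall_us);
    window_us = (unsigned int)((unsigned long long)max_window_us *
                               busy_pct / 100);

    /* A window shorter than one sample makes no sense; clamp it. */
    if (window_us < sampling_rate_us)
        window_us = sampling_rate_us;

    return window_us;
}
```

For example, with a one-second maximum window and a 10ms sampling rate, a CPU
that was 50% busy gets a 500ms window, and a fully idle CPU falls back to a
single 10ms sample.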
We set up_threshold to 80 and down_differential to 20, so when the workload
reaches 80% of the current frequency's capacity, the governor raises the
frequency to the highest one. When the workload drops below 60%
(up_threshold - down_differential) of the current frequency's capacity, it
lowers the frequency. This ensures that the CPU works above 60% of its current
capacity; otherwise the lowest frequency will be used.
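A minimal sketch of this hysteresis band, assuming load is expressed as a
percentage of the current frequency's capacity and that up_threshold is
larger than down_differential. The names freq_action, classify_load and
lowered_freq_khz are hypothetical; the proportional step-down is modelled on
how ondemand picks a lower frequency and is an assumption here.

```c
enum freq_action { FREQ_RAISE, FREQ_KEEP, FREQ_LOWER };

/* Decide what to do with the frequency for a given load percentage. */
static enum freq_action classify_load(unsigned int load_pct,
                                      unsigned int up_threshold,
                                      unsigned int down_differential)
{
    if (load_pct > up_threshold)
        return FREQ_RAISE;
    if (load_pct < up_threshold - down_differential)
        return FREQ_LOWER;
    return FREQ_KEEP; /* inside the band: leave the frequency alone */
}

/*
 * Pick the frequency at which the same work would sit at the lower edge
 * of the band, clamped to the minimum supported frequency.
 */
static unsigned int lowered_freq_khz(unsigned int cur_khz,
                                     unsigned int load_pct,
                                     unsigned int up_threshold,
                                     unsigned int down_differential,
                                     unsigned int min_khz)
{
    unsigned int target;

    target = (unsigned int)((unsigned long long)cur_khz * load_pct /
                            (up_threshold - down_differential));
    return target < min_khz ? min_khz : target;
}
```

With up_threshold = 80 and down_differential = 20, a load of 85% raises the
frequency, 70% keeps it, and 50% lowers it; a 30% load at 2400000 kHz maps to
1200000 kHz unless the minimum frequency is higher.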
Turbo Mode (P0) consumes much more power than the second-highest frequency
(P1), and the P1 frequency is often double, or even more, the lowest frequency
(Pn). The current logic jumps straight to the highest frequency, Turbo Mode,
when the workload reaches up_threshold of the current frequency's capacity,
even if the current frequency is the lowest one. This patchset first evaluates
whether P1 is enough to support the current workload before entering Turbo
Mode directly. If P1 can meet the workload requirement, this saves power
compared with running in Turbo Mode.
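The "try P1 before turbo" step can be sketched like this. It is an
illustration, not the patch code: p1_freq_khz and next_freq_on_high_load are
hypothetical names, the 1000 kHz gap used to recognise a turbo entry follows
the check in the freq_table.c hunk of this patchset, and the scan is written
so that it does not depend on the order of the frequency table.

```c
#include <stddef.h>

#define TURBO_GAP_KHZ 1000 /* turbo entry assumed to be listed as P1 + 1 MHz */

/*
 * Return the frequency to try before turbo: the second-largest entry if
 * the table looks turbo-enabled, otherwise simply the largest entry.
 */
static unsigned int p1_freq_khz(const unsigned int *table, size_t n)
{
    unsigned int max = 0, sec = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned int f = table[i];

        if (f > max) {
            sec = max; /* the old maximum becomes the runner-up */
            max = f;
        } else if (f < max && f > sec) {
            sec = f;
        }
    }
    return (max - sec == TURBO_GAP_KHZ) ? sec : max;
}

/* On high load, step to P1 first and only then to the real maximum. */
static unsigned int next_freq_on_high_load(unsigned int cur_khz,
                                           unsigned int max_khz,
                                           unsigned int p1_khz)
{
    if (cur_khz >= p1_khz)
        return max_khz; /* already at P1 (or above) and still overloaded */
    return p1_khz;
}
```

On a table without a turbo entry, p1_freq_khz returns the maximum, so the
second function degenerates to the original "jump to max" behaviour.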
On my test platform, a two-socket Westmere-EP server running the well-known
power/performance benchmark, when the workload is low the patched governor
saves power like the powersave governor; when the workload is high, the
patched governor performs as well as the performance governor while consuming
less power. Together with the other patches in this patchset, the patched
governor's power efficiency improves by about 10%, with no apparent decrease
in performance.
Running other benchmarks from phoronix, kernel building saves 5% power with no
decrease in performance. compress-7zip saves 2% power, also with no apparent
decrease in performance. However, the apache benchmark saves power but its
performance decreases a lot.
When the sampling window is set to the default sampling rate, the governor
falls back to the original behaviour. up_threshold and down_differential are
also set back to their original values.
Signed-off-by: Youquan Song <[email protected]>
---
drivers/cpufreq/cpufreq_ondemand.c | 20 +++++++++++++++++---
1 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index 87aec7f..5242d24 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -469,11 +469,24 @@ static ssize_t store_sampling_window(struct kobject *a, struct attribute *b,
if (input > 1000000)
input = 1000000;
- if (input < 10000)
- input = 10000;
+ if (input < dbs_tuners_ins.sampling_rate)
+ input = dbs_tuners_ins.sampling_rate;
mutex_lock(&dbs_mutex);
dbs_tuners_ins.sampling_window = input;
+ /* The user set the sampling window equal to the sampling rate,
+ * which means the user wants to disable the sampling window, so
+ * return to legacy sampling mode.
+ */
+ if (input == dbs_tuners_ins.sampling_rate) {
+ dbs_tuners_ins.up_threshold = MICRO_FREQUENCY_UP_THRESHOLD;
+ dbs_tuners_ins.down_differential =
+ MICRO_FREQUENCY_DOWN_DIFFERENTIAL;
+ } else {
+ dbs_tuners_ins.up_threshold = SAMPLING_WINDOW_UP_THRESHOLD;
+ dbs_tuners_ins.down_differential =
+ SAMPLING_WINDOW_DOWN_DIFFERENTIAL;
+ }
mutex_unlock(&dbs_mutex);
return count;
@@ -720,7 +733,8 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
}
}
- if (sampling_window_enable)
+ if (sampling_window_enable && (dbs_tuners_ins.sampling_window !=
+ dbs_tuners_ins.sampling_rate))
/* Get the average load in the lastest sampling window */
max_load_freq = get_load_freq_during_sampling_window(
this_dbs_info, max_load_freq,
--
1.6.4.2
Turbo Mode (P0) consumes much more power than the second-highest frequency
(P1), and the P1 frequency is often double, or even more, the lowest frequency
(Pn). The current logic jumps straight to the highest frequency, Turbo Mode,
when the workload reaches up_threshold of the current frequency's capacity,
even if the current frequency is the lowest one. This patchset first evaluates
whether P1 is enough to support the current workload before entering Turbo
Mode directly. If P1 can meet the workload requirement, this saves power
compared with running in Turbo Mode.
Signed-off-by: Youquan Song <[email protected]>
---
drivers/cpufreq/cpufreq_ondemand.c | 7 ++++++-
drivers/cpufreq/freq_table.c | 9 +++++++++
include/linux/cpufreq.h | 1 +
3 files changed, 16 insertions(+), 1 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index 85ca136..682b2ea 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -797,7 +797,12 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
if (policy->cur < policy->max)
this_dbs_info->rate_mult =
dbs_tuners_ins.sampling_down_factor;
- dbs_freq_increase(policy, policy->max);
+ /* Before going to turbo, try P1 first */
+ if ((policy->max > policy->sec_max) &&
+ (policy->cur == policy->sec_max))
+ dbs_freq_increase(policy, policy->max);
+ else
+ dbs_freq_increase(policy, policy->sec_max);
return;
}
diff --git a/drivers/cpufreq/freq_table.c b/drivers/cpufreq/freq_table.c
index 0543221..d7dc010 100644
--- a/drivers/cpufreq/freq_table.c
+++ b/drivers/cpufreq/freq_table.c
@@ -26,6 +26,7 @@ int cpufreq_frequency_table_cpuinfo(struct cpufreq_policy *policy,
{
unsigned int min_freq = ~0;
unsigned int max_freq = 0;
+ unsigned int sec_max_freq = 0;
unsigned int i;
for (i = 0; (table[i].frequency != CPUFREQ_TABLE_END); i++) {
@@ -41,10 +42,18 @@ int cpufreq_frequency_table_cpuinfo(struct cpufreq_policy *policy,
min_freq = freq;
if (freq > max_freq)
max_freq = freq;
+ /* Find the second largest frequency */
+ if ((freq < max_freq) && (freq > sec_max_freq))
+ sec_max_freq = freq;
}
policy->min = policy->cpuinfo.min_freq = min_freq;
policy->max = policy->cpuinfo.max_freq = max_freq;
+ /* Check CPU turbo mode enabled */
+ if (max_freq - sec_max_freq == 1000)
+ policy->sec_max = sec_max_freq;
+ else
+ policy->sec_max = max_freq;
if (policy->min == ~0)
return -EINVAL;
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index c3e9de8..0087e56 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -92,6 +92,7 @@ struct cpufreq_policy {
unsigned int min; /* in kHz */
unsigned int max; /* in kHz */
+ unsigned int sec_max; /* in kHz*/
unsigned int cur; /* in kHz, only needed if cpufreq
* governors are used */
unsigned int policy; /* see above */
--
1.6.4.2
Add a sampling_window tunable so that the user can set the sampling window
size through this interface. The default (and maximum) sampling window is one
second.
Signed-off-by: Youquan Song <[email protected]>
---
drivers/cpufreq/cpufreq_ondemand.c | 26 ++++++++++++++++++++++++++
1 files changed, 26 insertions(+), 0 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index e49b2e1..87aec7f 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -295,6 +295,7 @@ show_one(up_threshold, up_threshold);
show_one(sampling_down_factor, sampling_down_factor);
show_one(ignore_nice_load, ignore_nice);
show_one(powersave_bias, powersave_bias);
+show_one(sampling_window, sampling_window);
/*** delete after deprecation time ***/
@@ -455,12 +456,36 @@ static ssize_t store_powersave_bias(struct kobject *a, struct attribute *b,
return count;
}
+static ssize_t store_sampling_window(struct kobject *a, struct attribute *b,
+ const char *buf, size_t count)
+{
+ unsigned int input;
+ int ret;
+ ret = sscanf(buf, "%u", &input);
+
+ if (ret != 1)
+ return -EINVAL;
+
+ if (input > 1000000)
+ input = 1000000;
+
+ if (input < 10000)
+ input = 10000;
+
+ mutex_lock(&dbs_mutex);
+ dbs_tuners_ins.sampling_window = input;
+ mutex_unlock(&dbs_mutex);
+
+ return count;
+}
+
define_one_global_rw(sampling_rate);
define_one_global_rw(io_is_busy);
define_one_global_rw(up_threshold);
define_one_global_rw(sampling_down_factor);
define_one_global_rw(ignore_nice_load);
define_one_global_rw(powersave_bias);
+define_one_global_rw(sampling_window);
static struct attribute *dbs_attributes[] = {
&sampling_rate_max.attr,
@@ -471,6 +496,7 @@ static struct attribute *dbs_attributes[] = {
&ignore_nice_load.attr,
&powersave_bias.attr,
&io_is_busy.attr,
+ &sampling_window.attr,
NULL
};
--
1.6.4.2
Add a down_differential tunable so that the user can adjust the ondemand
governor's frequency-decrease threshold. down_differential is already used in
the original ondemand governor code but is not exposed as a tunable, so this
patch adds it as one.
Signed-off-by: Youquan Song <[email protected]>
---
drivers/cpufreq/cpufreq_ondemand.c | 26 ++++++++++++++++++++++++++
1 files changed, 26 insertions(+), 0 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index 5dd3770..85ca136 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -292,6 +292,7 @@ static ssize_t show_##file_name \
show_one(sampling_rate, sampling_rate);
show_one(io_is_busy, io_is_busy);
show_one(up_threshold, up_threshold);
+show_one(down_differential, down_differential);
show_one(sampling_down_factor, sampling_down_factor);
show_one(ignore_nice_load, ignore_nice);
show_one(powersave_bias, powersave_bias);
@@ -513,9 +514,33 @@ static ssize_t store_window_is_dynamic(struct kobject *a, struct attribute *b,
return count;
}
+static ssize_t store_down_differential(struct kobject *a, struct attribute *b,
+ const char *buf, size_t count)
+{
+ unsigned int input;
+ int ret;
+ ret = sscanf(buf, "%u", &input);
+
+ if (ret != 1)
+ return -EINVAL;
+
+ if (input > 30)
+ input = 30;
+
+ if (input < 0)
+ input = 0;
+
+ mutex_lock(&dbs_mutex);
+ dbs_tuners_ins.down_differential = input;
+ mutex_unlock(&dbs_mutex);
+
+ return count;
+}
+
define_one_global_rw(sampling_rate);
define_one_global_rw(io_is_busy);
define_one_global_rw(up_threshold);
+define_one_global_rw(down_differential);
define_one_global_rw(sampling_down_factor);
define_one_global_rw(ignore_nice_load);
define_one_global_rw(powersave_bias);
@@ -527,6 +552,7 @@ static struct attribute *dbs_attributes[] = {
&sampling_rate_min.attr,
&sampling_rate.attr,
&up_threshold.attr,
+ &down_differential.attr,
&sampling_down_factor.attr,
&ignore_nice_load.attr,
&powersave_bias.attr,
--
1.6.4.2
Add a window_is_dynamic tunable, which enables or disables dynamic adjustment
of the current sampling window size.
This is useful: on my test platform, for example, the kernel build benchmark
achieves better power/performance when the dynamic sampling window is
disabled, while compress-7zip achieves better performance when the dynamic
sampling window is enabled.
Signed-off-by: Youquan Song <[email protected]>
---
drivers/cpufreq/cpufreq_ondemand.c | 23 +++++++++++++++++++++++
1 files changed, 23 insertions(+), 0 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index 5242d24..5dd3770 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -296,6 +296,7 @@ show_one(sampling_down_factor, sampling_down_factor);
show_one(ignore_nice_load, ignore_nice);
show_one(powersave_bias, powersave_bias);
show_one(sampling_window, sampling_window);
+show_one(window_is_dynamic, window_is_dynamic);
/*** delete after deprecation time ***/
@@ -492,6 +493,26 @@ static ssize_t store_sampling_window(struct kobject *a, struct attribute *b,
return count;
}
+static ssize_t store_window_is_dynamic(struct kobject *a, struct attribute *b,
+ const char *buf, size_t count)
+{
+ unsigned int input;
+ int ret;
+ ret = sscanf(buf, "%u", &input);
+
+ if (ret != 1)
+ return -EINVAL;
+
+ if (input != 0)
+ input = 1;
+
+ mutex_lock(&dbs_mutex);
+ dbs_tuners_ins.window_is_dynamic = input;
+ mutex_unlock(&dbs_mutex);
+
+ return count;
+}
+
define_one_global_rw(sampling_rate);
define_one_global_rw(io_is_busy);
define_one_global_rw(up_threshold);
@@ -499,6 +520,7 @@ define_one_global_rw(sampling_down_factor);
define_one_global_rw(ignore_nice_load);
define_one_global_rw(powersave_bias);
define_one_global_rw(sampling_window);
+define_one_global_rw(window_is_dynamic);
static struct attribute *dbs_attributes[] = {
&sampling_rate_max.attr,
@@ -510,6 +532,7 @@ static struct attribute *dbs_attributes[] = {
&powersave_bias.attr,
&io_is_busy.attr,
&sampling_window.attr,
+ &window_is_dynamic.attr,
NULL
};
--
1.6.4.2
Running a well-known power/performance benchmark shows that the current
ondemand governor is not power efficient. Even when the workload is at only
10%~20% of full capacity, the CPU still spends much of its time at the highest
frequency, although in this situation the lowest frequency can often meet the
user's requirement. When I ran this benchmark on a turbo-mode-enabled machine
and compared the results of the different governors, the results of the
ondemand and performance governors were the closest. There is not much power
saving between the ondemand and performance governors. If we ignore that small
power saving, the performance governor is even better than the ondemand
governor, at least in terms of performance.
One potential reason why the ondemand governor is not power efficient is that
it decides the next target frequency from the instantaneous requirement during
one sampling interval (10ms, or possibly a little longer because of the
deferrable timer in tickless idle). The instantaneous requirement responds
quickly to workload changes, but it does not usually reflect the workload's
real CPU usage requirement over a slightly longer period, and it can cause
frequent switching between the highest and lowest frequencies.
This patch adds a sampling window to the per-CPU ondemand thread. Each
sampling window holds at most 150 records; it slides forward every sampling
interval and is used to track the workload requirement during the latest
sampling window timeframe.
The average workload during the latest sampling window is used to decide the
next target frequency. The sampling window is intended to reflect the
workload's CPU usage requirement over the recent past more truthfully.
The sampling window size can also be changed dynamically according to how busy
the system currently is: the more idle, the smaller the sampling window; the
busier, the larger the sampling window. Shrinking the window improves the
response speed, while enlarging it keeps the CPU at high speed when busy and
also avoids the inefficient bouncing between the highest and lowest
frequencies seen in the original ondemand.
The patch sets up_threshold to 80 and down_differential to 20, so when the
workload reaches 80% of the current frequency's capacity, the governor raises
the frequency to the highest one. When the workload drops below 60%
(up_threshold - down_differential) of the current frequency's capacity, it
lowers the frequency. This ensures that the CPU works above 60% of its current
capacity; otherwise the lowest frequency will be used.
On my test platform, a two-socket Westmere-EP server running the well-known
power/performance benchmark, when the workload is low the patched governor
saves power like the powersave governor; when the workload is high, the
patched governor performs as well as the performance governor while consuming
less power. Together with the other patches in this patchset, the patched
governor's power efficiency improves by about 10%, with no apparent decrease
in performance.
Running other benchmarks from phoronix, kernel building saves 5% power with no
decrease in performance. compress-7zip saves 2% power, also with no apparent
decrease in performance. However, the apache benchmark saves power but its
performance decreases a lot.
Signed-off-by: Youquan Song <[email protected]>
---
drivers/cpufreq/cpufreq_ondemand.c | 177 ++++++++++++++++++++++++++++++++++-
1 files changed, 171 insertions(+), 6 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index c631f27..e49b2e1 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -22,6 +22,7 @@
#include <linux/tick.h>
#include <linux/ktime.h>
#include <linux/sched.h>
+#include <linux/slab.h>
/*
* dbs is used in this file as a shortform for demandbased switching
@@ -37,6 +38,14 @@
#define MICRO_FREQUENCY_MIN_SAMPLE_RATE (10000)
#define MIN_FREQUENCY_UP_THRESHOLD (11)
#define MAX_FREQUENCY_UP_THRESHOLD (100)
+/*Default sampling window : 1 second */
+#define DEF_SAMPLING_WINDOW (1000000)
+
+/* Max number of history records */
+#define MAX_LOAD_RECORD_NUM (150)
+
+#define SAMPLING_WINDOW_UP_THRESHOLD (80)
+#define SAMPLING_WINDOW_DOWN_DIFFERENTIAL (20)
/*
* The polling frequency of this governor depends on the capability of
@@ -73,6 +82,13 @@ struct cpufreq_governor cpufreq_gov_ondemand = {
/* Sampling types */
enum {DBS_NORMAL_SAMPLE, DBS_SUB_SAMPLE};
+/* Sampling record */
+struct load_record {
+ unsigned long load_freq;
+ unsigned int wall_time;
+ unsigned int idle_time;
+};
+
struct cpu_dbs_info_s {
cputime64_t prev_cpu_idle;
cputime64_t prev_cpu_iowait;
@@ -81,6 +97,13 @@ struct cpu_dbs_info_s {
struct cpufreq_policy *cur_policy;
struct delayed_work work;
struct cpufreq_frequency_table *freq_table;
+ struct load_record *lr; /* Load history record */
+ unsigned long total_load; /* Sum of load in sampling window */
+ unsigned int total_wtime; /* Sum of time in sampling window */
+ unsigned int total_itime; /* Sum of idle time in sampling window*/
+ unsigned int start_p; /* Start position of sampling window */
+ unsigned int cur_p; /* Current position of sampling window*/
+ unsigned int cur_sw; /* Current sampling window size */
unsigned int freq_lo;
unsigned int freq_lo_jiffies;
unsigned int freq_hi_jiffies;
@@ -97,6 +120,7 @@ struct cpu_dbs_info_s {
static DEFINE_PER_CPU(struct cpu_dbs_info_s, od_cpu_dbs_info);
static unsigned int dbs_enable; /* number of CPUs using this policy */
+static unsigned int sampling_window_enable; /* only use in HW_ALL */
/*
* dbs_mutex protects data in dbs_tuners_ins from concurrent changes on
@@ -114,12 +138,16 @@ static struct dbs_tuners {
unsigned int sampling_down_factor;
unsigned int powersave_bias;
unsigned int io_is_busy;
+ unsigned int sampling_window;
+ unsigned int window_is_dynamic;
} dbs_tuners_ins = {
.up_threshold = DEF_FREQUENCY_UP_THRESHOLD,
.sampling_down_factor = DEF_SAMPLING_DOWN_FACTOR,
.down_differential = DEF_FREQUENCY_DOWN_DIFFERENTIAL,
.ignore_nice = 0,
.powersave_bias = 0,
+ .sampling_window = DEF_SAMPLING_WINDOW,
+ .window_is_dynamic = 1,
};
static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
@@ -501,9 +529,79 @@ static void dbs_freq_increase(struct cpufreq_policy *p, unsigned int freq)
CPUFREQ_RELATION_L : CPUFREQ_RELATION_H);
}
+/* Change the sampling window dynamically according to how busy the
+ * workload is: the more idle, the smaller the sampling window, in
+ * proportion to the configured sampling window.
+ */
+static unsigned int get_dynamic_sampling_window(struct cpu_dbs_info_s *dbs)
+{
+ unsigned int sampling_window = 0;
+ unsigned int busy_rate = 0;
+
+ if (dbs_tuners_ins.window_is_dynamic) {
+ busy_rate = (dbs->total_wtime - dbs->total_itime)
+ * 100 / dbs->total_wtime;
+
+ sampling_window = (dbs_tuners_ins.sampling_window * busy_rate)
+ / 100;
+
+ if (sampling_window < dbs_tuners_ins.sampling_rate)
+ sampling_window = dbs_tuners_ins.sampling_rate;
+ } else
+ sampling_window = dbs_tuners_ins.sampling_window;
+
+ return sampling_window;
+}
+
+/* Get the average load during one sampling window */
+static unsigned long get_load_freq_during_sampling_window(
+ struct cpu_dbs_info_s *this_dbs_info, unsigned long load_freq,
+ unsigned int wall_time, unsigned int idle_time)
+{
+
+ unsigned int cur_p = 0, start_p = 0;
+
+ cur_p = this_dbs_info->cur_p;
+ start_p = this_dbs_info->start_p;
+ /* Record current sampling result */
+ this_dbs_info->lr[cur_p].load_freq = load_freq;
+ this_dbs_info->lr[cur_p].wall_time = wall_time;
+ this_dbs_info->lr[cur_p].idle_time = idle_time;
+ /* Cumulate records in sampling windows */
+ this_dbs_info->total_load += load_freq;
+ this_dbs_info->total_wtime += wall_time;
+ this_dbs_info->total_itime += idle_time;
+ this_dbs_info->cur_p = (cur_p + 1) % MAX_LOAD_RECORD_NUM;
+
+ /* Dynamically get the sampling window if window_is_dynamic is set */
+ this_dbs_info->cur_sw = get_dynamic_sampling_window(this_dbs_info);
+
+ /* Find work load during the lastest sampling window */
+ while (this_dbs_info->total_wtime - this_dbs_info->lr[start_p].wall_time
+ > this_dbs_info->cur_sw) {
+
+ this_dbs_info->total_wtime -=
+ this_dbs_info->lr[start_p].wall_time;
+ this_dbs_info->total_itime -=
+ this_dbs_info->lr[start_p].idle_time;
+ this_dbs_info->total_load -=
+ this_dbs_info->lr[start_p].load_freq;
+ start_p = (start_p + 1) % MAX_LOAD_RECORD_NUM;
+ this_dbs_info->start_p = start_p;
+ }
+
+ /* Get the average load in the lastest sampling window */
+ load_freq = this_dbs_info->total_load / this_dbs_info->total_wtime;
+
+ load_freq *= 100;
+ return load_freq;
+}
+
static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
{
- unsigned int max_load_freq;
+ unsigned long max_load_freq;
+ unsigned int max_wall_time;
+ unsigned int max_idle_time;
struct cpufreq_policy *policy;
unsigned int j;
@@ -525,12 +623,14 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
/* Get Absolute Load - in terms of freq */
max_load_freq = 0;
+ max_wall_time = 0;
+ max_idle_time = 0;
for_each_cpu(j, policy->cpus) {
struct cpu_dbs_info_s *j_dbs_info;
cputime64_t cur_wall_time, cur_idle_time, cur_iowait_time;
+ unsigned long load_freq, load;
unsigned int idle_time, wall_time, iowait_time;
- unsigned int load, load_freq;
int freq_avg;
j_dbs_info = &per_cpu(od_cpu_dbs_info, j);
@@ -580,17 +680,28 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
if (unlikely(!wall_time || wall_time < idle_time))
continue;
- load = 100 * (wall_time - idle_time) / wall_time;
+ load = wall_time - idle_time;
freq_avg = __cpufreq_driver_getavg(policy, j);
if (freq_avg <= 0)
freq_avg = policy->cur;
load_freq = load * freq_avg;
- if (load_freq > max_load_freq)
+ if (load_freq > max_load_freq) {
max_load_freq = load_freq;
+ max_wall_time = wall_time;
+ max_idle_time = idle_time;
+ }
}
+ if (sampling_window_enable)
+ /* Get the average load in the lastest sampling window */
+ max_load_freq = get_load_freq_during_sampling_window(
+ this_dbs_info, max_load_freq,
+ max_wall_time, max_idle_time);
+ else
+ max_load_freq = (100 * max_load_freq) / max_wall_time;
+
/* Check for frequency increase */
if (max_load_freq > dbs_tuners_ins.up_threshold * policy->cur) {
/* If switching to max speed, apply sampling_down_factor */
@@ -713,6 +824,54 @@ static int should_io_be_busy(void)
return 0;
}
+/* Initialize dbs_info struct */
+static int dbs_info_init(struct cpu_dbs_info_s *this_dbs_info,
+ struct cpufreq_policy *policy, unsigned int cpu)
+{
+ this_dbs_info->cpu = cpu;
+ this_dbs_info->rate_mult = 1;
+ /* Sampling windows only used in HW_ALL coordination */
+ if (cpumask_weight(policy->cpus) > 1)
+ return 0;
+
+ this_dbs_info->start_p = 0;
+ this_dbs_info->cur_p = 1;
+ this_dbs_info->total_wtime = 0;
+ this_dbs_info->total_itime = 0;
+ this_dbs_info->total_load = 0;
+ /* Initiate the load record */
+ this_dbs_info->lr = kmalloc(sizeof(struct load_record) *
+ (MAX_LOAD_RECORD_NUM), GFP_KERNEL);
+ if (!this_dbs_info->lr) {
+ printk(KERN_ERR "Malloc DBS load record failed\n");
+ return -ENOMEM;
+ }
+
+ this_dbs_info->lr[0].load_freq = 0;
+ this_dbs_info->lr[0].wall_time = 0;
+ this_dbs_info->lr[0].idle_time = 0;
+ sampling_window_enable = 1;
+ dbs_tuners_ins.up_threshold = SAMPLING_WINDOW_UP_THRESHOLD;
+ dbs_tuners_ins.down_differential = SAMPLING_WINDOW_DOWN_DIFFERENTIAL;
+ return 0;
+
+}
+
+
+/* Free the load record buffer */
+static void destroy_dbs_info(void)
+{
+ struct cpu_dbs_info_s *dbs_info = NULL;
+ int i;
+ if (!sampling_window_enable)
+ return;
+
+ for_each_online_cpu(i) {
+ dbs_info = &per_cpu(od_cpu_dbs_info, i);
+ kfree(dbs_info->lr);
+ }
+}
+
static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
unsigned int event)
{
@@ -749,8 +908,13 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
kstat_cpu(j).cpustat.nice;
}
}
- this_dbs_info->cpu = cpu;
- this_dbs_info->rate_mult = 1;
+
+ rc = dbs_info_init(this_dbs_info, policy, cpu);
+ if (rc) {
+ mutex_unlock(&dbs_mutex);
+ return rc;
+ }
+
ondemand_powersave_bias_init_cpu(cpu);
/*
* Start the timerschedule work, when this governor
@@ -854,6 +1018,7 @@ static void __exit cpufreq_gov_dbs_exit(void)
{
cpufreq_unregister_governor(&cpufreq_gov_ondemand);
destroy_workqueue(kondemand_wq);
+ destroy_dbs_info();
}
--
1.6.4.2
On Thu, Dec 23, 2010 at 01:50:45AM -0500, Niemi, David wrote:
> The ondemand governor does tend to go all or nothing with respect to CPU
> frequency. That is not entirely laziness, it has some logic to compute
> optimum frequency but doesn't generally use it. There is some evidence
> intermediate frequencies are a waste of effort.
Thanks a lot, David! Merry Christmas!
That's true, intermediate frequencies are a waste of effort. Pn ~ P1 actually
do not differ much in terms of power saving.
The patchset does not add intermediate frequencies, except that I add a P1
evaluation before going directly to P0. I know that P0 (turbo mode) consumes
much more power than P1, while the P1 frequency is often double the Pn
(lowest) frequency: a 100% workload at Pn only reaches 50% of P1's capacity.
So if P1 can meet the current workload requirement, it will save some power.
I actually tested it: the 6th patch, "Evaluate P1 before enter turbo mode",
does save a little power, but not much.
> Please consider a couple of things:
> 1) Most Intel CPUs do most of their power savings through C-states, not
> by reducing clock frequency. That may have something to do with why you
> see modest power savings between ondemand and performance. Recent AMD
> CPUs, on the other hand, rely a lot more on reducing clock frequency to
> save power. Down the road, we'll need to be doing both effectively.
> But even going to the very lowest clock frequency on a Nehalem EP will
> not save very much power -- and increased use of intermediate
> frequencies will help less. That said, minimizing turbo boost usage
> will likely save quite a bit of power (at the expense of reduced
> performance).
If the system is truly busy, my patchset will increase turbo mode usage and
reduce the bouncing between P0 and Pn. If the system is truly idle, apart from
momentary busy spikes, it will decrease turbo mode usage.
>
> It would definitely be nice to see results on a variety of modern CPUs
> for a major patch like this.
I have no such test environment. Can anyone help?
> 2) Please consider the case where per performance really does matter
> when heavy loads are present, but we'd like to save power when the
> system is lightly loaded. This is different from the laptop case, where
> saving power under load is probably as important as the performance, and
> if you are truly idle you are turning things off altogether. Your claim
> of matching the performance governor's performance is a great aspiration
> but it'll need to be demonstrated on a variety of CPUs and workloads,
> this is not usually easy to accomplish.
>
In essence, my patchset only adds a sampling window to the ondemand governor,
which enlarges the 10ms sampling period to a specified sampling window
timeframe. It reflects whether the system workload is truly busy or idle
rather than a momentary phenomenon, which avoids momentary busyness causing
frequent switches between turbo mode and the lowest frequency.
The patchset also provides a dynamic sampling window function, which
automatically adjusts the sampling window size according to how busy or idle
the current workload is. When the system is idle the sampling window is
smaller, so it responds quickly to momentary requirements; when the system is
busy the sampling window is bigger, so the CPU keeps working at a high
frequency without adjusting the frequency for momentary idleness.
When the sampling window equals the sampling rate (10ms), the governor falls
back and works the same as the original ondemand governor.
When the workload is high (> 80%), the patched ondemand behaves very much like
performance: because its sampling window is large, the average workload during
the sampling window will almost always stay above 60%, so the high frequency
is used continuously.
Maybe you can test it yourself. If you have questions or comments, please
contact me.
Thanks
-Youquan
Hey,
On Thu, Dec 23, 2010 at 02:23:38PM +0800, Youquan Song wrote:
> Running a well-known power performance benchmark, current ondemand governor is
> not power efficiency. Even when workload is at 10%~20% of full capability, the
> CPU will also run much of time at highest frequency. In fact, in this situation,
> the lowest frequency often can meet user requirement. When running this
> benchmark on turbo mode enable machine, I compare the result of different
> governors, the results of ondemand and performance governors are the closest.
> There is no much power saving between ondemand and performance governor. If we
> can ignore the little power saving, the perfomance governor even better than
> ondemand governor, at leaset for better performance.
>
> One potential reason for ondemand governor is not power efficiency is that
> ondemand governor decide the next target frequency by instant requirement during
> sampling interval (10ms or possible a little longer for deferrable timer in idle
> tickless). The instant requirement can response quickly to workload change, but
> it does not usually reflect workload real CPU usage requirement in a small
> longer time and it possibly causes frequently change between highest and lowest
> frequency.
>
> This patchset add a sampling window for percpu ondemand thread. Each sampling
> window with max 150 record items which slide every sampling interval and use to
> track the workload requirement during latest sampling window timeframe.
> The average of workload during latest sample windows will be used to decide next
> target frequency. The sampling window targets to be more truly reflects workload
> requirement of CPU usage.
>
> The sampling window size can be set by user and default max sampling window
> is one second. When it is set to default sampling rate, the sampling window will
> roll back to original behaviour.
>
> The sampling window size also can be dynamicly changed in according to current
> system workload busy situation. The more idle, the smaller sampling window; the
> more busy, the larger sampling window. It will increase the respnose speed by
> decrease sampling window, while it will keep CPU working at high speed when busy
> by increase sampling window and also avoid unefficiently dangle between highest
> and lowest frequency in original ondemand.
>
> We set to up_threshold to 80 and down_differential to 20, so when workload reach
> 80% of current frequency, it will increase to highest frequency. When workload
> decrease to below (up_threshold - down_differential)60% of current frequency
> capability, it will decrease the frequency, which ensure that CPU work above 60%
> of its current capability, otherwise lowest frequency will be used.
Interesting approach, but seems to be quite different from what "ondemand"
does at the moment. And, as David Niemi pointed out, it seems to be more
Intel-specific. Therefore, what do you think about adding this different
algorithm as a different governor, and keep the "ondemand" algorithm more or
less as it is?
Best,
Dominik
Hey,
On Thu, Dec 23, 2010 at 02:23:44PM +0800, Youquan Song wrote:
> diff --git a/drivers/cpufreq/freq_table.c b/drivers/cpufreq/freq_table.c
> index 0543221..d7dc010 100644
> --- a/drivers/cpufreq/freq_table.c
> +++ b/drivers/cpufreq/freq_table.c
> @@ -26,6 +26,7 @@ int cpufreq_frequency_table_cpuinfo(struct cpufreq_policy *policy,
> {
> unsigned int min_freq = ~0;
> unsigned int max_freq = 0;
> + unsigned int sec_max_freq = 0;
> unsigned int i;
>
> for (i = 0; (table[i].frequency != CPUFREQ_TABLE_END); i++) {
> @@ -41,10 +42,18 @@ int cpufreq_frequency_table_cpuinfo(struct cpufreq_policy *policy,
> min_freq = freq;
> if (freq > max_freq)
> max_freq = freq;
> + /* Find the second largest frequency */
> + if ((freq < max_freq) && (freq > sec_max_freq))
> + sec_max_freq = freq;
> }
>
> policy->min = policy->cpuinfo.min_freq = min_freq;
> policy->max = policy->cpuinfo.max_freq = max_freq;
> + /* Check CPU turbo mode enabled */
> + if (max_freq - sec_max_freq == 1000)
> + policy->sec_max = sec_max_freq;
> + else
> + policy->sec_max = max_freq;
>
> if (policy->min == ~0)
> return -EINVAL;
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index c3e9de8..0087e56 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -92,6 +92,7 @@ struct cpufreq_policy {
>
> unsigned int min; /* in kHz */
> unsigned int max; /* in kHz */
> + unsigned int sec_max; /* in kHz*/
> unsigned int cur; /* in kHz, only needed if cpufreq
> * governors are used */
> unsigned int policy; /* see above */
NACK. First of all, why is it only a "turbo mode" if it's 1000 kHz
difference? Second, I don't like to put such an additional level into the
generic cpufreq code -- it just looks to be too chipset/CPU-specific. Third,
it isn't open to different turbo modes, e.g. if future CPUs offer a
"super-turbo", a "turbo" and a "semi-turbo" mode. Finally, if (certain)
governors really really need to become aware of the individual
frequency steps -- something we avoided in the past -- we could extend
struct cpufreq_policy to optionally(!) contain a reference to the struct
cpufreq_frequency_table array. And add a "power_usage" parameter to that
struct, which could then be evaluated by the governor.
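
To make the last suggestion concrete, here is a rough user-space sketch of a policy that optionally carries a frequency table with a per-step "power_usage" value. Everything in it is an assumption for illustration: `struct freq_step`, `struct policy_sketch`, `TABLE_END` and `pick_freq()` are hypothetical and are not the kernel's struct cpufreq_policy or struct cpufreq_frequency_table, and the selection rule (highest frequency within a power budget) is only one way a governor might evaluate the parameter.

```c
#include <stddef.h>

#define TABLE_END (~0u)  /* hypothetical end marker, standing in for CPUFREQ_TABLE_END */

/* Hypothetical frequency step with an attached relative power cost. */
struct freq_step {
	unsigned int frequency;    /* in kHz */
	unsigned int power_usage;  /* assumed unit: arbitrary, higher = more power */
};

/* Hypothetical policy: the table pointer is optional and may be NULL. */
struct policy_sketch {
	unsigned int min;               /* in kHz */
	unsigned int max;               /* in kHz */
	const struct freq_step *table;  /* optional reference to the steps */
};

/*
 * Return the highest frequency within [min, max] whose power_usage fits
 * the budget. Without a table the governor knows nothing about the
 * individual steps and simply falls back to policy->max.
 */
static unsigned int pick_freq(const struct policy_sketch *p,
			      unsigned int power_budget)
{
	unsigned int best = p->min;
	size_t i;

	if (!p->table)
		return p->max;

	for (i = 0; p->table[i].frequency != TABLE_END; i++) {
		const struct freq_step *s = &p->table[i];

		if (s->frequency < p->min || s->frequency > p->max)
			continue;
		if (s->power_usage <= power_budget && s->frequency > best)
			best = s->frequency;
	}
	return best;
}
```

The point of the sketch is that the governor never compares frequencies against a magic 1000 kHz gap; any number of "turbo" levels can be described by the driver through the per-step value.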
Best,
Dominik
> Interesting approach, but seems to be quite different from what "ondemand"
> does at the moment. And, as David Niemi pointed out, it seems to be more
> Intel-specific. Therefore, what do you think about adding this different
> algorithm as a different governor, and keep the "ondemand" algorithm more or
> less as it is?
Good suggestion! I will write a separate governor soon if nobody
objects.
Thanks
-Youquan
On Thu, Dec 23, 2010 at 11:57:30AM +0100, Dominik Brodowski wrote:
> NACK. First of all, why is it only a "turbo mode" if it's 1000 kHz
> difference?
I believe that that's how it's supposed to be defined for Intel systems,
but you're right that this doesn't belong in generic code. AMD have
support for enabling/disabling their equivalent functionality through
sysfs - I'd say that copying that interface and using it to limit the
set of p-states provided to the core makes more sense.
--
Matthew Garrett | [email protected]
We've generally been assuming (rightly or wrongly) that getting into
deep C states is preferable to being active (even at a lower frequency),
and so the current behaviour of tending to rapidly switch to the maximum
P state isn't inherently a problem. What kind of power savings are you
benchmarking with this, and do you still see a saving if you just
disable turbo mode?
--
Matthew Garrett | [email protected]
On Thu, Dec 23, 2010 at 02:42:20PM +0000, Matthew Garrett wrote:
> We've generally been assuming (rightly or wrongly) that getting into
> deep C states is preferable to being active (even at a lower frequency),
> and so the current behaviour of tending to rapidly switch to the maximum
> P state isn't inherently a problem. What kind of power savings are you
> benchmarking with this, and do you still see a saving if you just
> disable turbo mode?
Thanks a lot, Matthew!
With the well-known power and performance benchmark, performance per watt
improves by around 10%, with no drop in performance.
Other benchmarks: kernel building and compress-7zip save 2%~5% power,
also with no drop in performance.
The exception was the apache benchmark, where performance drops without
much power saving. This shows that how much power is saved depends on
the workload itself.
I also think this patchset saves power when the workload is
intermittently idle and, during the idle periods, is purely idle rather
than waiting for something to happen. A low frequency fills up these
idle periods with execution, whereas the turbo frequency executes
quickly, consuming much more power than the low frequency, and then sits
purely idle. In this situation, using the low frequency saves power.
But if the workload is not purely idle and we lower the frequency, we
sacrifice performance without saving power.
In this situation, I will try to decrease the sampling window to the
sampling rate (10ms), which rolls back to the same behaviour as the
original ondemand. I need to investigate further how to tell whether an
idle period is purely idle or not. Do you have any ideas about it?
I ran the benchmark in two configurations: one with the userspace governor
and all CPUs set to the P1 frequency, the other with the powersave
governor. There is not much difference between the two results, which
suggests there is little value in tuning between Pn and P1 (non-turbo
mode).
When I compare the userspace result with all CPUs at P1 against the
performance governor (turbo mode), however, there is a lot of room for
tuning. That was my original motivation for this patchset.
Thanks
-Youquan
On Thu, Dec 23, 2010 at 12:00:20PM +0100, Dominik Brodowski wrote:
> Interesting approach, but seems to be quite different from what "ondemand"
> does at the moment. And, as David Niemi pointed out, it seems to be more
> Intel-specific. Therefore, what do you think about adding this different
> algorithm as a different governor, and keep the "ondemand" algorithm more or
> less as it is?
I'm hesitant to merge more governors. (We already have too many imo).
The userspace logic for automatically deciding which is the best one to use is
already pretty hairy, so any additional ones at this point would have to be
accompanied with some really compelling reasons why the existing ones can't
be fixed in an acceptable manner.
Dave
On Thu, Dec 23, 2010 at 6:38 AM, Matthew Garrett <[email protected]> wrote:
> On Thu, Dec 23, 2010 at 11:57:30AM +0100, Dominik Brodowski wrote:
>
>> NACK. First of all, why is it only a "turbo mode" if it's 1000 kHz
>> difference?
>
> I believe that that's how it's supposed to be defined for Intel systems,
> but you're right that this doesn't belong in generic code. AMD have
> support for enabling/disabling their equivalent functionality through
> sysfs - I'd say that copying that interface and using it to limit the
> set of p-states provided to the core makes more sense.
>
If this 1000kHz hack is needed, it should be in the acpi-cpufreq driver,
along with an Intel CPU and turbo mode capability check.
I had a change earlier and I don't think I ever pushed it out. But, it
was doing the max freq transition on turbo capable CPUs in 2 steps.
Something like:
- cpu is in one of the low freqs.
- ondemand asks for highest freq.
- acpi-cpufreq will check whether the CPU is turbo capable and will
only switch to non-turbo peak freq as first step.
- If ondemand asks for highest freq again, acpi-cpufreq will then
switch to turbo freq.
Something like that would probably help the problem here?
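
The steps above can be sketched as a small state machine. This is only a user-space illustration under stated assumptions, not the unpublished change and not acpi-cpufreq code: `struct two_step_state` and `two_step_target()` are hypothetical names, and the turbo and non-turbo peak frequencies are assumed to be known to the driver already.

```c
/* Hypothetical per-CPU driver state for the two-step transition. */
struct two_step_state {
	unsigned int turbo_freq;     /* in kHz, highest (turbo) entry */
	unsigned int nonturbo_peak;  /* in kHz, highest non-turbo entry */
	unsigned int cur;            /* in kHz, current frequency */
	int turbo_capable;           /* nonzero if the CPU supports turbo */
};

/*
 * Translate the governor's request into the frequency actually set.
 * A request for the highest frequency coming from below the non-turbo
 * peak is first satisfied with the non-turbo peak; only a repeated
 * request moves on to the turbo frequency.
 */
static unsigned int two_step_target(struct two_step_state *s,
				    unsigned int requested)
{
	if (s->turbo_capable && requested >= s->turbo_freq) {
		if (s->cur < s->nonturbo_peak)
			requested = s->nonturbo_peak;  /* step 1 */
		else
			requested = s->turbo_freq;     /* step 2 */
	}
	s->cur = requested;
	return requested;
}
```

On a CPU that is not turbo capable the request passes through unchanged, so the behaviour seen by ondemand would differ only on turbo-capable parts.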
Thanks,
Venki
Hey,
On Thu, Dec 23, 2010 at 12:34:02PM -0500, Dave Jones wrote:
> On Thu, Dec 23, 2010 at 12:00:20PM +0100, Dominik Brodowski wrote:
> > Interesting approach, but seems to be quite different from what "ondemand"
> > does at the moment. And, as David Niemi pointed out, it seems to be more
> > Intel-specific. Therefore, what do you think about adding this different
> > algorithm as a different governor, and keep the "ondemand" algorithm more or
> > less as it is?
>
> I'm hesitant to merge more governors. (We already have too many imo).
AFAICS, we do have two in-kernel dynamic frequency scaling governors, one of
which doesn't seem to get much usage (conservative)...
> The userspace logic for automatically deciding which is the best one to use is
> already pretty hairy, so any additional ones at this point would have to be
> accompanied with some really compelling reasons why the existing ones can't
> be fixed in an acceptable manner.
Well, if the underlying algorithm is fundamentally different -- such as when
looking at sampling windows instead of the past one window -- this seems to
be a "compelling" reason to me. Especially if one governor works well on
certain platforms, and the other one works better on other platforms.
Best,
Dominik
Hey,
On Thu, Dec 23, 2010 at 10:13:54AM -0800, Venkatesh Pallipadi wrote:
> On Thu, Dec 23, 2010 at 6:38 AM, Matthew Garrett <[email protected]> wrote:
> > On Thu, Dec 23, 2010 at 11:57:30AM +0100, Dominik Brodowski wrote:
> >
> >> NACK. First of all, why is it only a "turbo mode" if it's 1000 kHz
> >> difference?
> >
> > I believe that that's how it's supposed to be defined for Intel systems,
> > but you're right that this doesn't belong in generic code. AMD have
> > support for enabling/disabling their equivalent functionality through
> > sysfs - I'd say that copying that interface and using it to limit the
> > set of p-states provided to the core makes more sense.
> >
>
> If this 1000kHz hack is needed, it should be in acpi-cpufreq driver
> along with Intel CPU and Turbo mode capability check..
> I had a change earlier and I don't think I ever pushed it out. But, it
> was doing the max freq transition on turbo capable CPUs in 2 steps.
> Something like:
> - cpu is in one of the low freqs.
> - ondemand asks for highest freq.
> - acpi-cpufreq will check whether the CPU is turbo capable and will
> only switch to non-turbo peak freq as first step.
> - If ondemand asks for highest freq again, acpi-cpufreq will then
> switch to turbo freq.
>
> Something like that would probably help the problem here?
then we'd put policy in two places, instead of letting the decision be made
at one place. So I'd not favour such an approach... What should indeed be in
acpi-cpufreq is to mark certain frequency steps as being not power
efficient, though.
Best,
Dominik
>>>>> "DB" == Dominik Brodowski <[email protected]> writes:
DB> AFAICS, we have two in-kernel dynamic frequency scaling governors, one of
DB> which doesn't seem to get much usage (conservative)...
Isn't conservative the recommended choice for AMD processors?
Or has that changed?
-JimC
--
James Cloos <[email protected]> OpenPGP: 1024D/ED7DAEA6