This patchset adds a new cpufreq governor named LAB (Legacy Application
Boost). The governor is based on the ondemand governor.
** Introduce LAB (Legacy Application Boost) governor
<<Purpose>>
One of the problems of ondemand is that it considers only the busiest
cpu and does not care how many cpus are busy at the moment. This may
result in unnecessary power consumption, which is critical for a system
with a limited power source.

To get the best energy efficiency, the LAB governor considers not only
idle time but also the number of idle cpus. It primarily focuses on
supplying adequate performance to users within the limited resources.
It checks the number of idle cpus and then controls the maximum
frequency dynamically. It also applies a different frequency-increase
step depending on that information. In simple terms, the fewer busy
cpus there are, the better the performance that is given.

In addition, using the LAB governor also makes the system's power
consumption in the busy state more stable. This helps to manage and
estimate power consumption on a given system.
<<Algorithm>>
- Counting the idle cpus :
The number of idle cpus is determined from historical results. Every
time the cpufreq timer fires, the governor stores each cpu's idle
usage, derived by dividing the cpu's idling time by the sampling
period, and calculates the average idle usage over the most recently
stored samples. The idle time comes from the cpu state usage
information in the per-cpu cpuidle_device. However, because that state
usage information is updated only when a cpu exits the idle state, it
may differ from reality while the cpu is idle at checking time. To
detect whether a cpu is currently idling, the governor uses the idle
enter/exit timestamps of the cpuidle devices, and it re-calculates the
idle state usage for the following cases.
There are three possible cases in which a cpu is idle at checking time
(the shaded sections represent staying in idle):
1) During the last period, the idle state was broken more than once.
|____________|__ _____|_______ |
|////////////|//|____|/////|//____________|
t1 t2 @ t4
2) During the last period no idle state was entered; the current idle
is the first one.
| | ________|_______ |
|____________|____|////////|////__________|
t1 t2 @ t4
3) During the whole last period, the core has been in the idle state.
| ______|_____________|________ |
|_____|//////|/////////////|////__________|
t1 t2 @ t4
(@ : Current checking point)
After calculating the idle state usage, the governor decides whether a
cpu was idle in the last period by comparing against an idle threshold:
if the average idle usage is bigger than the threshold, the cpu is
treated as an idle cpu. For testing, I set the default threshold to 90.
A minimal sketch of this classification step follows.
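For illustration, a simplified sketch of that classification (MAX_HIST
and IDLE_THRESHOLD match the defaults in this patch; the function name
is hypothetical, and it assumes 1 <= n <= MAX_HIST):

#define MAX_HIST	10
#define IDLE_THRESHOLD	90

/* Average the last n idle-usage samples (0..100) of one cpu and
 * report whether the cpu should be counted as idle. */
static int cpu_was_idle(const unsigned int *idle_hist, int n)
{
	int i, sum = 0;

	for (i = 0; i < n; i++)
		sum += idle_hist[i];

	return (sum / n) > IDLE_THRESHOLD;
}

This mirrors cpu_idle_calc_avg() and the threshold comparison in
lb_check_cpu() in the patch below.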
- Limiting the maximum frequency :
Using the number of idle cpus obtained in the counting phase, the
governor sets the maximum frequency for the next period. By default it
limits the current policy's maximum frequency by 0 to 35%, depending on
the number of idle cpus, as sketched below.
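A sketch of the limiting step, using the default per-idle-cpu limit
table from this patch (indexed by the number of idle cpus on a quad
core; the helper name is hypothetical):

static unsigned int lab_limit_rate[5] = {35, 20, 10, 0, 0};

/* Scale the policy maximum down by the percentage chosen from the
 * number of idle cpus. */
static unsigned int lab_max_freq(unsigned int policy_max, int idle_cpus)
{
	return policy_max * (100 - lab_limit_rate[idle_cpus]) / 100;
}

With 0 idle cpus (all busy) the maximum is cut by 35%; with 3 or more
idle cpus it is not limited at all.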
- Setting the next frequency :
The LAB governor changes the frequency step by step, like the
conservative governor. However, in the LAB governor the step changes
dynamically depending on how many cpus are idle:
next_freq = current_freq + current_idle_cpus * increasing_step;
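In the patch itself (lb_check_cpu()), the accumulating step looks
roughly like this, with FREQ_STEP defaulting to 50000 (kHz) and _max
being the limited maximum from the previous phase:

	if (!idle_cpus)
		inc += FREQ_STEP / 2;
	else
		inc += FREQ_STEP * idle_cpus;

	freq_next = min(policy->cur + inc, _max);

So even with no idle cpus the frequency still creeps up, just at half
the normal step.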
<<To do>>
The prototype of this feature was developed as a cpuidle driver, so for
now it uses cpuidle framework information. I'd like to use the data of
the per-cpu tick_sched variable instead; it has exactly the information
I need, but it can't be accessed from the cpufreq side.
I tested this patch on a Pegasus Quad board.
Any comments are welcome.
Jonghwa Lee (2):
cpuidle: Add idle enter/exit time stamp for notifying current idle
state.
cpufreq: Introduce new cpufreq governor, LAB(Legacy Application
Boost).
drivers/cpufreq/Kconfig | 26 ++
drivers/cpufreq/Makefile | 1 +
drivers/cpufreq/cpufreq_governor.h | 14 +
drivers/cpufreq/cpufreq_lab.c | 553 ++++++++++++++++++++++++++++++++++++
drivers/cpuidle/cpuidle.c | 8 +-
include/linux/cpufreq.h | 3 +
include/linux/cpuidle.h | 4 +
7 files changed, 605 insertions(+), 4 deletions(-)
create mode 100644 drivers/cpufreq/cpufreq_lab.c
--
1.7.9.5
This patch adds idle enter/exit timestamps to the cpuidle device
structure so that its current idle state can be determined. If the last
enter time is newer than the last exit time, the core is idle now. A
minimal sketch of that check follows.
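For reference, with these two timestamps a cpufreq-side reader can
infer idleness roughly like this (a sketch following how the LAB
governor patch later in this series uses the fields; cross-cpu reads
are racy, which is discussed further down the thread):

	struct cpuidle_device *dev = per_cpu(cpuidle_devices, cpu);

	/* enter newer than exit => the core is idling right now */
	bool idle_now = ktime_to_us(ktime_sub(dev->last_idle_start,
					      dev->last_idle_end)) > 0;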
Signed-off-by: Jonghwa Lee <[email protected]>
---
drivers/cpuidle/cpuidle.c | 8 ++++----
include/linux/cpuidle.h | 4 ++++
2 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index eba6929..1e830cc 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -233,18 +233,18 @@ int cpuidle_wrap_enter(struct cpuidle_device *dev,
int (*enter)(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index))
{
- ktime_t time_start, time_end;
s64 diff;
- time_start = ktime_get();
+ dev->last_idle_start = ktime_get();
index = enter(dev, drv, index);
- time_end = ktime_get();
+ dev->last_idle_end = ktime_get();
local_irq_enable();
- diff = ktime_to_us(ktime_sub(time_end, time_start));
+ diff = ktime_to_us(ktime_sub(dev->last_idle_end,
+ dev->last_idle_start));
if (diff > INT_MAX)
diff = INT_MAX;
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 480c14d..d1af05f 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -16,6 +16,7 @@
#include <linux/kobject.h>
#include <linux/completion.h>
#include <linux/hrtimer.h>
+#include <linux/ktime.h>
#define CPUIDLE_STATE_MAX 8
#define CPUIDLE_NAME_LEN 16
@@ -74,6 +75,9 @@ struct cpuidle_device {
struct kobject kobj;
struct completion kobj_unregister;
+ ktime_t last_idle_start;
+ ktime_t last_idle_end;
+
#ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
int safe_state_index;
cpumask_t coupled_cpus;
--
1.7.9.5
This patch introduces a new cpufreq governor named 'LAB'.

The LAB governor uses historical cpuidle state usage information to
determine how many cpus are busy now. The result, the number of idle
cpus, dynamically affects the next frequency.

For instance, assume it is working on a quad core processor. When all
4 cores are busy, the governor throttles the next frequency the most.
With 3 busy cores, the throttling is loosened. The fewer cores that are
busy, the higher the frequency that can be set, so when only one core
is busy, the governor releases the maximum frequency the system allows.
Illustrative numbers are shown below.

The name 'Legacy Application Boost' comes from this aspect: the system
gets its highest performance for a single-threaded process.
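Illustrative numbers only, assuming a hypothetical 1400000 kHz policy
maximum and the default limit table {35, 20, 10, 0, 0}:

  4 busy / 0 idle: limited by 35% ->  910000 kHz
  3 busy / 1 idle: limited by 20% -> 1120000 kHz
  2 busy / 2 idle: limited by 10% -> 1260000 kHz
  1 busy / 3 idle: no limit       -> 1400000 kHz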
This was tested on a Pegasus Quad board.
Signed-off-by: Jonghwa Lee <[email protected]>
Signed-off-by: Lukasz Majewski <[email protected]>
Signed-off-by: Myungjoo Ham <[email protected]>
---
drivers/cpufreq/Kconfig | 26 ++
drivers/cpufreq/Makefile | 1 +
drivers/cpufreq/cpufreq_governor.h | 14 +
drivers/cpufreq/cpufreq_lab.c | 553 ++++++++++++++++++++++++++++++++++++
include/linux/cpufreq.h | 3 +
5 files changed, 597 insertions(+)
create mode 100644 drivers/cpufreq/cpufreq_lab.c
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index cbcb21e..d0b22de 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,18 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
Be aware that not all cpufreq drivers support the conservative
governor. If unsure have a look at the help section of the
driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_LAB
+ bool "lab"
+ select CPU_FREQ_GOV_LAB
+ select CPU_FREQ_GOV_PERFORMANCE
+ help
+ Use the CPUFreq governor 'lab' as default. This allows
+ you to get a full dynamic frequency capable system by simply
+ loading your cpufreq low-level hardware driver.
+ Be aware that not all cpufreq drivers support the lab governor.
+ If unsure have a look at the help section of the driver.
+ Fallback governor will be the performance governor.
endchoice
config CPU_FREQ_GOV_PERFORMANCE
@@ -184,6 +196,20 @@ config CPU_FREQ_GOV_CONSERVATIVE
If in doubt, say N.
+config CPU_FREQ_GOV_LAB
+ tristate "'lab' cpufreq policy governor"
+ select CPU_FREQ_TABLE
+ select CPU_FREQ_GOV_COMMON
+ help
+ 'lab' - This driver adds a dynamic cpufreq policy governor.
+
+ To compile this driver as a module, choose M here: the
+ module will be called cpufreq_lab.
+
+ For details, take a look at linux/Documentation/cpu-freq.
+
+ If in doubt, say N.
+
config GENERIC_CPUFREQ_CPU0
tristate "Generic CPU0 cpufreq driver"
depends on HAVE_CLK && REGULATOR && PM_OPP && OF
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 863fd18..2520fa7 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_POWERSAVE) += cpufreq_powersave.o
obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE) += cpufreq_userspace.o
obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND) += cpufreq_ondemand.o
obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE) += cpufreq_conservative.o
+obj-$(CONFIG_CPU_FREQ_GOV_LAB) += cpufreq_lab.o
obj-$(CONFIG_CPU_FREQ_GOV_COMMON) += cpufreq_governor.o
# CPUfreq cross-arch helpers
diff --git a/drivers/cpufreq/cpufreq_governor.h b/drivers/cpufreq/cpufreq_governor.h
index 46bde01..3062d60 100644
--- a/drivers/cpufreq/cpufreq_governor.h
+++ b/drivers/cpufreq/cpufreq_governor.h
@@ -103,6 +103,19 @@ struct cs_cpu_dbs_info_s {
unsigned int enable:1;
};
+struct lb_cpu_dbs_info_s {
+ struct cpu_dbs_common_info cdbs;
+ u64 prev_cpu_iowait;
+ struct cpufreq_frequency_table *freq_table;
+ unsigned int freq_lo;
+ unsigned int freq_lo_jiffies;
+ unsigned int freq_hi_jiffies;
+ unsigned int rate_mult;
+ unsigned int sample_type:1;
+
+ unsigned int last_sampling_rate;
+};
+
/* Governers sysfs tunables */
struct od_dbs_tuners {
unsigned int ignore_nice;
@@ -128,6 +141,7 @@ struct dbs_data {
/* Common across governors */
#define GOV_ONDEMAND 0
#define GOV_CONSERVATIVE 1
+ #define GOV_LAB 2
int governor;
unsigned int min_sampling_rate;
struct attribute_group *attr_group;
diff --git a/drivers/cpufreq/cpufreq_lab.c b/drivers/cpufreq/cpufreq_lab.c
new file mode 100644
index 0000000..7841a50
--- /dev/null
+++ b/drivers/cpufreq/cpufreq_lab.c
@@ -0,0 +1,553 @@
+/*
+ * drivers/cpufreq/cpufreq_lab.c
+ *
+ * LAB(Legacy Application Boost) cpufreq governor
+ *
+ * Copyright (C) SAMSUNG Electronics. CO.
+ * Jonghwa Lee <[email protected]>
+ * Lukasz Majewski <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/cpufreq.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/kernel_stat.h>
+#include <linux/kobject.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/percpu-defs.h>
+#include <linux/sysfs.h>
+#include <linux/tick.h>
+#include <linux/types.h>
+#include <linux/cpuidle.h>
+#include <linux/slab.h>
+
+#include "cpufreq_governor.h"
+
+/* On-demand governor macros */
+#define DEF_FREQUENCY_DOWN_DIFFERENTIAL (10)
+#define DEF_FREQUENCY_UP_THRESHOLD (80)
+#define DEF_SAMPLING_DOWN_FACTOR (1)
+#define MAX_SAMPLING_DOWN_FACTOR (100000)
+#define MICRO_FREQUENCY_DOWN_DIFFERENTIAL (3)
+#define MICRO_FREQUENCY_UP_THRESHOLD (95)
+#define MICRO_FREQUENCY_MIN_SAMPLE_RATE (10000)
+#define MIN_FREQUENCY_UP_THRESHOLD (11)
+#define MAX_FREQUENCY_UP_THRESHOLD (100)
+
+#define MAX_HIST 10
+#define FREQ_STEP 50000
+#define IDLE_THRESHOLD 90
+
+static unsigned int lab_limit_rate[5] = {35, 20, 10, 0, 0};
+static unsigned long *ref_time;
+static unsigned int *idle_avg;
+static unsigned int **idle_hist;
+
+static struct dbs_data lb_dbs_data;
+static DEFINE_PER_CPU(struct lb_cpu_dbs_info_s, lb_cpu_dbs_info);
+
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_LAB
+static struct cpufreq_governor cpufreq_gov_lab;
+#endif
+
+static struct od_dbs_tuners lb_tuners = {
+ .up_threshold = DEF_FREQUENCY_UP_THRESHOLD,
+ .sampling_down_factor = DEF_SAMPLING_DOWN_FACTOR,
+ .adj_up_threshold = DEF_FREQUENCY_UP_THRESHOLD -
+ DEF_FREQUENCY_DOWN_DIFFERENTIAL,
+ .ignore_nice = 0,
+};
+
+static void dbs_freq_increase(struct cpufreq_policy *p, unsigned int freq)
+{
+ if (p->cur == freq)
+ return;
+
+ __cpufreq_driver_target(p, freq, CPUFREQ_RELATION_L);
+}
+
+static inline int cpu_idle_calc_avg(unsigned int *p, int size)
+{
+ int i, sum;
+
+ for (i = 0, sum = 0; i < size; p++, i++)
+ sum += *p;
+
+ return (int) (sum / size);
+}
+
+/*
+ * Every sampling_rate we check whether the current idle time is less than 20%
+ * (default); if so, we try to increase the frequency. Every sampling_rate we
+ * also look for the lowest frequency which can sustain the load while keeping
+ * idle time over 30%. If such a frequency exists, we try to decrease to this
+ * frequency.
+ *
+ * Any frequency increase takes it to the maximum frequency. Frequency reduction
+ * happens at minimum steps of 5% (default) of current frequency
+ */
+static void lb_check_cpu(int cpu, unsigned int load_freq)
+{
+ struct lb_cpu_dbs_info_s *dbs_info = &per_cpu(lb_cpu_dbs_info, cpu);
+ struct cpufreq_policy *policy = dbs_info->cdbs.cur_policy;
+ struct cpuidle_device *dev;
+ int i, idx, idle_cpus;
+ static int cnt;
+ unsigned int _max;
+
+ dbs_info->freq_lo = 0;
+
+ idle_cpus = 0;
+
+ idx = cnt++ % MAX_HIST;
+ /* Check for LAB limitation */
+ for_each_possible_cpu(i) {
+ ktime_t cur;
+ ktime_t last_idle_enter;
+ ktime_t last_idle_exit;
+ s64 delta_time;
+ unsigned int last_sampling_rate = dbs_info->last_sampling_rate;
+
+ dev = per_cpu(cpuidle_devices, i);
+ last_idle_enter = dev->last_idle_start;
+ last_idle_exit = dev->last_idle_end;
+
+ cur = ktime_get();
+
+ /* Check whether the i'th core is in idle */
+ if (ktime_to_us(ktime_sub(last_idle_enter,
+ last_idle_exit)) > 0) {
+ if (ktime_to_us(ktime_sub(cur, last_idle_enter))
+ > last_sampling_rate) {
+ delta_time = last_sampling_rate;
+ } else {
+ delta_time = ktime_to_us(ktime_sub(cur,
+ last_idle_enter));
+ if (ktime_to_us(ktime_sub(cur, last_idle_exit))
+ < last_sampling_rate)
+ delta_time += dev->states_usage[0].time
+ - ref_time[i];
+ }
+ } else {
+ delta_time = dev->states_usage[0].time - ref_time[i];
+ }
+
+ delta_time *= 100;
+ if (last_sampling_rate > 0)
+ delta_time = div_s64(delta_time, last_sampling_rate);
+ else
+ delta_time = 100;
+
+ if (delta_time > 100)
+ delta_time = 100;
+
+ idle_hist[i][idx] = delta_time;
+
+ ref_time[i] = dev->states_usage[0].time;
+
+ idle_avg[i] = cpu_idle_calc_avg(idle_hist[i],
+ cnt < 10 ? cnt : MAX_HIST);
+ if (idle_avg[i] > IDLE_THRESHOLD)
+ idle_cpus++;
+ }
+
+ _max = policy->max * (100 - lab_limit_rate[idle_cpus]);
+ _max /= 100;
+
+ if (!idx)
+ pr_debug("_max : %d, idle_cpus : %d, avg : %d %d %d %d\n", _max,
+ idle_cpus, idle_avg[0], idle_avg[1], idle_avg[2], idle_avg[3]);
+
+ /* Check for frequency increase */
+ if (load_freq > lb_tuners.up_threshold * policy->cur) {
+ unsigned int freq_next;
+ static unsigned int inc;
+
+ if (!idle_cpus)
+ inc += FREQ_STEP / 2;
+ else
+ inc += FREQ_STEP * idle_cpus;
+
+ freq_next = min(policy->cur + inc, _max);
+ if (freq_next == _max)
+ inc = 0;
+ /* If switching to max speed, apply sampling_down_factor */
+ if (policy->cur < _max)
+ dbs_info->rate_mult =
+ lb_tuners.sampling_down_factor;
+
+ dbs_freq_increase(policy, freq_next);
+ return;
+ }
+
+ /* Check for frequency decrease */
+ /* if we cannot reduce the frequency anymore, break out early */
+ if (policy->cur == policy->min)
+ return;
+
+ /*
+ * The optimal frequency is the frequency that is the lowest that can
+ * support the current CPU usage without triggering the up policy. To be
+ * safe, we focus 10 points under the threshold.
+ */
+ if (load_freq < lb_tuners.adj_up_threshold * policy->cur) {
+ unsigned int freq_next;
+ freq_next = load_freq / lb_tuners.adj_up_threshold;
+
+ /* No longer fully busy, reset rate_mult */
+ dbs_info->rate_mult = 1;
+
+ if (freq_next < policy->min)
+ freq_next = policy->min;
+
+ __cpufreq_driver_target(policy, freq_next,
+ CPUFREQ_RELATION_L);
+ }
+}
+
+static void lb_dbs_timer(struct work_struct *work)
+{
+ struct delayed_work *dw = to_delayed_work(work);
+ struct lb_cpu_dbs_info_s *dbs_info =
+ container_of(work, struct lb_cpu_dbs_info_s, cdbs.work.work);
+ unsigned int cpu = dbs_info->cdbs.cur_policy->cpu;
+ struct lb_cpu_dbs_info_s *core_dbs_info = &per_cpu(lb_cpu_dbs_info,
+ cpu);
+ int delay, sample_type = core_dbs_info->sample_type;
+ bool eval_load;
+
+ mutex_lock(&core_dbs_info->cdbs.timer_mutex);
+ eval_load = need_load_eval(&core_dbs_info->cdbs,
+ lb_tuners.sampling_rate);
+
+ /* Common NORMAL_SAMPLE setup */
+ core_dbs_info->sample_type = OD_NORMAL_SAMPLE;
+ if (sample_type == OD_SUB_SAMPLE) {
+ delay = core_dbs_info->freq_lo_jiffies;
+ if (eval_load)
+ __cpufreq_driver_target(core_dbs_info->cdbs.cur_policy,
+ core_dbs_info->freq_lo,
+ CPUFREQ_RELATION_H);
+ } else {
+ if (eval_load)
+ dbs_check_cpu(&lb_dbs_data, cpu);
+ if (core_dbs_info->freq_lo) {
+ /* Setup timer for SUB_SAMPLE */
+ core_dbs_info->sample_type = OD_SUB_SAMPLE;
+ delay = core_dbs_info->freq_hi_jiffies;
+ } else {
+ delay = delay_for_sampling_rate(lb_tuners.sampling_rate
+ * core_dbs_info->rate_mult);
+ }
+ }
+
+ dbs_info->last_sampling_rate = jiffies_to_usecs(delay);
+
+ schedule_delayed_work_on(smp_processor_id(), dw, delay);
+ mutex_unlock(&core_dbs_info->cdbs.timer_mutex);
+}
+
+/************************** sysfs interface ************************/
+
+static ssize_t show_sampling_rate_min(struct kobject *kobj,
+ struct attribute *attr, char *buf)
+{
+ return sprintf(buf, "%u\n", lb_dbs_data.min_sampling_rate);
+}
+
+/**
+ * update_sampling_rate - update sampling rate effective immediately if needed.
+ * @new_rate: new sampling rate
+ *
+ * If new rate is smaller than the old, simply updating
+ * dbs_tuners_int.sampling_rate might not be appropriate. For example, if the
+ * original sampling_rate was 1 second and the requested new sampling rate is 10
+ * ms because the user needs immediate reaction from lab governor, but not
+ * sure if higher frequency will be required or not, then, the governor may
+ * change the sampling rate too late; up to 1 second later. Thus, if we are
+ * reducing the sampling rate, we need to make the new value effective
+ * immediately.
+ */
+static void update_sampling_rate(unsigned int new_rate)
+{
+ int cpu;
+
+ lb_tuners.sampling_rate = new_rate = max(new_rate,
+ lb_dbs_data.min_sampling_rate);
+
+ for_each_online_cpu(cpu) {
+ struct cpufreq_policy *policy;
+ struct lb_cpu_dbs_info_s *dbs_info;
+ unsigned long next_sampling, appointed_at;
+
+ policy = cpufreq_cpu_get(cpu);
+ if (!policy)
+ continue;
+ if (policy->governor != &cpufreq_gov_lab) {
+ cpufreq_cpu_put(policy);
+ continue;
+ }
+ dbs_info = &per_cpu(lb_cpu_dbs_info, cpu);
+ cpufreq_cpu_put(policy);
+
+ mutex_lock(&dbs_info->cdbs.timer_mutex);
+
+ if (!delayed_work_pending(&dbs_info->cdbs.work)) {
+ mutex_unlock(&dbs_info->cdbs.timer_mutex);
+ continue;
+ }
+
+ next_sampling = jiffies + usecs_to_jiffies(new_rate);
+ appointed_at = dbs_info->cdbs.work.timer.expires;
+
+ if (time_before(next_sampling, appointed_at)) {
+
+ mutex_unlock(&dbs_info->cdbs.timer_mutex);
+ cancel_delayed_work_sync(&dbs_info->cdbs.work);
+ mutex_lock(&dbs_info->cdbs.timer_mutex);
+
+ schedule_delayed_work_on(cpu, &dbs_info->cdbs.work,
+ usecs_to_jiffies(new_rate));
+
+ }
+ mutex_unlock(&dbs_info->cdbs.timer_mutex);
+ }
+}
+
+static ssize_t store_sampling_rate(struct kobject *a, struct attribute *b,
+ const char *buf, size_t count)
+{
+ unsigned int input;
+ int ret;
+ ret = sscanf(buf, "%u", &input);
+ if (ret != 1)
+ return -EINVAL;
+ update_sampling_rate(input);
+ return count;
+}
+
+static ssize_t store_io_is_busy(struct kobject *a, struct attribute *b,
+ const char *buf, size_t count)
+{
+ unsigned int input;
+ int ret;
+
+ ret = sscanf(buf, "%u", &input);
+ if (ret != 1)
+ return -EINVAL;
+ lb_tuners.io_is_busy = !!input;
+ return count;
+}
+
+static ssize_t store_up_threshold(struct kobject *a, struct attribute *b,
+ const char *buf, size_t count)
+{
+ unsigned int input;
+ int ret;
+ ret = sscanf(buf, "%u", &input);
+
+ if (ret != 1 || input > MAX_FREQUENCY_UP_THRESHOLD ||
+ input < MIN_FREQUENCY_UP_THRESHOLD) {
+ return -EINVAL;
+ }
+ /* Calculate the new adj_up_threshold */
+ lb_tuners.adj_up_threshold += input;
+ lb_tuners.adj_up_threshold -= lb_tuners.up_threshold;
+
+ lb_tuners.up_threshold = input;
+ return count;
+}
+
+static ssize_t store_sampling_down_factor(struct kobject *a,
+ struct attribute *b, const char *buf, size_t count)
+{
+ unsigned int input, j;
+ int ret;
+ ret = sscanf(buf, "%u", &input);
+
+ if (ret != 1 || input > MAX_SAMPLING_DOWN_FACTOR || input < 1)
+ return -EINVAL;
+ lb_tuners.sampling_down_factor = input;
+
+ /* Reset down sampling multiplier in case it was active */
+ for_each_online_cpu(j) {
+ struct lb_cpu_dbs_info_s *dbs_info = &per_cpu(lb_cpu_dbs_info,
+ j);
+ dbs_info->rate_mult = 1;
+ }
+ return count;
+}
+
+static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
+ const char *buf, size_t count)
+{
+ unsigned int input;
+ int ret;
+
+ unsigned int j;
+
+ ret = sscanf(buf, "%u", &input);
+ if (ret != 1)
+ return -EINVAL;
+
+ if (input > 1)
+ input = 1;
+
+ if (input == lb_tuners.ignore_nice) { /* nothing to do */
+ return count;
+ }
+ lb_tuners.ignore_nice = input;
+
+ /* we need to re-evaluate prev_cpu_idle */
+ for_each_online_cpu(j) {
+ struct lb_cpu_dbs_info_s *dbs_info;
+ dbs_info = &per_cpu(lb_cpu_dbs_info, j);
+ dbs_info->cdbs.prev_cpu_idle = get_cpu_idle_time(j,
+ &dbs_info->cdbs.prev_cpu_wall);
+ if (lb_tuners.ignore_nice)
+ dbs_info->cdbs.prev_cpu_nice =
+ kcpustat_cpu(j).cpustat[CPUTIME_NICE];
+
+ }
+ return count;
+}
+
+show_one(lb, sampling_rate, sampling_rate);
+show_one(lb, io_is_busy, io_is_busy);
+show_one(lb, up_threshold, up_threshold);
+show_one(lb, sampling_down_factor, sampling_down_factor);
+show_one(lb, ignore_nice_load, ignore_nice);
+
+define_one_global_rw(sampling_rate);
+define_one_global_rw(io_is_busy);
+define_one_global_rw(up_threshold);
+define_one_global_rw(sampling_down_factor);
+define_one_global_rw(ignore_nice_load);
+define_one_global_ro(sampling_rate_min);
+
+static struct attribute *dbs_attributes[] = {
+ &sampling_rate_min.attr,
+ &sampling_rate.attr,
+ &up_threshold.attr,
+ &sampling_down_factor.attr,
+ &ignore_nice_load.attr,
+ &io_is_busy.attr,
+ NULL
+};
+
+static struct attribute_group lb_attr_group = {
+ .attrs = dbs_attributes,
+ .name = "lab",
+};
+
+/************************** sysfs end ************************/
+
+define_get_cpu_dbs_routines(lb_cpu_dbs_info);
+
+static struct od_ops lb_ops = {
+ .freq_increase = dbs_freq_increase,
+};
+
+static struct dbs_data lb_dbs_data = {
+ .governor = GOV_LAB,
+ .attr_group = &lb_attr_group,
+ .tuners = &lb_tuners,
+ .get_cpu_cdbs = get_cpu_cdbs,
+ .get_cpu_dbs_info_s = get_cpu_dbs_info_s,
+ .gov_dbs_timer = lb_dbs_timer,
+ .gov_check_cpu = lb_check_cpu,
+ .gov_ops = &lb_ops,
+};
+
+static int lb_cpufreq_governor_dbs(struct cpufreq_policy *policy,
+ unsigned int event)
+{
+ return cpufreq_governor_dbs(&lb_dbs_data, policy, event);
+}
+
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_LAB
+static
+#endif
+struct cpufreq_governor cpufreq_gov_lab = {
+ .name = "lab",
+ .governor = lb_cpufreq_governor_dbs,
+ .max_transition_latency = TRANSITION_LATENCY_LIMIT,
+ .owner = THIS_MODULE,
+};
+
+static int __init cpufreq_gov_dbs_init(void)
+{
+ u64 idle_time;
+ int i, cpu = get_cpu();
+
+ mutex_init(&lb_dbs_data.mutex);
+ idle_time = get_cpu_idle_time_us(cpu, NULL);
+ put_cpu();
+ if (idle_time != -1ULL) {
+ /* Idle micro accounting is supported. Use finer thresholds */
+ lb_tuners.up_threshold = MICRO_FREQUENCY_UP_THRESHOLD;
+ lb_tuners.adj_up_threshold = MICRO_FREQUENCY_UP_THRESHOLD -
+ MICRO_FREQUENCY_DOWN_DIFFERENTIAL;
+ /*
+ * In nohz/micro accounting case we set the minimum frequency
+ * not depending on HZ, but fixed (very low). The deferred
+ * timer might skip some samples if idle/sleeping as needed.
+ */
+ lb_dbs_data.min_sampling_rate = MICRO_FREQUENCY_MIN_SAMPLE_RATE;
+ } else {
+ /* For correct statistics, we need 10 ticks for each measure */
+ lb_dbs_data.min_sampling_rate = MIN_SAMPLING_RATE_RATIO *
+ jiffies_to_usecs(10);
+ }
+
+ /* Initialize arrays */
+	ref_time = kzalloc(num_possible_cpus() * sizeof(unsigned long),
+			GFP_KERNEL);
+	idle_avg = kzalloc(num_possible_cpus() * sizeof(unsigned int),
+			GFP_KERNEL);
+	idle_hist = kzalloc(num_possible_cpus() * sizeof(unsigned int *),
+			GFP_KERNEL);
+	for (i = 0; i < num_possible_cpus(); i++)
+		idle_hist[i] = kzalloc(MAX_HIST * sizeof(unsigned int),
+				GFP_KERNEL);
+
+ return cpufreq_register_governor(&cpufreq_gov_lab);
+}
+
+static void __exit cpufreq_gov_dbs_exit(void)
+{
+ int i;
+
+	/* kfree(NULL) is a no-op */
+	kfree(ref_time);
+	kfree(idle_avg);
+	if (idle_hist) {
+		for (i = 0; i < num_possible_cpus(); i++)
+			kfree(idle_hist[i]);
+		kfree(idle_hist);
+	}
+
+ cpufreq_unregister_governor(&cpufreq_gov_lab);
+}
+
+MODULE_AUTHOR("Jonghwa Lee <[email protected]>");
+MODULE_AUTHOR("Lukasz Majewski <[email protected]>");
+MODULE_DESCRIPTION("'cpufreq_lab' - A dynamic cpufreq governor for "
+ "Legacy Application Boosting");
+MODULE_LICENSE("GPL");
+
+#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_LAB
+fs_initcall(cpufreq_gov_dbs_init);
+#else
+module_init(cpufreq_gov_dbs_init);
+#endif
+module_exit(cpufreq_gov_dbs_exit);
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index a22944c..050f28b 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -382,6 +382,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
extern struct cpufreq_governor cpufreq_gov_conservative;
#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_LAB)
+extern struct cpufreq_governor cpufreq_gov_lab;
+#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_lab)
#endif
--
1.7.9.5
On Mon, Apr 1, 2013 at 1:54 PM, Jonghwa Lee <[email protected]> wrote:
> <<Purpose>>
> One of the problem of ondemand is that it considers the most busy
> cpu only while doesn't care how many cpu is in busy state at the
> moment. This may results in unnecessary power consumption, and it'll
> be critical for the system having limited power source.
>
> To get the best energy efficiency, LAB governor considers not only
> idle time but also the number of idle cpus. It primarily focuses on
> supplying adequate performance to users within the limited resource.
> It checks the number of idle cpus then controls the maximum frequency
> dynamically. And it also applies different frequency increasing level
> depends on that information. In simple terms the less the number of
> busy cpus, the better performance will be given.
> In addition, stable system's power consumption in the busy state can
> be achieved also with using LAB governor. This will help to manage and
> estimate power consumption in certain system.
Hi Jonghwa,
First of all, I should admit that I haven't gotten into the minute
details of your patch yet, but I have done a broad review of it.
There are many things that I am concerned about:
- I don't want an additional governor to be added to cpufreq unless there is a
very, very strong reason for it. See what happened to earlier attempts:
https://lkml.org/lkml/2012/2/7/504
But it doesn't mean you can't get it in. :)
- What is the real logic behind your patchset? I haven't got it
completely from your mails. What you said is:
- The lesser the number of busy cpus: you want to run at higher freqs
- The more the number of busy cpus: you want to run at lower freqs
But the basic idea I had about this stuff was: the more busy cpus
there are, the more loaded the system is (otherwise the scheduler
wouldn't have used so many cpus), and so there is a need to run at a
higher frequency rather than a lower one. That would save power in a
sense: finish the work early and let most of the cpus enter the idle
state as early as possible. But with your solution we would run at
lower frequencies, and so these cpus will take longer to get into the
idle state again. This will really kill a lot of power.
Think about it.
- In case you need some sort of support for this use case, why replicate the
ondemand governor by creating another one? I have had a hard time removing
redundancy inside the governors, and you are again heading in that
direction. Modifying the ondemand governor for this use case would be a
better option.
- You haven't rebased on the latest code from linux-next :)
--
viresh
On 04/01/2013 10:24 AM, Jonghwa Lee wrote:
> This patch adds idle state time stamp to cpuidle device structure to
> notify its current idle state. If last enter time is newer than last
> exit time, then it means that the core is in idle now.
>
> Signed-off-by: Jonghwa Lee <[email protected]>
> ---
The patch description does not explain what problem you want to solve
or how you solve it, and the patch itself shows nothing.
Could you elaborate?
Thanks
-- Daniel
> drivers/cpuidle/cpuidle.c | 8 ++++----
> include/linux/cpuidle.h | 4 ++++
> 2 files changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index eba6929..1e830cc 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -233,18 +233,18 @@ int cpuidle_wrap_enter(struct cpuidle_device *dev,
> int (*enter)(struct cpuidle_device *dev,
> struct cpuidle_driver *drv, int index))
> {
> - ktime_t time_start, time_end;
> s64 diff;
>
> - time_start = ktime_get();
> + dev->last_idle_start = ktime_get();
>
> index = enter(dev, drv, index);
>
> - time_end = ktime_get();
> + dev->last_idle_end = ktime_get();
>
> local_irq_enable();
>
> - diff = ktime_to_us(ktime_sub(time_end, time_start));
> + diff = ktime_to_us(ktime_sub(dev->last_idle_end,
> + dev->last_idle_start));
> if (diff > INT_MAX)
> diff = INT_MAX;
>
> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
> index 480c14d..d1af05f 100644
> --- a/include/linux/cpuidle.h
> +++ b/include/linux/cpuidle.h
> @@ -16,6 +16,7 @@
> #include <linux/kobject.h>
> #include <linux/completion.h>
> #include <linux/hrtimer.h>
> +#include <linux/ktime.h>
>
> #define CPUIDLE_STATE_MAX 8
> #define CPUIDLE_NAME_LEN 16
> @@ -74,6 +75,9 @@ struct cpuidle_device {
> struct kobject kobj;
> struct completion kobj_unregister;
>
> + ktime_t last_idle_start;
> + ktime_t last_idle_end;
> +
> #ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
> int safe_state_index;
> cpumask_t coupled_cpus;
>
On 04/02/2013 02:00 PM, Daniel Lezcano wrote:
> On 04/01/2013 10:24 AM, Jonghwa Lee wrote:
>> This patch adds idle state time stamp to cpuidle device structure to
>> notify its current idle state. If last enter time is newer than last
>> exit time, then it means that the core is in idle now.
>>
>> Signed-off-by: Jonghwa Lee <[email protected]>
>> ---
>
> The patch description does not explain what problem you want to solve,
> how to solve it and the patch itself shows nothing.
>
> Could you elaborate ?
I'm sorry for the lacking description; let me supplement it.

This patch only adds timestamps for idle enter/exit, nothing more. The
reason I needed them is that I wanted to know the current cpu idle
state; right now it is hard to know whether a cpu is idle or not.

When I check the cpuidle state usage, the information is sometimes
wrong, because it is updated only when the cpu exits the idle state. So
while the cpu is idling, the cpuidle state usage holds the past value.
Therefore I added the timestamps for cpuidle enter/exit, to check
current idling and to calculate the idle state usage correctly, even
while the cpu is idling.

I made this patch as a temporary measure for my cpufreq governor work,
so it just uses one pair of timestamps for all idle states together. I
know the code looks rough, as this patchset is just suggesting the
idea; after the RFC round I plan to update the patch to keep a
timestamp per idle state.

Thanks,
>
> Thanks
> -- Daniel
>
>> drivers/cpuidle/cpuidle.c | 8 ++++----
>> include/linux/cpuidle.h | 4 ++++
>> 2 files changed, 8 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
>> index eba6929..1e830cc 100644
>> --- a/drivers/cpuidle/cpuidle.c
>> +++ b/drivers/cpuidle/cpuidle.c
>> @@ -233,18 +233,18 @@ int cpuidle_wrap_enter(struct cpuidle_device *dev,
>> int (*enter)(struct cpuidle_device *dev,
>> struct cpuidle_driver *drv, int index))
>> {
>> - ktime_t time_start, time_end;
>> s64 diff;
>>
>> - time_start = ktime_get();
>> + dev->last_idle_start = ktime_get();
>>
>> index = enter(dev, drv, index);
>>
>> - time_end = ktime_get();
>> + dev->last_idle_end = ktime_get();
>>
>> local_irq_enable();
>>
>> - diff = ktime_to_us(ktime_sub(time_end, time_start));
>> + diff = ktime_to_us(ktime_sub(dev->last_idle_end,
>> + dev->last_idle_start));
>> if (diff > INT_MAX)
>> diff = INT_MAX;
>>
>> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
>> index 480c14d..d1af05f 100644
>> --- a/include/linux/cpuidle.h
>> +++ b/include/linux/cpuidle.h
>> @@ -16,6 +16,7 @@
>> #include <linux/kobject.h>
>> #include <linux/completion.h>
>> #include <linux/hrtimer.h>
>> +#include <linux/ktime.h>
>>
>> #define CPUIDLE_STATE_MAX 8
>> #define CPUIDLE_NAME_LEN 16
>> @@ -74,6 +75,9 @@ struct cpuidle_device {
>> struct kobject kobj;
>> struct completion kobj_unregister;
>>
>> + ktime_t last_idle_start;
>> + ktime_t last_idle_end;
>> +
>> #ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
>> int safe_state_index;
>> cpumask_t coupled_cpus;
>>
>
>
On 04/02/2013 08:17 AM, [email protected] wrote:
> On 04/02/2013 02:00 PM, Daniel Lezcano wrote:
>
>> On 04/01/2013 10:24 AM, Jonghwa Lee wrote:
>>> This patch adds idle state time stamp to cpuidle device structure to
>>> notify its current idle state. If last enter time is newer than last
>>> exit time, then it means that the core is in idle now.
>>>
>>> Signed-off-by: Jonghwa Lee <[email protected]>
>>> ---
>>
>> The patch description does not explain what problem you want to solve,
>> how to solve it and the patch itself shows nothing.
>>
>> Could you elaborate ?
>
>
> I'm sorry for lacking description. I supplement more.
>
> This patch does add time-stamp for idle enter/exit only nothing more.
> The reason why I needed them is that I wanted to know current cpu idle
> state. It is hard to know whether cpu is in idle or not now.
Did you look at:
include/linux/sched.h:extern int idle_cpu(int cpu);
?
> When I check the cpuidle state usage, sometimes the information is wrong.
> Because it is updated only when the cpu exits the idle state. So while the
> cpu is idling, the cpuidle state usage holds past one. Therefore I put
> the time-stamp for cpuidle enter/exit for checking current idling and
> calculating idle state usage correctly.
>
> I just make this patch temporary for my cpufreq governor work. So, it just
> use time-stamp for all idle state together. After RFC working, I have a plan
> to update this patch to use timestamp for each idle state.
I suggest you look at the enter_idle / exit_idle functions and make your
governor subscribe to the IDLE_START/EXIT notifiers:
arch/x86/kernel/process.c
These are defined for the x86 architecture; it may be worth adding them
to other architectures.
Thanks
-- Daniel
On 04/02/2013 04:34 PM, Daniel Lezcano wrote:
> On 04/02/2013 08:17 AM, [email protected] wrote:
>> On 04/02/2013 02:00 PM, Daniel Lezcano wrote:
>>
>>> On 04/01/2013 10:24 AM, Jonghwa Lee wrote:
>>>> This patch adds idle state time stamp to cpuidle device structure to
>>>> notify its current idle state. If last enter time is newer than last
>>>> exit time, then it means that the core is in idle now.
>>>>
>>>> Signed-off-by: Jonghwa Lee <[email protected]>
>>>> ---
>>>
>>> The patch description does not explain what problem you want to solve,
>>> how to solve it and the patch itself shows nothing.
>>>
>>> Could you elaborate ?
>>
>>
>> I'm sorry for lacking description. I supplement more.
>>
>> This patch does add time-stamp for idle enter/exit only nothing more.
>> The reason why I needed them is that I wanted to know current cpu idle
>> state. It is hard to know whether cpu is in idle or not now.
>
> Did you looked at:
>
> include/linux/sched.h:extern int idle_cpu(int cpu);
>
Yes, I did.
> ?
>
>> When I check the cpuidle state usage, sometimes the information is wrong.
>> Because it is updated only when the cpu exits the idle state. So while the
>> cpu is idling, the cpuidle state usage holds past one. Therefore I put
>> the time-stamp for cpuidle enter/exit for checking current idling and
>> calculating idle state usage correctly.
>>
>> I just make this patch temporary for my cpufreq governor work. So, it just
>> use time-stamp for all idle state together. After RFC working, I have a plan
>> to update this patch to use timestamp for each idle state.
>
> I suggest you look at the enter_idle / exit_idle function and make your
> governor to subscribe to the IDLE_START/EXIT notifiers.
>
> arch/x86/kernel/process.c
>
> These are defined for the x86 architecture, maybe worth to add it to
> another architecture.
>
Thanks for your opinion.
Actually, I work on the ARM architecture, and I knew that the attempt
to apply an idle notifier failed. You probably knew it too, since the
link you gave me before points to that attempt
(https://lkml.org/lkml/2012/2/7/504). :) Currently the only notifier
call is the one for the LED in arch/arm/kernel/process.c, and I don't
think it is meant for my use. Anyway, do you really think using a
notifier is better than my way? I think it is too heavy for my case: on
my board, entering idle sometimes happens hundreds of times within
100ms, and I don't want to call a notifier that often. IMO, just moving
the local variables to per-cpu variables to expose the enter/exit times
looks better, although it requires code modification on the cpuidle
side. What do you think?
Thanks,
Jonghwa.
> Thanks
> -- Daniel
>
On 04/02/2013 11:37 AM, [email protected] wrote:
> On 04/02/2013 04:34 PM, Daniel Lezcano wrote:
>
>> On 04/02/2013 08:17 AM, [email protected] wrote:
>>> On 04/02/2013 02:00 PM, Daniel Lezcano wrote:
>>>
>>>> On 04/01/2013 10:24 AM, Jonghwa Lee wrote:
>>>>> This patch adds idle state time stamp to cpuidle device structure to
>>>>> notify its current idle state. If last enter time is newer than last
>>>>> exit time, then it means that the core is in idle now.
>>>>>
>>>>> Signed-off-by: Jonghwa Lee <[email protected]>
>>>>> ---
>>>>
>>>> The patch description does not explain what problem you want to solve,
>>>> how to solve it and the patch itself shows nothing.
>>>>
>>>> Could you elaborate ?
>>>
>>>
>>> I'm sorry for lacking description. I supplement more.
>>>
>>> This patch does add time-stamp for idle enter/exit only nothing more.
>>> The reason why I needed them is that I wanted to know current cpu idle
>>> state. It is hard to know whether cpu is in idle or not now.
>>
>> Did you looked at:
>>
>> include/linux/sched.h:extern int idle_cpu(int cpu);
>>
>
>
> Yes, I did.
>
>> ?
>>
>>> When I check the cpuidle state usage, sometimes the information is wrong.
>>> Because it is updated only when the cpu exits the idle state. So while the
>>> cpu is idling, the cpuidle state usage holds past one. Therefore I put
>>> the time-stamp for cpuidle enter/exit for checking current idling and
>>> calculating idle state usage correctly.
>>>
>>> I just make this patch temporary for my cpufreq governor work. So, it just
>>> use time-stamp for all idle state together. After RFC working, I have a plan
>>> to update this patch to use timestamp for each idle state.
>>
>> I suggest you look at the enter_idle / exit_idle function and make your
>> governor to subscribe to the IDLE_START/EXIT notifiers.
>>
>> arch/x86/kernel/process.c
>>
>> These are defined for the x86 architecture, maybe worth to add it to
>> another architecture.
>>
>
>
> Thanks for your opinion.
>
> Actually, I work on ARM architecture and I knew that the attempt of applying
> idle notifier was failed. You probably knew it, because the link you gave me
> before is that attempt. (https://lkml.org/lkml/2012/2/7/504) :)
Yeah, now I recall this thread.
> Currently, there
> is only notifying call which is for led in arch/arm/kernel/process.c and I think
> it isn't for me to use. Anyway, Do you really think it is better way to use
> notifier than my way? Because I think it is too heavy for me. On my board,
> sometimes entering idle happened hundreds times during the 100ms. I don't want
> to call notifier that much time. IMO, just moving local variable to per-cpu
> variable for showing the enter/exit time looks better although it requires code
> modification on cpudile side. What do you think?
Sorry, but it is hard to figure out what you are trying to achieve with
a single patch.
IIUC, you want to know how long the cpu has been idle, including the
current idle period, right? So you need to know whether the cpu is idle
and when it entered the idle state, correct?
If the cpu is idle and the information is per cpu, how will you read
this value from another cpu without introducing a locking mechanism?
Does it mean the cpufreq governor needs cpuidle? I am wondering whether
this information shouldn't be retrieved from the scheduler rather than
from cpuidle.
On 04/02/2013 07:08 PM, Daniel Lezcano wrote:
> On 04/02/2013 11:37 AM, [email protected] wrote:
>> On 04/02/2013 04:34 PM, Daniel Lezcano wrote:
>>
>>> On 04/02/2013 08:17 AM, [email protected] wrote:
>>>> On 04/02/2013 02:00 PM, Daniel Lezcano wrote:
>>>>
>>>>> On 04/01/2013 10:24 AM, Jonghwa Lee wrote:
>>>>>> This patch adds idle state time stamp to cpuidle device structure to
>>>>>> notify its current idle state. If last enter time is newer than last
>>>>>> exit time, then it means that the core is in idle now.
>>>>>>
>>>>>> Signed-off-by: Jonghwa Lee <[email protected]>
>>>>>> ---
>>>>>
>>>>> The patch description does not explain what problem you want to solve,
>>>>> how to solve it and the patch itself shows nothing.
>>>>>
>>>>> Could you elaborate ?
>>>>
>>>>
>>>> I'm sorry for lacking description. I supplement more.
>>>>
>>>> This patch does add time-stamp for idle enter/exit only nothing more.
>>>> The reason why I needed them is that I wanted to know current cpu idle
>>>> state. It is hard to know whether cpu is in idle or not now.
>>>
>>> Did you looked at:
>>>
>>> include/linux/sched.h:extern int idle_cpu(int cpu);
>>>
>>
>>
>> Yes, I did.
>>
>>> ?
>>>
>>>> When I check the cpuidle state usage, sometimes the information is wrong.
>>>> Because it is updated only when the cpu exits the idle state. So while the
>>>> cpu is idling, the cpuidle state usage holds past one. Therefore I put
>>>> the time-stamp for cpuidle enter/exit for checking current idling and
>>>> calculating idle state usage correctly.
>>>>
>>>> I just make this patch temporary for my cpufreq governor work. So, it just
>>>> use time-stamp for all idle state together. After RFC working, I have a plan
>>>> to update this patch to use timestamp for each idle state.
>>>
>>> I suggest you look at the enter_idle / exit_idle function and make your
>>> governor to subscribe to the IDLE_START/EXIT notifiers.
>>>
>>> arch/x86/kernel/process.c
>>>
>>> These are defined for the x86 architecture, maybe worth to add it to
>>> another architecture.
>>>
>>
>>
>> Thanks for your opinion.
>>
>> Actually, I work on ARM architecture and I knew that the attempt of applying
>> idle notifier was failed. You probably knew it, because the link you gave me
>> before is that attempt. (https://lkml.org/lkml/2012/2/7/504) :)
>
> Yeah, now I recall this thread.
>
Oh my, I thought you gave the link, but you didn't. It was Viresh
Kumar, on another patch of the patchset. Sorry.
>> Currently, there
>> is only notifying call which is for led in arch/arm/kernel/process.c and I think
>> it isn't for me to use. Anyway, Do you really think it is better way to use
>> notifier than my way? Because I think it is too heavy for me. On my board,
>> sometimes entering idle happened hundreds times during the 100ms. I don't want
>> to call notifier that much time. IMO, just moving local variable to per-cpu
>> variable for showing the enter/exit time looks better although it requires code
>> modification on cpudile side. What do you think?
>
> Sorry, but it is hard to figure out what you are trying to achieve with
> a single patch.
>
> IIUC, you want to know how long the cpu is idle including the current
> state, right ? So you need to know if the cpu is idle and when it
> entered the idle state, correct ?
>
Exactly.
> If the cpu is idle and the information is per cpu, how will you read
> this value from another cpu without introducing a locking mechanism ?
>
I think the incoherency of that data can be tolerated. The governor
only reads the data, and if the recorded start or end time differs from
the real one by a few usec, it doesn't affect the result much.
> Does it mean the cpufreq governor needs cpuidle ? I am wondering if
> these informations shouldn't be retrieved from the scheduler, not from
> cpuidle.
>
Yes, the tick_sched per-cpu variable has all the information I need,
but it isn't a global variable, and I'm hesitant to change a static
variable into a global one as I please.
Thanks,
Jonghwa
On 04/02/2013 01:07 PM, [email protected] wrote:
> On 04/02/2013 07:08 PM, Daniel Lezcano wrote:
>
>> On 04/02/2013 11:37 AM, [email protected] wrote:
>>> On 04/02/2013 04:34 PM, Daniel Lezcano wrote:
>>>
>>>> On 04/02/2013 08:17 AM, [email protected] wrote:
>>>>> On 04/02/2013 02:00 PM, Daniel Lezcano wrote:
>>>>>
>>>>>> On 04/01/2013 10:24 AM, Jonghwa Lee wrote:
>>>>>>> This patch adds idle state time stamp to cpuidle device structure to
>>>>>>> notify its current idle state. If last enter time is newer than last
>>>>>>> exit time, then it means that the core is in idle now.
>>>>>>>
>>>>>>> Signed-off-by: Jonghwa Lee <[email protected]>
>>>>>>> ---
>>>>>>
>>>>>> The patch description does not explain what problem you want to solve,
>>>>>> how to solve it and the patch itself shows nothing.
>>>>>>
>>>>>> Could you elaborate ?
>>>>>
>>>>>
>>>>> I'm sorry for lacking description. I supplement more.
>>>>>
>>>>> This patch does add time-stamp for idle enter/exit only nothing more.
>>>>> The reason why I needed them is that I wanted to know current cpu idle
>>>>> state. It is hard to know whether cpu is in idle or not now.
>>>>
>>>> Did you looked at:
>>>>
>>>> include/linux/sched.h:extern int idle_cpu(int cpu);
>>>>
>>>
>>>
>>> Yes, I did.
>>>
>>>> ?
>>>>
>>>>> When I check the cpuidle state usage, sometimes the information is wrong.
>>>>> Because it is updated only when the cpu exits the idle state. So while the
>>>>> cpu is idling, the cpuidle state usage holds past one. Therefore I put
>>>>> the time-stamp for cpuidle enter/exit for checking current idling and
>>>>> calculating idle state usage correctly.
>>>>>
>>>>> I just make this patch temporary for my cpufreq governor work. So, it just
>>>>> use time-stamp for all idle state together. After RFC working, I have a plan
>>>>> to update this patch to use timestamp for each idle state.
>>>>
>>>> I suggest you look at the enter_idle / exit_idle function and make your
>>>> governor to subscribe to the IDLE_START/EXIT notifiers.
>>>>
>>>> arch/x86/kernel/process.c
>>>>
>>>> These are defined for the x86 architecture, maybe worth to add it to
>>>> another architecture.
>>>>
>>>
>>>
>>> Thanks for your opinion.
>>>
>>> Actually, I work on ARM architecture and I knew that the attempt of applying
>>> idle notifier was failed. You probably knew it, because the link you gave me
>>> before is that attempt. (https://lkml.org/lkml/2012/2/7/504) :)
>>
>> Yeah, now I recall this thread.
>>
>
>
> Oh my, I thought you gave the link but you didn't. It was Viresh Kumar from
> other patch of the patchset. Sorry.
>
>>> Currently, there
>>> is only notifying call which is for led in arch/arm/kernel/process.c and I think
>>> it isn't for me to use. Anyway, Do you really think it is better way to use
>>> notifier than my way? Because I think it is too heavy for me. On my board,
>>> sometimes entering idle happened hundreds times during the 100ms. I don't want
>>> to call notifier that much time. IMO, just moving local variable to per-cpu
>>> variable for showing the enter/exit time looks better although it requires code
>>> modification on cpudile side. What do you think?
>>
>> Sorry, but it is hard to figure out what you are trying to achieve with
>> a single patch.
>>
>> IIUC, you want to know how long the cpu is idle including the current
>> state, right ? So you need to know if the cpu is idle and when it
>> entered the idle state, correct ?
>>
>
>
> Exactly.
>
>
>> If the cpu is idle and the information is per cpu, how will you read
>> this value from another cpu without introducing a locking mechanism ?
>>
>
>
> I think it might be tolerated for incoherency of that data. Governor reads the
> data only, and if recoded start time or end time are different in few usec with
> real one then it doesn't effect to the result much.
>
>
>> Does it mean the cpufreq governor needs cpuidle ? I am wondering if
>> these informations shouldn't be retrieved from the scheduler, not from
>> cpuidle.
>>
>
>
> Yes, tick_sched per-cpu variable has all information that I need. But it isn't
> global variable. And I'm afraid to change static variable to global one as my
> pleases.
It is a global variable, but there is a function to get access to it:
extern struct tick_sched *tick_get_tick_sched(int cpu);
Does it fit better with what you want to achieve?
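A minimal sketch of reading it from a governor, assuming the 3.9-era
struct tick_sched fields idle_active, idle_entrytime and
idle_sleeptime (the same racy cross-cpu read caveat from earlier in
the thread applies):

	struct tick_sched *ts = tick_get_tick_sched(cpu);
	s64 idle_us = ktime_to_us(ts->idle_sleeptime);

	/* idle_active tells whether the cpu is idling right now;
	 * idle_entrytime is when the current idle period began. */
	if (ts->idle_active)
		idle_us += ktime_to_us(ktime_sub(ktime_get(),
						 ts->idle_entrytime));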
Thanks
-- Daniel
On 04/02/2013 08:18 PM, Daniel Lezcano wrote:
> On 04/02/2013 01:07 PM, [email protected] wrote:
>> On 04/02/2013 07:08 PM, Daniel Lezcano wrote:
>>
>>> On 04/02/2013 11:37 AM, [email protected] wrote:
>>>> On 04/02/2013 04:34 PM, Daniel Lezcano wrote:
>>>>
>>>>> On 04/02/2013 08:17 AM, [email protected] wrote:
>>>>>> On 04/02/2013 02:00 PM, Daniel Lezcano wrote:
>>>>>>
>>>>>>> On 04/01/2013 10:24 AM, Jonghwa Lee wrote:
>>>>>>>> This patch adds idle state time stamp to cpuidle device structure to
>>>>>>>> notify its current idle state. If last enter time is newer than last
>>>>>>>> exit time, then it means that the core is in idle now.
>>>>>>>>
>>>>>>>> Signed-off-by: Jonghwa Lee <[email protected]>
>>>>>>>> ---
>>>>>>>
>>>>>>> The patch description does not explain what problem you want to solve,
>>>>>>> how to solve it and the patch itself shows nothing.
>>>>>>>
>>>>>>> Could you elaborate ?
>>>>>>
>>>>>>
>>>>>> I'm sorry for lacking description. I supplement more.
>>>>>>
>>>>>> This patch does add time-stamp for idle enter/exit only nothing more.
>>>>>> The reason why I needed them is that I wanted to know current cpu idle
>>>>>> state. It is hard to know whether cpu is in idle or not now.
>>>>>
>>>>> Did you looked at:
>>>>>
>>>>> include/linux/sched.h:extern int idle_cpu(int cpu);
>>>>>
>>>>
>>>>
>>>> Yes, I did.
>>>>
>>>>> ?
>>>>>
>>>>>> When I check the cpuidle state usage, sometimes the information is wrong.
>>>>>> Because it is updated only when the cpu exits the idle state. So while the
>>>>>> cpu is idling, the cpuidle state usage holds past one. Therefore I put
>>>>>> the time-stamp for cpuidle enter/exit for checking current idling and
>>>>>> calculating idle state usage correctly.
>>>>>>
>>>>>> I just make this patch temporary for my cpufreq governor work. So, it just
>>>>>> use time-stamp for all idle state together. After RFC working, I have a plan
>>>>>> to update this patch to use timestamp for each idle state.
>>>>>
>>>>> I suggest you look at the enter_idle / exit_idle function and make your
>>>>> governor to subscribe to the IDLE_START/EXIT notifiers.
>>>>>
>>>>> arch/x86/kernel/process.c
>>>>>
>>>>> These are defined for the x86 architecture, maybe worth to add it to
>>>>> another architecture.
>>>>>
>>>>
>>>>
>>>> Thanks for your opinion.
>>>>
>>>> Actually, I work on ARM architecture and I knew that the attempt of applying
>>>> idle notifier was failed. You probably knew it, because the link you gave me
>>>> before is that attempt. (https://lkml.org/lkml/2012/2/7/504) :)
>>>
>>> Yeah, now I recall this thread.
>>>
>>
>>
>> Oh my, I thought you gave the link but you didn't. It was Viresh Kumar from
>> other patch of the patchset. Sorry.
>>
>>>> Currently, there
>>>> is only notifying call which is for led in arch/arm/kernel/process.c and I think
>>>> it isn't for me to use. Anyway, Do you really think it is better way to use
>>>> notifier than my way? Because I think it is too heavy for me. On my board,
>>>> sometimes entering idle happened hundreds times during the 100ms. I don't want
>>>> to call notifier that much time. IMO, just moving local variable to per-cpu
>>>> variable for showing the enter/exit time looks better although it requires code
>>>> modification on cpudile side. What do you think?
>>>
>>> Sorry, but it is hard to figure out what you are trying to achieve with
>>> a single patch.
>>>
>>> IIUC, you want to know how long the cpu is idle including the current
>>> state, right ? So you need to know if the cpu is idle and when it
>>> entered the idle state, correct ?
>>>
>>
>>
>> Exactly.
>>
>>
>>> If the cpu is idle and the information is per cpu, how will you read
>>> this value from another cpu without introducing a locking mechanism ?
>>>
>>
>>
>> I think it might be tolerated for incoherency of that data. Governor reads the
>> data only, and if recoded start time or end time are different in few usec with
>> real one then it doesn't effect to the result much.
>>
>>
>>> Does it mean the cpufreq governor needs cpuidle ? I am wondering if
>>> these informations shouldn't be retrieved from the scheduler, not from
>>> cpuidle.
>>>
>>
>>
>> Yes, tick_sched per-cpu variable has all information that I need. But it isn't
>> global variable. And I'm afraid to change static variable to global one as my
>> pleases.
>
> It is a global variable but there is a function to get access:
>
> extern struct tick_sched *tick_get_tick_sched(int cpu);
>
> Does it fit better for what you want to achieve ?
Yes, that seems to be exactly what I need. I'll check it.
Thanks for your advice :)
Thanks,
Jonghwa.
>
> Thanks
> -- Daniel
>
>
Hi Viresh,
First of all I'd like to apologize for the late response.
Please find my comments below.
> On Mon, Apr 1, 2013 at 1:54 PM, Jonghwa Lee
> <[email protected]> wrote:
> > <<Purpose>>
> > One of the problem of ondemand is that it considers the most busy
> > cpu only while doesn't care how many cpu is in busy state at the
> > moment. This may results in unnecessary power consumption, and it'll
> > be critical for the system having limited power source.
> >
> > To get the best energy efficiency, LAB governor considers not only
> > idle time but also the number of idle cpus. It primarily focuses on
> > supplying adequate performance to users within the limited resource.
> > It checks the number of idle cpus then controls the maximum
> > frequency dynamically. And it also applies different frequency
> > increasing level depends on that information. In simple terms the
> > less the number of busy cpus, the better performance will be given.
> > In addition, stable system's power consumption in the busy state
> > can be achieved also with using LAB governor. This will help to
> > manage and estimate power consumption in certain system.
>
> Hi Jonghwa,
>
> First of all, i should accept that i haven't got to the minute details
> about your
> patch until now but have done a broad review of it.
>
> There are many things that i am concerned about:
> - I don't want an additional governor to be added to cpufreq unless
> there is a very very strong reason for it. See what happened to
> earlier attempts:
>
> https://lkml.org/lkml/2012/2/7/504
>
> But it doesn't mean you can't get it in. :)
>
> - What the real logic behind your patchset: I haven't got it
> completely with your
> mails. So what you said is:
>
> - The fewer the busy cpus, the higher the freqs you want to run at
> - The more the busy cpus, the lower the freqs you want to run at
>
> But the basic idea I had about this stuff was: the more busy cpus there
> are, the more loaded the system is (otherwise the scheduler wouldn't have
> used so many cpus), and so there is a need to run at a higher frequency
> rather than a lower one. That would save power in a sense.. finish the
> work early and let most of the cpus enter the idle state as early as
> possible. But with your solution we would run at lower frequencies, and
> so these cpus will take longer to get into the idle state again. This
> will really waste a lot of power.
Our approach is a bit different from the cpufreq_ondemand one. Ondemand
takes the per-CPU idle time, then on that basis calculates the per-CPU
load. The next step is to choose the highest load and then use this value
to properly scale the frequency.
On the other hand, LAB tries to model different behavior:
As a first step we applied Vincent Guittot's "pack small tasks" [*]
patch to improve "race to idle" behavior:
http://article.gmane.org/gmane.linux.kernel/1371435/match=sched+pack+small+tasks
Afterwards, we decided to investigate a different approach to power
governing:
Use the number of sleeping CPUs (not the maximal per-CPU load) to
change the frequency. We therefore depend on [*] to "pack" as many tasks
onto a CPU as possible and allow the others to sleep.
On this basis we can increase (even overclock) the frequency (when other
CPUs sleep) to improve the performance of the only running CPU.
Conversely, when all cores are heavily loaded, we decided to reduce the
frequency by around 30%. With this approach the reduction in user
experience is still acceptable (with much less power consumption).
When the system is "idle" - only small background tasks are running - the
frequency is reduced to the minimum.
To sum up:
A different input (the number of idle CPUs) is used by us to control the
ondemand governor. We also heavily depend on the [*] patch set.
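To make the intended policy concrete, here is an illustrative sketch only;
the helper name and the step constant are made up for this mail and are not
the actual RFC code:

#include <linux/cpufreq.h>
#include <linux/kernel.h>

#define LAB_STEP_KHZ	100000	/* made-up tunable: step per idle cpu */

static unsigned int lab_next_freq(struct cpufreq_policy *policy,
				  unsigned int cur_freq, int nr_idle_cpus)
{
	/* All cores busy: cap the reachable maximum by roughly 30%.
	 * Otherwise the full (or even boosted) range stays available. */
	unsigned int max = nr_idle_cpus ? policy->max
					: policy->max * 70 / 100;

	/* The increase step grows with the number of idle cpus; when
	 * everything is idle the low load itself keeps us at minimum. */
	return min_t(unsigned int,
		     cur_freq + nr_idle_cpus * LAB_STEP_KHZ, max);
}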
>
> Think about it.
>
> - In case you need some sort of support on this use case, why
> replicate ondemand governor again by creating another governor. I
> have had some hard time removing the amount of redundancy inside
> governors and you are again going towards that direction. Modifying
> ondemand governor for this response would be a better option.
We have only posted the "RFC", so we are open to suggestions.
In the cpufreq_governor.c file, the dbs_check_cpu() function is responsible
for calculating the maximal load (among CPUs). I think that we could also
count the number of sleeping CPUs there (as an average of the
time-accumulated data). Then we could pass this data to a newly written
function (e.g. lab_check_cpu()) defined in cpufreq_ondemand.c (and pointed
to by gov_check_cpu).
This would require a change to the dbs_check_cpu() function and extending
the ONDEMAND governor with a lab_check_cpu() function.
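Roughly like this; a sketch with simplified signatures and made-up names
(the real dbs_check_cpu() iterates policy->cpus and derives the load from
idle-time deltas):

struct dbs_result {
	unsigned int max_load;		/* what ondemand consumes today */
	unsigned int nr_idle_cpus;	/* the extra LAB input */
};

/* One pass over the per-cpu loads produces both values. */
static void dbs_check_cpu_sketch(const unsigned int *load_pct, int nr_cpus,
				 unsigned int idle_thresh_pct,
				 struct dbs_result *res)
{
	int cpu;

	res->max_load = 0;
	res->nr_idle_cpus = 0;
	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (load_pct[cpu] > res->max_load)
			res->max_load = load_pct[cpu];
		/* a cpu counts as idle when its idle ratio beats the
		 * threshold */
		if (100 - load_pct[cpu] >= idle_thresh_pct)
			res->nr_idle_cpus++;
	}
}

lab_check_cpu() would then scale the frequency from res->nr_idle_cpus
instead of, or in addition to, res->max_load.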
>
> - You haven't rebased on the latest code from linux-next :)
>
We have posted this "RFC" patch mainly for discussion, and I think it
fits its purpose :-).
--
Best regards,
Lukasz Majewski
Samsung R&D Poland (SRPOL) | Linux Platform Group
On 9 April 2013 16:07, Lukasz Majewski <[email protected]> wrote:
>> On Mon, Apr 1, 2013 at 1:54 PM, Jonghwa Lee
> Our approach is a bit different than cpufreq_ondemand one. Ondemand
> takes the per CPU idle time, then on that basis calculates per cpu load.
> The next step is to choose the highest load and then use this value to
> properly scale frequency.
>
> On the other hand LAB tries to model different behavior:
>
> As a first step we applied Vincent Guittot's "pack small tasks" [*]
> patch to improve "race to idle" behavior:
> http://article.gmane.org/gmane.linux.kernel/1371435/match=sched+pack+small+tasks
Luckily he is part of my team :)
http://www.linaro.org/linux-on-arm/meet-the-team/power-management
BTW, he is using ondemand governor for all his work.
> Afterwards, we decided to investigate different approach for power
> governing:
>
> Use the number of sleeping CPUs (not the maximal per-CPU load) to
> change frequency. We thereof depend on [*] to "pack" as many tasks to
> CPU as possible and allow other to sleep.
He packs only small tasks. And if there are many small tasks that we are
packing, then the load must be high, and so the ondemand gov will increase
the freq.
> Contrary, when all cores are heavily loaded, we decided to reduce
> frequency by around 30%. With this approach user experience reduction is
> still acceptable (with much less power consumption).
I don't know.. running many cpus at a lower freq for a long duration will
probably take more power than running them at a high freq for a short
duration and making the system idle again.
> We have posted this "RFC" patch mainly for discussion, and I think it
> fits its purpose :-).
Yes, no issues with your RFC idea.. it's perfect..
@Vincent: Can you please follow this thread a bit and tell us what your views
are?
--
viresh
Hi, sorry for my late reply.
I just want to add a comment to support Lukasz's.
Please find my comments below Lukasz's.
On 9 April 2013 19:37, Lukasz Majewski wrote:
> Hi Viresh,
>
> First of all I'd like to apologize for a late response.
> Please find my comments below.
>
>> On Mon, Apr 1, 2013 at 1:54 PM, Jonghwa Lee
>> <[email protected]> wrote:
>>> <<Purpose>>
>>> One of the problem of ondemand is that it considers the most busy
>>> cpu only while doesn't care how many cpu is in busy state at the
>>> moment. This may results in unnecessary power consumption, and it'll
>>> be critical for the system having limited power source.
>>>
>>> To get the best energy efficiency, LAB governor considers not only
>>> idle time but also the number of idle cpus. It primarily focuses on
>>> supplying adequate performance to users within the limited resource.
>>> It checks the number of idle cpus then controls the maximum
>>> frequency dynamically. And it also applies different frequency
>>> increasing level depends on that information. In simple terms the
>>> less the number of busy cpus, the better performance will be given.
>>> In addition, stable system's power consumption in the busy state
>>> can be achieved also with using LAB governor. This will help to
>>> manage and estimate power consumption in certain system.
>>
>> Hi Jonghwa,
>>
>> First of all, i should accept that i haven't got to the minute details
>> about your
>> patch until now but have done a broad review of it.
>>
>> There are many things that i am concerned about:
>> - I don't want an additional governor to be added to cpufreq unless
>> there is a very very strong reason for it. See what happened to
>> earlier attempts:
>>
>> https://lkml.org/lkml/2012/2/7/504
>>
>> But it doesn't mean you can't get it in. :)
>>
>> - What the real logic behind your patchset: I haven't got it
>> completely with your
>> mails. So what you said is:
>>
>> - The lesser the number of busy cpus: you want to run at higher
>> freqs
>> - The more the number of busy cpus: you want to run at lower freqs
>>
>> But the basic idea i had about this stuff was: The more the number of
>> busy cpus, the more loaded the system is, otherwise scheduler wouldn't
>> have used so many cpus and so there is need to run at higher frequency
>> rather than a lower one. Which would save power in a sense.. Finish
>> work early and let most of the cpus enter idle state as early as
>> possible. But with your solution we would run at lower frequencies
>> and so these cpus will take longer to get into idle state again. This
>> will really kill lot of power.
>
> Our approach is a bit different than cpufreq_ondemand one. Ondemand
> takes the per CPU idle time, then on that basis calculates per cpu load.
> The next step is to choose the highest load and then use this value to
> properly scale frequency.
>
> On the other hand LAB tries to model different behavior:
>
> As a first step we applied Vincent Guittot's "pack small tasks" [*]
> patch to improve "race to idle" behavior:
> http://article.gmane.org/gmane.linux.kernel/1371435/match=sched+pack+small+tasks
>
> Afterwards, we decided to investigate different approach for power
> governing:
>
> Use the number of sleeping CPUs (not the maximal per-CPU load) to
> change frequency. We thereof depend on [*] to "pack" as many tasks to
> CPU as possible and allow other to sleep.
> On this basis we can increase (even overclock) frequency (when other
> CPUs sleep) to improve performance of the only running CPU.
>
> Contrary, when all cores are heavily loaded, we decided to reduce
> frequency by around 30%. With this approach user experience reduction is
> still acceptable (with much less power consumption).
> When system is "idle" - only small background tasks are running, the
> frequency is reduced to minimum.
>
> To sum up:
>
> Different approach (number of idle CPUs) is used by us to control
> ondemand governor. We also heavily depend on [*] patch set.
>
In addition, it is hard to say that just giving high performance to a busy
system is the best way to reduce power consumption. Yes, as Viresh said, we
can shorten the busy period, but that is not always a perfect solution. If
we push all CPUs to keep performance as high as they can, the temperature
increases rapidly. In my test, the ondemand governor reached the thermal
limits frequently, while the LAB governor didn't. High temperature also
increases power consumption: I measured roughly a 10% difference in power
consumption between the two temperature conditions below.
<<Consumed power with maximum frequency at different temperatures>>
Temperature    Power consumption (mWh)    Loss (%)
65'C           53.89                      Base
80'C           59.88                      10
So, to reduce power consumption, it looks like we have to take more care to
avoid situations where the system reaches a high temperature. This matters
even more in the mobile environment: in mobile devices, high temperature
also hurts the user experience and cannot be brought down easily.
Maybe the temperature problem is not the duty of a cpufreq governor.
However, if we just look at the power consumption gain, we can still find a
justification for this governor.
<<Consumed power during web browsing with foreground movie playback>>
Governor    Power consumption (mWh)    Percentage (%)
ondemand    70.28                      100
lab         55.14                      78
===> Almost 22% power consumption gain.
<<Results of various performance tests>>

## V8 benchmark (http://v8.googlecode.com)
Governor    Score      Percentage (%)
ondemand    1736.75    100
lab         1576.75    91
===> 10% performance loss.

## Dhrystone
Governor    Dhrystones    Percentage (%)
ondemand    2362204       100
lab         2197802       93
===> 7% performance loss.

## Egypt GLBenchmark (http://www.glbenchmark.com)
Governor    Frames    FPS
ondemand    5612      49.65
lab         5559      49.19
===> Almost no difference.

(All of the above tests were run on an EXYNOS4412 SoC based board.)
Thanks,
Jonghwa
>>
>> Think about it.
>>
>> - In case you need some sort of support on this use case, why
>> replicate ondemand governor again by creating another governor. I
>> have had some hard time removing the amount of redundancy inside
>> governors and you are again going towards that direction. Modifying
>> ondemand governor for this response would be a better option.
>
> We have only posted the "RFC" so, we are open for suggestions.
>
> At cpufreq_governor.c file the dbs_check_cpu function is responsible
> for calculating the maximal load (among CPUs). I think that we could
> also count the number of sleeping CPUs (as the average of time
> accumulated data). Then we could pass this data to newly written
> function (e.g. lab_check_cpu) defined at cpufreq_ondemand.c (and
> pointed to by gov_check_cpu).
>
> This would require change to dbs_check_cpu function and extending
> ONDEMAND governor by lab_check_cpu() function.
>
>>
>> - You haven't rebased of latest code from linux-next :)
>>
>
> We have posted this "RFC" patch mainly for discussion, and I think it
> fits its purpose :-).
>
>
>
Hi Viresh and Vincent,
> On 9 April 2013 16:07, Lukasz Majewski <[email protected]> wrote:
> >> On Mon, Apr 1, 2013 at 1:54 PM, Jonghwa Lee
> > Our approach is a bit different than cpufreq_ondemand one. Ondemand
> > takes the per CPU idle time, then on that basis calculates per cpu
> > load. The next step is to choose the highest load and then use this
> > value to properly scale frequency.
> >
> > On the other hand LAB tries to model different behavior:
> >
> > As a first step we applied Vincent Guittot's "pack small tasks" [*]
> > patch to improve "race to idle" behavior:
> > http://article.gmane.org/gmane.linux.kernel/1371435/match=sched+pack+small+tasks
>
> Luckily he is part of my team :)
>
> http://www.linaro.org/linux-on-arm/meet-the-team/power-management
>
> BTW, he is using ondemand governor for all his work.
>
> > Afterwards, we decided to investigate different approach for power
> > governing:
> >
> > Use the number of sleeping CPUs (not the maximal per-CPU load) to
> > change frequency. We thereof depend on [*] to "pack" as many tasks
> > to CPU as possible and allow other to sleep.
>
> He packs only small tasks.
What about packing not only small tasks? I will investigate the
possibility of aggressively packing (even at the cost of some performance
degradation) as many tasks as possible onto a single CPU.
It seems like a good idea for power consumption reduction.
> And if there are many small tasks we are
> packing, then load must be high and so ondemand gov will increase
> freq.
This is of course true for "packing" all tasks onto a single CPU. If we
stay within the power consumption envelope, we can even overclock the
frequency.
But what if the other CPUs - let's say 3 of them - are under heavy workload?
Ondemand will switch the frequency to maximum, and as Jonghwa pointed out
this can cause a dangerous temperature increase.
>
> > Contrary, when all cores are heavily loaded, we decided to reduce
> > frequency by around 30%. With this approach user experience
> > reduction is still acceptable (with much less power consumption).
>
> Don't know.. running many cpus at lower freq for long duration will
> probably take more power than running them at high freq for short
> duration and making system idle again.
>
> > We have posted this "RFC" patch mainly for discussion, and I think
> > it fits its purpose :-).
>
> Yes, no issues with your RFC idea.. its perfect..
>
> @Vincent: Can you please follow this thread a bit and tell us what
> your views are?
>
> --
> viresh
--
Best regards,
Lukasz Majewski
Samsung R&D Poland (SRPOL) | Linux Platform Group
On 9 April 2013 20:52, Vincent Guittot <[email protected]> wrote:
>
>
> On Tuesday, 9 April 2013, Lukasz Majewski <[email protected]> wrote:
>> Hi Viresh and Vincent,
>>
>>> On 9 April 2013 16:07, Lukasz Majewski <[email protected]> wrote:
>>> >> On Mon, Apr 1, 2013 at 1:54 PM, Jonghwa Lee
>>> > Our approach is a bit different than cpufreq_ondemand one. Ondemand
>>> > takes the per CPU idle time, then on that basis calculates per cpu
>>> > load. The next step is to choose the highest load and then use this
>>> > value to properly scale frequency.
>>> >
>>> > On the other hand LAB tries to model different behavior:
>>> >
>>> > As a first step we applied Vincent Guittot's "pack small tasks" [*]
>>> > patch to improve "race to idle" behavior:
>>> >
>>> > http://article.gmane.org/gmane.linux.kernel/1371435/match=sched+pack+small+tasks
>>>
>>> Luckily he is part of my team :)
>>>
>>> http://www.linaro.org/linux-on-arm/meet-the-team/power-management
>>>
>>> BTW, he is using ondemand governor for all his work.
>>>
>>> > Afterwards, we decided to investigate different approach for power
>>> > governing:
>>> >
>>> > Use the number of sleeping CPUs (not the maximal per-CPU load) to
>>> > change frequency. We thereof depend on [*] to "pack" as many tasks
>>> > to CPU as possible and allow other to sleep.
>>>
>>> He packs only small tasks.
>>
>> What's about packing not only small tasks? I will investigate the
>> possibility to aggressively pack (even with a cost of performance
>> degradation) as many tasks as possible to a single CPU.
>
> Hi Lukasz,
>
> I've got the same comment on my current patch and I'm preparing a new version
> that can pack tasks more aggressively based on the same buddy mechanism. This
> will be done at the cost of performance, of course.
>
>
>
>>
>> It seems a good idea for a power consumption reduction.
>
> In fact, it's not always true and depends on several inputs, like the number
> of tasks that run simultaneously
>
>
>>
>>> And if there are many small tasks we are
>>> packing, then load must be high and so ondemand gov will increase
>>> freq.
>>
>> This is of course true for "packing" all tasks to a single CPU. If we
>> stay at the power consumption envelope, we can even overclock the
>> frequency.
>>
>> But what if other - lets say 3 CPUs - are under heavy workload?
>> Ondemand will switch frequency to maximum, and as Jonghwa pointed out
>> this can cause dangerous temperature increase.
>
> IIUC, your main concern is to stay within a power consumption budget so as not
> to overheat and face the side effects of high temperature, like a decrease in
> power efficiency. So your governor modifies the max frequency based on the
> number of running/idle CPUs to have an almost stable power consumption?
>
> Have you also looked at the power clamp driver, which has a similar target?
>
>
> Vincent
>
>
>>
>>>
>>> > Contrary, when all cores are heavily loaded, we decided to reduce
>>> > frequency by around 30%. With this approach user experience
>>> > reduction is still acceptable (with much less power consumption).
>>>
>>> Don't know.. running many cpus at lower freq for long duration will
>>> probably take more power than running them at high freq for short
>>> duration and making system idle again.
>>>
>>> > We have posted this "RFC" patch mainly for discussion, and I think
>>> > it fits its purpose :-).
>>>
>>> Yes, no issues with your RFC idea.. its perfect..
>>>
>>> @Vincent: Can you please follow this thread a bit and tell us what
>>> your views are?
>>>
>>> --
>>> viresh
>>
>>
>>
>> --
>> Best regards,
>>
>> Lukasz Majewski
>>
>> Samsung R&D Poland (SRPOL) | Linux Platform Group
>>
Hi Vincent,
>
>
> On Tuesday, 9 April 2013, Lukasz Majewski <[email protected]>
> wrote:
> > Hi Viresh and Vincent,
> >
> >> On 9 April 2013 16:07, Lukasz Majewski <[email protected]>
> >> wrote:
> >> >> On Mon, Apr 1, 2013 at 1:54 PM, Jonghwa Lee
> >> > Our approach is a bit different than cpufreq_ondemand one.
> >> > Ondemand takes the per CPU idle time, then on that basis
> >> > calculates per cpu load. The next step is to choose the highest
> >> > load and then use this value to properly scale frequency.
> >> >
> >> > On the other hand LAB tries to model different behavior:
> >> >
> >> > As a first step we applied Vincent Guittot's "pack small
> >> > tasks" [*] patch to improve "race to idle" behavior:
> >> > http://article.gmane.org/gmane.linux.kernel/1371435/match=sched+pack+small+tasks
> >>
> >> Luckily he is part of my team :)
> >>
> >> http://www.linaro.org/linux-on-arm/meet-the-team/power-management
> >>
> >> BTW, he is using ondemand governor for all his work.
> >>
> >> > Afterwards, we decided to investigate different approach for
> >> > power governing:
> >> >
> >> > Use the number of sleeping CPUs (not the maximal per-CPU load) to
> >> > change frequency. We thereof depend on [*] to "pack" as many
> >> > tasks to CPU as possible and allow other to sleep.
> >>
> >> He packs only small tasks.
> >
> > What's about packing not only small tasks? I will investigate the
> > possibility to aggressively pack (even with a cost of performance
> > degradation) as many tasks as possible to a single CPU.
>
> Hi Lukasz,
>
> I've got same comment on my current patch and I'm preparing a new
> version that can pack tasks more agressively based on the same buddy
> mecanism. This will be done at the cost of performance of course.
Can you share your development tree?
>
>
> >
> > It seems a good idea for a power consumption reduction.
>
> In fact, it's not always true and depends several inputs like the
> number of tasks that run simultaneously
In my understanding, we can try to couple (affine) the maximal number of
tasks with a CPU. Performance will decrease, but we will avoid the costs of
task migration.
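Just to illustrate what I mean by coupling, from userspace this is a plain
sched_setaffinity() call (a minimal example, nothing LAB specific):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);	/* allow the caller (pid 0) to run on CPU0 only */

	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");

	/* ... the workload below this point can no longer migrate ... */
	return 0;
}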
If I remember correctly, I've asked you about some testbench/test
program for scheduler evaluation. I assume that nothing has changed and
there isn't any "common" set of scheduler tests?
>
> >
> >> And if there are many small tasks we are
> >> packing, then load must be high and so ondemand gov will increase
> >> freq.
> >
> > This is of course true for "packing" all tasks to a single CPU. If
> > we stay at the power consumption envelope, we can even overclock the
> > frequency.
> >
> > But what if other - lets say 3 CPUs - are under heavy workload?
> > Ondemand will switch frequency to maximum, and as Jonghwa pointed
> > out this can cause dangerous temperature increase.
>
> IIUC, your main concern is to stay in a power consumption budget to
> not over heat and have to face the side effect of high temperature
> like a decrease of power efficiency. So your governor modifies the
> max frequency based on the number of running/idle CPU
Yes, this is correct.
> to have an
> almost stable power consumption ?
From our observation it seems that for 3 or 4 running CPUs under heavy
load we see a much larger power consumption reduction.
To put it another way - ondemand would increase the frequency to max for
all 4 CPUs. On the other hand, if the user experience drops to a still
acceptable level, we can reduce power consumption.
Reducing the frequency and CPU voltage (by DVS) has the side effect
that the temperature stays at an acceptable level.
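(As a rough first-order justification, not a number from our measurements:
dynamic CPU power scales approximately as P ~ C * V^2 * f, so lowering f
together with V cuts power superlinearly, and the die temperature follows.)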
>
> Have you also looked at the power clamp driver that have similar
> target ?
I might be wrong here, but in my opinion the power clamp driver is a bit
different:
1. It is dedicated to Intel SoCs, which provide a special set of
registers (i.e. MSR_PKG_Cx_RESIDENCY [*]) which force a processor to
enter a certain C state for a given duration. The idle duration is
calculated by a per-CPU set of high-priority kthreads (which also
program the [*] registers).
2. ARM SoCs don't have such infrastructure, so we depend on SW here.
The scheduler has to remove tasks from a particular CPU and "execute"
the idle_task on it.
Moreover, on Exynos4 the thermal control loop depends on SW, since we
can only read the SoC temperature via the TMU (Thermal Management Unit)
block.
Correct me again, but it seems to me that on ARM we can use CPU hotplug
(which, as Thomas Gleixner stated recently, is going to be "refactored"
:-) ) or "ask" the scheduler to use the smallest possible number of CPUs
and enter a C state on idling CPUs.
>
>
> Vincent
>
> >
> >>
> >> > Contrary, when all cores are heavily loaded, we decided to reduce
> >> > frequency by around 30%. With this approach user experience
> >> > reduction is still acceptable (with much less power consumption).
> >>
> >> Don't know.. running many cpus at lower freq for long duration will
> >> probably take more power than running them at high freq for short
> >> duration and making system idle again.
> >>
> >> > We have posted this "RFC" patch mainly for discussion, and I
> >> > think it fits its purpose :-).
> >>
> >> Yes, no issues with your RFC idea.. its perfect..
> >>
> >> @Vincent: Can you please follow this thread a bit and tell us what
> >> your views are?
> >>
> >> --
> >> viresh
> >
> >
> >
> > --
> > Best regards,
> >
> > Lukasz Majewski
> >
> > Samsung R&D Poland (SRPOL) | Linux Platform Group
> >
--
Best regards,
Lukasz Majewski
Samsung R&D Poland (SRPOL) | Linux Platform Group
On 10 April 2013 10:44, Lukasz Majewski <[email protected]> wrote:
> Hi Vincent,
>
>>
>>
>> On Tuesday, 9 April 2013, Lukasz Majewski <[email protected]>
>> wrote:
>> > Hi Viresh and Vincent,
>> >
>> >> On 9 April 2013 16:07, Lukasz Majewski <[email protected]>
>> >> wrote:
>> >> >> On Mon, Apr 1, 2013 at 1:54 PM, Jonghwa Lee
>> >> > Our approach is a bit different than cpufreq_ondemand one.
>> >> > Ondemand takes the per CPU idle time, then on that basis
>> >> > calculates per cpu load. The next step is to choose the highest
>> >> > load and then use this value to properly scale frequency.
>> >> >
>> >> > On the other hand LAB tries to model different behavior:
>> >> >
>> >> > As a first step we applied Vincent Guittot's "pack small
>> >> > tasks" [*] patch to improve "race to idle" behavior:
>> >> > http://article.gmane.org/gmane.linux.kernel/1371435/match=sched+pack+small+tasks
>> >>
>> >> Luckily he is part of my team :)
>> >>
>> >> http://www.linaro.org/linux-on-arm/meet-the-team/power-management
>> >>
>> >> BTW, he is using ondemand governor for all his work.
>> >>
>> >> > Afterwards, we decided to investigate different approach for
>> >> > power governing:
>> >> >
>> >> > Use the number of sleeping CPUs (not the maximal per-CPU load) to
>> >> > change frequency. We thereof depend on [*] to "pack" as many
>> >> > tasks to CPU as possible and allow other to sleep.
>> >>
>> >> He packs only small tasks.
>> >
>> > What's about packing not only small tasks? I will investigate the
>> > possibility to aggressively pack (even with a cost of performance
>> > degradation) as many tasks as possible to a single CPU.
>>
>> Hi Lukasz,
>>
>> I've got same comment on my current patch and I'm preparing a new
>> version that can pack tasks more agressively based on the same buddy
>> mecanism. This will be done at the cost of performance of course.
>
> Can you share your development tree?
The development is not finished yet, but I will share it as soon as possible.
>
>>
>>
>> >
>> > It seems a good idea for a power consumption reduction.
>>
>> In fact, it's not always true and depends several inputs like the
>> number of tasks that run simultaneously
>
> In my understanding, we can try to couple (affine) maximal number of
> task with a CPU. Performance shall decrease, but we will avoid costs of
> tasks migration.
>
> If I remember correctly, I've asked you about some testbench/test
> program for scheduler evaluation. I assume that nothing has changed and
> there isn't any "common" set of scheduler tests?
There are a bunch of benchmarks that are used to evaluate the scheduler,
like hackbench and pgbench, but they generally fill all CPUs in order to
test max performance. Are you looking for that kind of benchmark?
>
>>
>> >
>> >> And if there are many small tasks we are
>> >> packing, then load must be high and so ondemand gov will increase
>> >> freq.
>> >
>> > This is of course true for "packing" all tasks to a single CPU. If
>> > we stay at the power consumption envelope, we can even overclock the
>> > frequency.
>> >
>> > But what if other - lets say 3 CPUs - are under heavy workload?
>> > Ondemand will switch frequency to maximum, and as Jonghwa pointed
>> > out this can cause dangerous temperature increase.
>>
>> IIUC, your main concern is to stay in a power consumption budget to
>> not over heat and have to face the side effect of high temperature
>> like a decrease of power efficiency. So your governor modifies the
>> max frequency based on the number of running/idle CPU
> Yes, this is correct.
>
>> to have an
>> almost stable power consumption ?
>
> From our observation it seems, that for 3 or 4 running CPUs under heavy
> load we see much more power consumption reduction.
That's logical, because you will reduce the voltage.
>
> To put it in another way - ondemand would increase frequency to max for
> all 4 CPUs. On the other hand, if user experience drops to the
> acceptable level we can reduce power consumption.
>
> Reducing frequency and CPU voltage (by DVS) causes as a side effect,
> that temperature stays at acceptable level.
>
>>
>> Have you also looked at the power clamp driver that have similar
>> target ?
>
> I might be wrong here, but in my opinion the power clamp driver is a bit
> different:
Yes, it periodically forces the cluster into a low power state.
>
> 1. It is dedicated to Intel SoCs, which provide special set of
> registers (i.e. MSR_PKG_Cx_RESIDENCY [*]), which forces a processor to
> enter certain C state for a given duration. Idle duration is calculated
> by per CPU set of high priority kthreads (which also program [*]
> registers).
IIRC, a trial on an ARM platform has been done by Lorenzo and Daniel.
Lorenzo, Daniel, do you have more information?
>
> 2. ARM SoCs don't have such infrastructure, so we depend on SW here.
> Scheduler has to remove tasks from a particular CPU and "execute" on
> it the idle_task.
> Moreover at Exynos4 thermal control loop depends on SW, since we can
> only read SoC temperature via TMU (Thermal Management Unit) block.
The idle duration is quite small and should not perturb normal behavior
Vincent
>
>
> Correct me again, but it seems to me that on ARM we can use CPU hotplug
> (which as Thomas Gleixner stated recently is going to be "refactored" :-)
> ) or "ask" scheduler to use smallest possible number of CPUs and enter C
> state for idling CPUs.
>
>
>
>>
>>
>> Vincent
>>
>> >
>> >>
>> >> > Contrary, when all cores are heavily loaded, we decided to reduce
>> >> > frequency by around 30%. With this approach user experience
>> >> > reduction is still acceptable (with much less power consumption).
>> >>
>> >> Don't know.. running many cpus at lower freq for long duration will
>> >> probably take more power than running them at high freq for short
>> >> duration and making system idle again.
>> >>
>> >> > We have posted this "RFC" patch mainly for discussion, and I
>> >> > think it fits its purpose :-).
>> >>
>> >> Yes, no issues with your RFC idea.. its perfect..
>> >>
>> >> @Vincent: Can you please follow this thread a bit and tell us what
>> >> your views are?
>> >>
>> >> --
>> >> viresh
>> >
>> >
>> >
>> > --
>> > Best regards,
>> >
>> > Lukasz Majewski
>> >
>> > Samsung R&D Poland (SRPOL) | Linux Platform Group
>> >
>
>
> --
> Best regards,
>
> Lukasz Majewski
>
> Samsung R&D Poland (SRPOL) | Linux Platform Group
Hi Vincent,
> On 10 April 2013 10:44, Lukasz Majewski <[email protected]>
> wrote:
> > Hi Vincent,
> >
> >>
> >>
> >> On Tuesday, 9 April 2013, Lukasz Majewski <[email protected]>
> >> wrote:
> >> > Hi Viresh and Vincent,
> >> >
> >> >> On 9 April 2013 16:07, Lukasz Majewski <[email protected]>
> >> >> wrote:
> >> >> >> On Mon, Apr 1, 2013 at 1:54 PM, Jonghwa Lee
> >> >> > Our approach is a bit different than cpufreq_ondemand one.
> >> >> > Ondemand takes the per CPU idle time, then on that basis
> >> >> > calculates per cpu load. The next step is to choose the
> >> >> > highest load and then use this value to properly scale
> >> >> > frequency.
> >> >> >
> >> >> > On the other hand LAB tries to model different behavior:
> >> >> >
> >> >> > As a first step we applied Vincent Guittot's "pack small
> >> >> > tasks" [*] patch to improve "race to idle" behavior:
> >> >> > http://article.gmane.org/gmane.linux.kernel/1371435/match=sched+pack+small+tasks
> >> >>
> >> >> Luckily he is part of my team :)
> >> >>
> >> >> http://www.linaro.org/linux-on-arm/meet-the-team/power-management
> >> >>
> >> >> BTW, he is using ondemand governor for all his work.
> >> >>
> >> >> > Afterwards, we decided to investigate different approach for
> >> >> > power governing:
> >> >> >
> >> >> > Use the number of sleeping CPUs (not the maximal per-CPU
> >> >> > load) to change frequency. We thereof depend on [*] to "pack"
> >> >> > as many tasks to CPU as possible and allow other to sleep.
> >> >>
> >> >> He packs only small tasks.
> >> >
> >> > What's about packing not only small tasks? I will investigate the
> >> > possibility to aggressively pack (even with a cost of performance
> >> > degradation) as many tasks as possible to a single CPU.
> >>
> >> Hi Lukasz,
> >>
> >> I've got same comment on my current patch and I'm preparing a new
> >> version that can pack tasks more agressively based on the same
> >> buddy mecanism. This will be done at the cost of performance of
> >> course.
> >
> > Can you share your development tree?
>
> The dev is not finished yet but i will share it as soon as possible
Ok
>
> >
> >>
> >>
> >> >
> >> > It seems a good idea for a power consumption reduction.
> >>
> >> In fact, it's not always true and depends several inputs like the
> >> number of tasks that run simultaneously
> >
> > In my understanding, we can try to couple (affine) maximal number of
> > task with a CPU. Performance shall decrease, but we will avoid
> > costs of tasks migration.
> >
> > If I remember correctly, I've asked you about some testbench/test
> > program for scheduler evaluation. I assume that nothing has changed
> > and there isn't any "common" set of scheduler tests?
>
> There are a bunch of bench that are used to evaluate scheduler like
> hackbench, pgbench but they generally fills all CPU in order to test
> max performance. Are you looking for such kind of bench ?
I'd rather see a somewhat different set of tests - something similar to
the "cyclictest" tests used for the PREEMPT_RT patch.
For sched work it would be welcome to spawn a lot of processes with
different durations and workloads, and on this basis observe whether e.g.
2 or 3 processors stay idle.
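Something in this spirit would do; a made-up mini-harness (not an existing
benchmark) with workers of different duty cycles over a fixed 100 ms
period, so one can watch via /proc/stat or powertop how many CPUs stay
idle:

#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* burn cpu time for roughly 'ms' milliseconds */
static void busy_loop_ms(long ms)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000 +
		 (now.tv_nsec - start.tv_nsec) / 1000000 < ms);
}

int main(void)
{
	int i;

	/* 8 workers: worker i burns i*10 ms and sleeps the rest of each
	 * 100 ms period, i.e. per-task loads from 0% up to 70%. */
	for (i = 0; i < 8; i++) {
		if (fork() == 0) {
			int period;

			for (period = 0; period < 300; period++) { /* ~30 s */
				busy_loop_ms(i * 10);
				usleep((100 - i * 10) * 1000);
			}
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}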
>
> >
> >>
> >> >
> >> >> And if there are many small tasks we are
> >> >> packing, then load must be high and so ondemand gov will
> >> >> increase freq.
> >> >
> >> > This is of course true for "packing" all tasks to a single CPU.
> >> > If we stay at the power consumption envelope, we can even
> >> > overclock the frequency.
> >> >
> >> > But what if other - lets say 3 CPUs - are under heavy workload?
> >> > Ondemand will switch frequency to maximum, and as Jonghwa pointed
> >> > out this can cause dangerous temperature increase.
> >>
> >> IIUC, your main concern is to stay in a power consumption budget to
> >> not over heat and have to face the side effect of high temperature
> >> like a decrease of power efficiency. So your governor modifies the
> >> max frequency based on the number of running/idle CPU
> > Yes, this is correct.
> >
> >> to have an
> >> almost stable power consumption ?
> >
> > From our observation it seems, that for 3 or 4 running CPUs under
> > heavy load we see much more power consumption reduction.
>
> That's logic because you will reduce the voltage
>
> >
> > To put it in another way - ondemand would increase frequency to max
> > for all 4 CPUs. On the other hand, if user experience drops to the
> > acceptable level we can reduce power consumption.
> >
> > Reducing frequency and CPU voltage (by DVS) causes as a side effect,
> > that temperature stays at acceptable level.
> >
> >>
> >> Have you also looked at the power clamp driver that have similar
> >> target ?
> >
> > I might be wrong here, but in my opinion the power clamp driver is
> > a bit different:
>
> yes, it periodically forces the cluster in a low power state
>
> >
> > 1. It is dedicated to Intel SoCs, which provide special set of
> > registers (i.e. MSR_PKG_Cx_RESIDENCY [*]), which forces a processor
> > to enter certain C state for a given duration. Idle duration is
> > calculated by per CPU set of high priority kthreads (which also
> > program [*] registers).
>
> IIRC, a trial on ARM platform have been done by lorenzo and daniel.
> Lorenzo, Daniel, have you more information ?
More information would be welcome :-)
>
> >
> > 2. ARM SoCs don't have such infrastructure, so we depend on SW here.
> > Scheduler has to remove tasks from a particular CPU and "execute" on
> > it the idle_task.
> > Moreover at Exynos4 thermal control loop depends on SW, since we can
> > only read SoC temperature via TMU (Thermal Management Unit) block.
>
> The idle duration is quite small and should not perturb normal
> behavior
What do you mean by "small idle duration"? Do you mean the exact time
needed to enter the idle state (ARM's WFI), or the time during which the
CPU is idle?
>
> Vincent
> >
> >
> > Correct me again, but it seems to me that on ARM we can use CPU
> > hotplug (which as Thomas Gleixner stated recently is going to be
> > "refactored" :-) ) or "ask" scheduler to use smallest possible
> > number of CPUs and enter C state for idling CPUs.
> >
> >
> >
> >>
> >>
> >> Vincent
> >>
> >> >
> >> >>
> >> >> > Contrary, when all cores are heavily loaded, we decided to
> >> >> > reduce frequency by around 30%. With this approach user
> >> >> > experience reduction is still acceptable (with much less power
> >> >> > consumption).
> >> >>
> >> >> Don't know.. running many cpus at lower freq for long duration
> >> >> will probably take more power than running them at high freq
> >> >> for short duration and making system idle again.
> >> >>
> >> >> > We have posted this "RFC" patch mainly for discussion, and I
> >> >> > think it fits its purpose :-).
> >> >>
> >> >> Yes, no issues with your RFC idea.. its perfect..
> >> >>
> >> >> @Vincent: Can you please follow this thread a bit and tell us
> >> >> what your views are?
> >> >>
> >> >> --
> >> >> viresh
> >> >
> >> >
> >> >
> >> > --
> >> > Best regards,
> >> >
> >> > Lukasz Majewski
> >> >
> >> > Samsung R&D Poland (SRPOL) | Linux Platform Group
> >> >
> >
> >
> > --
> > Best regards,
> >
> > Lukasz Majewski
> >
> > Samsung R&D Poland (SRPOL) | Linux Platform Group
--
Best regards,
Lukasz Majewski
Samsung R&D Poland (SRPOL) | Linux Platform Group
On Wed, Apr 10, 2013 at 09:44:52AM +0100, Lukasz Majewski wrote:
[...]
> > Have you also looked at the power clamp driver that have similar
> > target ?
>
> I might be wrong here, but in my opinion the power clamp driver is a bit
> different:
>
> 1. It is dedicated to Intel SoCs, which provide special set of
> registers (i.e. MSR_PKG_Cx_RESIDENCY [*]), which forces a processor to
> enter certain C state for a given duration. Idle duration is calculated
> by per CPU set of high priority kthreads (which also program [*]
> registers).
>
Those registers are used for compensation (i.e. the user asked for a given
idle ratio but the HW stats show a mismatch) and they are not "programmed",
they are just read. That code is Intel specific but it can be easily ported
to ARM; I did that, and most of the code is common, with zero dependency
on the architecture.
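In other words, the residency counters only close the loop. A generic
sketch of that compensation (made-up names, not the actual intel_powerclamp
code):

/* Generic feedback sketch, not the actual intel_powerclamp code: the
 * residency counter is only read, and the injected idle time is nudged
 * until the measured idle ratio matches the requested target. */
static long compensate_idle_ms(long injected_ms, int target_pct,
			       int measured_pct)
{
	/* proportional correction: inject more idle time when the
	 * measured residency falls short of the target */
	return injected_ms +
	       injected_ms * (target_pct - measured_pct) / 100;
}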
> 2. ARM SoCs don't have such infrastructure, so we depend on SW here.
Well, it is true that most of the SoCs I am working on do not have
a programming interface to monitor C-state residency, granted, this is
a problem. If those stats can be retrieved somehow (I did that on our TC2
platform) then power clamp can be used on ARM with minor modifications.
Lorenzo
On 10 April 2013 11:38, Lukasz Majewski <[email protected]> wrote:
> Hi Vincent,
>
>> On 10 April 2013 10:44, Lukasz Majewski <[email protected]>
>> wrote:
>> > Hi Vincent,
>> >
>> >>
>> >>
>> >> On Tuesday, 9 April 2013, Lukasz Majewski <[email protected]>
>> >> wrote:
>> >> > Hi Viresh and Vincent,
>> >> >
>> >> >> On 9 April 2013 16:07, Lukasz Majewski <[email protected]>
>> >> >> wrote:
>> >> >> >> On Mon, Apr 1, 2013 at 1:54 PM, Jonghwa Lee
>> >> >> > Our approach is a bit different than cpufreq_ondemand one.
>> >> >> > Ondemand takes the per CPU idle time, then on that basis
>> >> >> > calculates per cpu load. The next step is to choose the
>> >> >> > highest load and then use this value to properly scale
>> >> >> > frequency.
>> >> >> >
>> >> >> > On the other hand LAB tries to model different behavior:
>> >> >> >
>> >> >> > As a first step we applied Vincent Guittot's "pack small
>> >> >> > tasks" [*] patch to improve "race to idle" behavior:
>> >> >> > http://article.gmane.org/gmane.linux.kernel/1371435/match=sched+pack+small+tasks
>> >> >>
>> >> >> Luckily he is part of my team :)
>> >> >>
>> >> >> http://www.linaro.org/linux-on-arm/meet-the-team/power-management
>> >> >>
>> >> >> BTW, he is using ondemand governor for all his work.
>> >> >>
>> >> >> > Afterwards, we decided to investigate different approach for
>> >> >> > power governing:
>> >> >> >
>> >> >> > Use the number of sleeping CPUs (not the maximal per-CPU
>> >> >> > load) to change frequency. We thereof depend on [*] to "pack"
>> >> >> > as many tasks to CPU as possible and allow other to sleep.
>> >> >>
>> >> >> He packs only small tasks.
>> >> >
>> >> > What's about packing not only small tasks? I will investigate the
>> >> > possibility to aggressively pack (even with a cost of performance
>> >> > degradation) as many tasks as possible to a single CPU.
>> >>
>> >> Hi Lukasz,
>> >>
>> >> I've got same comment on my current patch and I'm preparing a new
>> >> version that can pack tasks more agressively based on the same
>> >> buddy mecanism. This will be done at the cost of performance of
>> >> course.
>> >
>> > Can you share your development tree?
>>
>> The dev is not finished yet but i will share it as soon as possible
>
> Ok
>
>>
>> >
>> >>
>> >>
>> >> >
>> >> > It seems a good idea for a power consumption reduction.
>> >>
>> >> In fact, it's not always true and depends several inputs like the
>> >> number of tasks that run simultaneously
>> >
>> > In my understanding, we can try to couple (affine) maximal number of
>> > task with a CPU. Performance shall decrease, but we will avoid
>> > costs of tasks migration.
>> >
>> > If I remember correctly, I've asked you about some testbench/test
>> > program for scheduler evaluation. I assume that nothing has changed
>> > and there isn't any "common" set of scheduler tests?
>>
>> There are a bunch of bench that are used to evaluate scheduler like
>> hackbench, pgbench but they generally fills all CPU in order to test
>> max performance. Are you looking for such kind of bench ?
>
> I'd rather see a bit different set of tests - something similar to
> "cyclic" tests for PREEMPT_RT patch.
>
> For sched work it would be welcome to spawn a lot of processes with
> different duration and workload. And on this basis observe if e.g. 2 or
> 3 processors are idle.
Sanjay is working on something like that:
https://git.linaro.org/gitweb?p=people/sanjayrawat/cyclicTest.git;a=shortlog;h=refs/heads/master
>
>>
>> >
>> >>
>> >> >
>> >> >> And if there are many small tasks we are
>> >> >> packing, then load must be high and so ondemand gov will
>> >> >> increase freq.
>> >> >
>> >> > This is of course true for "packing" all tasks to a single CPU.
>> >> > If we stay at the power consumption envelope, we can even
>> >> > overclock the frequency.
>> >> >
>> >> > But what if other - lets say 3 CPUs - are under heavy workload?
>> >> > Ondemand will switch frequency to maximum, and as Jonghwa pointed
>> >> > out this can cause dangerous temperature increase.
>> >>
>> >> IIUC, your main concern is to stay in a power consumption budget to
>> >> not over heat and have to face the side effect of high temperature
>> >> like a decrease of power efficiency. So your governor modifies the
>> >> max frequency based on the number of running/idle CPU
>> > Yes, this is correct.
>> >
>> >> to have an
>> >> almost stable power consumption ?
>> >
>> > From our observation it seems, that for 3 or 4 running CPUs under
>> > heavy load we see much more power consumption reduction.
>>
>> That's logic because you will reduce the voltage
>>
>> >
>> > To put it in another way - ondemand would increase frequency to max
>> > for all 4 CPUs. On the other hand, if user experience drops to the
>> > acceptable level we can reduce power consumption.
>> >
>> > Reducing frequency and CPU voltage (by DVS) causes as a side effect,
>> > that temperature stays at acceptable level.
>> >
>> >>
>> >> Have you also looked at the power clamp driver that have similar
>> >> target ?
>> >
>> > I might be wrong here, but in my opinion the power clamp driver is
>> > a bit different:
>>
>> yes, it periodically forces the cluster in a low power state
>>
>> >
>> > 1. It is dedicated to Intel SoCs, which provide special set of
>> > registers (i.e. MSR_PKG_Cx_RESIDENCY [*]), which forces a processor
>> > to enter certain C state for a given duration. Idle duration is
>> > calculated by per CPU set of high priority kthreads (which also
>> > program [*] registers).
>>
>> IIRC, a trial on ARM platform have been done by lorenzo and daniel.
>> Lorenzo, Daniel, have you more information ?
>
> More information would be welcome :-)
>
>>
>> >
>> > 2. ARM SoCs don't have such infrastructure, so we depend on SW here.
>> > Scheduler has to remove tasks from a particular CPU and "execute" on
>> > it the idle_task.
>> > Moreover at Exynos4 thermal control loop depends on SW, since we can
>> > only read SoC temperature via TMU (Thermal Management Unit) block.
>>
>> The idle duration is quite small and should not perturb normal
>> behavior
>
> What do you mean by "small idle duration"? You think about exact time
> needed to enter idle state (ARM's WFI) or the time in which CPU is idle.
The time in which CPUs are idle
Vincent
>
>>
>> Vincent
>> >
>> >
>> > Correct me again, but it seems to me that on ARM we can use CPU
>> > hotplug (which as Thomas Gleixner stated recently is going to be
>> > "refactored" :-) ) or "ask" scheduler to use smallest possible
>> > number of CPUs and enter C state for idling CPUs.
>> >
>> >
>> >
>> >>
>> >>
>> >> Vincent
>> >>
>> >> >
>> >> >>
>> >> >> > Contrary, when all cores are heavily loaded, we decided to
>> >> >> > reduce frequency by around 30%. With this approach user
>> >> >> > experience reduction is still acceptable (with much less power
>> >> >> > consumption).
>> >> >>
>> >> >> Don't know.. running many cpus at lower freq for long duration
>> >> >> will probably take more power than running them at high freq
>> >> >> for short duration and making system idle again.
>> >> >>
>> >> >> > We have posted this "RFC" patch mainly for discussion, and I
>> >> >> > think it fits its purpose :-).
>> >> >>
>> >> >> Yes, no issues with your RFC idea.. its perfect..
>> >> >>
>> >> >> @Vincent: Can you please follow this thread a bit and tell us
>> >> >> what your views are?
>> >> >>
>> >> >> --
>> >> >> viresh
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Best regards,
>> >> >
>> >> > Lukasz Majewski
>> >> >
>> >> > Samsung R&D Poland (SRPOL) | Linux Platform Group
>> >> >
>> >
>> >
>> > --
>> > Best regards,
>> >
>> > Lukasz Majewski
>> >
>> > Samsung R&D Poland (SRPOL) | Linux Platform Group
>
>
>
> --
> Best regards,
>
> Lukasz Majewski
>
> Samsung R&D Poland (SRPOL) | Linux Platform Group