Predicting the future is difficult. When the cpuidle governor's prediction
fails, the governor may choose a shallower C-state than it should, so noticing
and correcting the failure quickly is important for power saving.
The cpuidle menu governor has a method for detecting repeating patterns: if the
last 8 C-state residencies are the same or very close to each other, it
predicts that the next residency will have the same duration.
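The repeat-pattern check described above can be sketched in user-space C as
follows. This is only an illustration modelled on detect_repeating_patterns()
in menu.c: the function name detect_repeat() and its signature are
hypothetical, and plain integer division is assumed where the kernel uses its
own rounding helper.

```c
#include <stdbool.h>
#include <stdint.h>

#define INTERVALS      8    /* number of recent idle intervals tracked */
#define STDDEV_THRESH  400  /* limit on the variance, in us^2 */

/*
 * Hypothetical helper: return true and store the average in *predicted_us
 * when the last INTERVALS idle durations (in microseconds) are close to
 * each other. Simplified from the menu governor; not the kernel code.
 */
static bool detect_repeat(const uint32_t intervals[INTERVALS],
                          uint32_t expected_us, uint32_t *predicted_us)
{
    uint64_t avg = 0, variance = 0;
    int i;

    for (i = 0; i < INTERVALS; i++)
        avg += intervals[i];
    avg /= INTERVALS;

    /* An average beyond the next known timer event is not useful. */
    if (avg > expected_us)
        return false;

    /* Accumulate squared deviations, then average them. */
    for (i = 0; i < INTERVALS; i++) {
        int64_t diff = (int64_t)intervals[i] - (int64_t)avg;
        variance += (uint64_t)(diff * diff);
    }
    variance /= INTERVALS;

    if (avg == 0 || variance >= STDDEV_THRESH)
        return false;

    *predicted_us = (uint32_t)avg;
    return true;
}
```

With eight samples clustered around 100 us the helper reports a repeat and
predicts 100 us; a single 5000 us outlier is enough to break the pattern.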
We hit a real case with the turbostat utility (tools/power/x86/turbostat) in
kernel 3.3 and earlier. On Sandybridge, turbostat reads 10 registers one by
one, which generates 10 IPIs that wake up idle CPUs. The cpuidle menu governor
then predicts a repeating pattern and expects another IPI to wake the idle CPU
soon, so it keeps the CPU in C1 even though the CPU is completely idle.
However, after reading the 10 registers turbostat sleeps for 5 seconds by
default, so the idle CPU stays in C1 for a long time, until a break event
occurs.
The "old turbostat" is a specific case and has already been fixed by skipping
the CPU MSR reads. But we cannot guarantee that no other application behaves
this way, so the proper fix is to enhance the menu governor's prediction logic
for the next C-state.
This patchset adds a timer when the menu governor chooses a non-deepest
C-state, so that the CPU wakes up quickly from a shallow C-state instead of
staying there too long after a failed prediction. If the CPU wakes up from the
C-state before the timer fires, the timer is cancelled so that it has no
further effect on the system. If the timer expires, the CPU is woken from the
shallow C-state and the governor re-evaluates whether a deeper C-state is
possible.
After plenty of testing and tuning, the patchset gives about a 1% power
efficiency improvement in SpecPower2008 on Romley-EP. In particular, when the
load is below 70%, it saves 1~3 watts; when the load is above 80%, it costs
more power. Other non-CPU-intensive benchmarks, such as fio, apache and
aio-stress, also save power with no drop in performance.
While working on this issue I got a lot of help and suggestions from Arjan.
Thanks a lot, Arjan!
Thanks
-Youquan
Predicting the future is difficult. When the cpuidle governor's prediction
fails, the governor may choose a shallower C-state than it should, so noticing
and correcting the failure quickly is important for power saving.
The cpuidle menu governor has a method for detecting repeating patterns: if the
last 8 C-state residencies are the same or very close to each other, it
predicts that the next residency will have the same duration.
We hit a real case with the turbostat utility (tools/power/x86/turbostat) in
kernel 3.3 and earlier. On Sandybridge, turbostat reads 10 registers one by
one, which generates 10 IPIs that wake up idle CPUs. The cpuidle menu governor
then predicts a repeating pattern and expects another IPI to wake the idle CPU
soon, so it keeps the CPU in C1 even though the CPU is completely idle.
However, after reading the 10 registers turbostat sleeps for 5 seconds by
default, so the idle CPU stays in C1 for a long time, until a break event
occurs.
This patch adds a timer that is started when the menu governor detects repeat
mode and chooses a shallow C-state. The timer expires after
(2 * predicted_time + 60) microseconds. When the repeat pattern continues as
expected, the CPU wakes up from the C-state before the timer fires and cancels
the timer. When the repeat pattern does not continue, the timer expires, so
within (2 * predicted_time + 60) us the menu governor notices that the
repeat-mode prediction failed and re-evaluates whether a deeper C-state is
possible.
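As a rough illustration of the timeout arithmetic, the sketch below computes
the (2 * predicted_time + 60) us deadline and checks whether an actual idle
period would outlast it. The helper names are hypothetical and nothing here
touches hrtimers. The 60 us margin follows the comment in the patch: a variance
below 400 over 8 samples bounds the sum of squared deviations by 3200, so no
single interval is more than about 57 us from the average.

```c
#include <stdint.h>

#define MAX_DEVIATION 60  /* microseconds, as defined in the patch */

/* Hypothetical helper: timeout armed when repeat mode is predicted. */
static uint64_t repeat_timeout_us(uint32_t predicted_us)
{
    return 2ULL * predicted_us + MAX_DEVIATION;
}

/*
 * Hypothetical helper: nonzero when the CPU would still be idle at the
 * deadline, i.e. the timer fires and the prediction is known to be wrong.
 */
static int repeat_prediction_failed(uint32_t predicted_us,
                                    uint64_t actual_idle_us)
{
    return actual_idle_us >= repeat_timeout_us(predicted_us);
}
```

Assuming, for illustration only, a predicted residency of 50 us, the deadline
is 160 us. A 5 second sleep like the one in the turbostat case would be caught
at that point instead of leaving the CPU in C1 for the whole period.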
Signed-off-by: Youquan Song <[email protected]>
---
drivers/cpuidle/governors/menu.c | 68 +++++++++++++++++++++++++++++++++++--
include/linux/tick.h | 6 +++
kernel/time/tick-sched.c | 2 +
3 files changed, 72 insertions(+), 4 deletions(-)
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 0633575..8c23fbd 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -28,6 +28,11 @@
#define MAX_INTERESTING 50000
#define STDDEV_THRESH 400
+/* 60 * 60 > STDDEV_THRESH * INTERVALS = 400 * 8 */
+#define MAX_DEVIATION 60
+
+static DEFINE_PER_CPU(struct hrtimer, menu_hrtimer);
+static DEFINE_PER_CPU(int, hrtimer_started);
/*
* Concepts and ideas behind the menu governor
@@ -191,17 +196,42 @@ static u64 div_round64(u64 dividend, u32 divisor)
return div_u64(dividend + (divisor / 2), divisor);
}
+/* Cancel the hrtimer if it is not triggered yet */
+void menu_hrtimer_cancel(void)
+{
+ int cpu = smp_processor_id();
+ struct hrtimer *hrtmr = &per_cpu(menu_hrtimer, cpu);
+
+ /* The timer has not expired yet */
+ if (per_cpu(hrtimer_started, cpu)) {
+ hrtimer_cancel(hrtmr);
+ per_cpu(hrtimer_started, cpu) = 0;
+ }
+}
+EXPORT_SYMBOL_GPL(menu_hrtimer_cancel);
+
+/* Callback invoked when the hrtimer fires */
+static enum hrtimer_restart menu_hrtimer_notify(struct hrtimer *hrtimer)
+{
+ int cpu = smp_processor_id();
+
+ per_cpu(hrtimer_started, cpu) = 0;
+
+ return HRTIMER_NORESTART;
+}
+
/*
* Try detecting repeating patterns by keeping track of the last 8
* intervals, and checking if the standard deviation of that set
* of points is below a threshold. If it is... then use the
* average of these 8 points as the estimated value.
*/
-static void detect_repeating_patterns(struct menu_device *data)
+static int detect_repeating_patterns(struct menu_device *data)
{
int i;
uint64_t avg = 0;
uint64_t stddev = 0; /* contains the square of the std deviation */
+ int ret = 0;
/* first calculate average and standard deviation of the past */
for (i = 0; i < INTERVALS; i++)
@@ -210,7 +240,7 @@ static void detect_repeating_patterns(struct menu_device *data)
/* if the avg is beyond the known next tick, it's worthless */
if (avg > data->expected_us)
- return;
+ return 0;
for (i = 0; i < INTERVALS; i++)
stddev += (data->intervals[i] - avg) *
@@ -223,8 +253,12 @@ static void detect_repeating_patterns(struct menu_device *data)
* repeating pattern and predict we keep doing this.
*/
- if (avg && stddev < STDDEV_THRESH)
+ if (avg && stddev < STDDEV_THRESH) {
data->predicted_us = avg;
+ ret = 1;
+ }
+
+ return ret;
}
/**
@@ -240,6 +274,9 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
int i;
int multiplier;
struct timespec t;
+ int repeat = 0;
+ int cpu = smp_processor_id();
+ struct hrtimer *hrtmr = &per_cpu(menu_hrtimer, cpu);
if (data->needs_update) {
menu_update(drv, dev);
@@ -274,7 +311,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
data->predicted_us = div_round64(data->expected_us * data->correction_factor[data->bucket],
RESOLUTION * DECAY);
- detect_repeating_patterns(data);
+ repeat = detect_repeating_patterns(data);
/*
* We want to default to C1 (hlt), not to busy polling
@@ -307,6 +344,26 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
}
}
+ if (data->last_state_idx < drv->state_count - 1) {
+
+ /* Repeat mode detected */
+ if (repeat) {
+ unsigned int repeat_us = 0;
+ /*
+ * Set enough timer to recognize the repeat mode broken.
+ * If the timer is time out, the repeat mode prediction
+ * fails,then re-evaluate deeper C-states possibility.
+ * If the timer is not triggered, the timer will be
+ * cancelled when CPU waken up.
+ */
+ repeat_us = 2 * data->predicted_us + MAX_DEVIATION;
+ hrtimer_start(hrtmr, ns_to_ktime(1000 * repeat_us),
+ HRTIMER_MODE_REL_PINNED);
+ /* menu hrtimer is started */
+ per_cpu(hrtimer_started, cpu) = 1;
+ }
+ }
+
return data->last_state_idx;
}
@@ -397,6 +454,9 @@ static int menu_enable_device(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
{
struct menu_device *data = &per_cpu(menu_devices, dev->cpu);
+ struct hrtimer *t = &per_cpu(menu_hrtimer, dev->cpu);
+ hrtimer_init(t, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ t->function = menu_hrtimer_notify;
memset(data, 0, sizeof(struct menu_device));
diff --git a/include/linux/tick.h b/include/linux/tick.h
index ab8be90..4cea029 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -142,4 +142,10 @@ static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
# endif /* !NO_HZ */
+# ifdef CONFIG_CPU_IDLE_GOV_MENU
+extern void menu_hrtimer_cancel(void);
+# else
+static inline void menu_hrtimer_cancel(void) {}
+# endif /* CONFIG_CPU_IDLE_GOV_MENU */
+
#endif
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6a3a5b9..812c1d6 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -499,6 +499,8 @@ void tick_nohz_irq_exit(void)
if (!ts->inidle)
return;
+ /* Cancel the timer because the CPU has already woken up from the C-state */
+ menu_hrtimer_cancel();
tick_nohz_stop_sched_tick(ts);
}
@@ -562,6 +564,8 @@ void tick_nohz_idle_exit(void)
ts->inidle = 0;
+ /* Cancel the timer because the CPU has already woken up from the C-state */
+ menu_hrtimer_cancel();
if (ts->idle_active || ts->tick_stopped)
now = ktime_get();
When the cpuidle governor has chosen a C-state for an idle CPU but then notices
that a task is waiting to run, the CPU does not actually enter the target
C-state and goes on to run the task.
In this situation the governor still uses the residency of the previously
entered C-state, which is obviously not reasonable.
This patch fixes that by setting the target C-state residency to 0 and giving
the governor a chance to reflect on the outcome.
Signed-off-by: Youquan Song <[email protected]>
---
drivers/cpuidle/cpuidle.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 2f0083a..7992417 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -136,6 +136,10 @@ int cpuidle_idle_call(void)
/* ask the governor for the next state */
next_state = cpuidle_curr_governor->select(drv, dev);
if (need_resched()) {
+ dev->last_residency = 0;
+ /* give the governor an opportunity to reflect on the outcome */
+ if (cpuidle_curr_governor->reflect)
+ cpuidle_curr_governor->reflect(dev, next_state);
local_irq_enable();
return 0;
}
This patch extends the previous repeat-mode enhancement by adding a timer
whenever the menu governor chooses a shallow C-state.
The timer is set to expire after 50 milliseconds by default. This follows the
special twist that sleeping longer than that gives no further power saving.
When the CPU wakes up from the C-state before the timer expires, the timer is
cancelled. When the timer fires, the menu governor quickly notices the
prediction failure and re-evaluates whether a deeper C-state is possible.
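The choice between the two timeouts can be sketched as a pure function. This
is a simplified model of the branch added to menu_select(), assuming a
non-deepest C-state has already been chosen. The name pick_timeout_us() and the
convention of returning 0 for "no timer armed" are hypothetical.

```c
#include <stdint.h>

#define MAX_DEVIATION 60  /* microseconds, as defined in the patch */

/*
 * Hypothetical helper: return the timer duration in microseconds, or 0
 * when no timer would be started. 'repeat' is nonzero when repeat mode
 * was detected; perfect_cstate_ms defaults to 50 in the patch.
 */
static uint64_t pick_timeout_us(int repeat, uint32_t predicted_us,
                                uint32_t expected_us,
                                uint32_t perfect_cstate_ms)
{
    uint64_t repeat_us = repeat ? 2ULL * predicted_us + MAX_DEVIATION : 0;
    uint64_t perfect_us = (uint64_t)perfect_cstate_ms * 1000;

    /* Repeat mode with a deadline shorter than the 50 ms default. */
    if (repeat && repeat_us < perfect_us)
        return repeat_us;

    /* Otherwise cap the shallow stay, if the CPU may sleep longer. */
    if (perfect_us < expected_us)
        return perfect_us;

    return 0;
}
```

Under these assumptions, a repeat prediction of 50 us yields a 160 us timer,
while a shallow state chosen without repeat mode and with a next expected event
5 seconds away yields the 50 ms default.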
Signed-off-by: Youquan Song <[email protected]>
---
drivers/cpuidle/governors/menu.c | 48 ++++++++++++++++++++++++++------------
1 files changed, 33 insertions(+), 15 deletions(-)
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 8c23fbd..9f92dd4 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -113,6 +113,13 @@ static DEFINE_PER_CPU(int, hrtimer_started);
* represented in the system load average.
*
*/
+
+/*
+ * Defaults to 50 milliseconds, based on the special twist mentioned above:
+ * there are no power gains from sleeping longer than that.
+ */
+static unsigned int perfect_cstate_ms __read_mostly = 50;
+module_param(perfect_cstate_ms, uint, 0000);
struct menu_device {
int last_state_idx;
@@ -343,26 +350,37 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
data->exit_us = s->exit_latency;
}
}
-
+
+ /* not deepest C-state chosen */
if (data->last_state_idx < drv->state_count - 1) {
+ unsigned int repeat_us = 0;
+ unsigned int perfect_us = 0;
+
+ /*
+ * Set enough timer to recognize the repeat mode broken.
+ * If the timer is time out, the repeat mode prediction
+ * fails,then re-evaluate deeper C-states possibility.
+ * If the timer is not triggered, the timer will be
+ * cancelled when CPU waken up.
+ */
+ repeat_us =
+ (repeat ? (2 * data->predicted_us + MAX_DEVIATION) : 0);
+ perfect_us = perfect_cstate_ms * 1000;
/* Repeat mode detected */
- if (repeat) {
- unsigned int repeat_us = 0;
- /*
- * Set enough timer to recognize the repeat mode broken.
- * If the timer is time out, the repeat mode prediction
- * fails,then re-evaluate deeper C-states possibility.
- * If the timer is not triggered, the timer will be
- * cancelled when CPU waken up.
- */
- repeat_us = 2 * data->predicted_us + MAX_DEVIATION;
- hrtimer_start(hrtmr, ns_to_ktime(1000 * repeat_us),
- HRTIMER_MODE_REL_PINNED);
+ if (repeat && (repeat_us < perfect_us)) {
+ hrtimer_start(hrtmr, ns_to_ktime(1000 * repeat_us),
+ HRTIMER_MODE_REL_PINNED);
+ /* menu hrtimer is started */
+ per_cpu(hrtimer_started, cpu) = 1;
+ } else if (perfect_us < data->expected_us) {
+ /* expected time is larger than adding timer time */
+ hrtimer_start(hrtmr, ns_to_ktime(1000 * perfect_us),
+ HRTIMER_MODE_REL_PINNED);
/* menu hrtimer is started */
per_cpu(hrtimer_started, cpu) = 1;
- }
- }
+ }
+ }
return data->last_state_idx;
}
--
1.6.4.2