2012-10-16 13:01:28

by Youquan Song

[permalink] [raw]
Subject: [PATCH 0/5] x86,idle: Enhance menu governor C-state prediction



The prediction for future is difficult and when the cpuidle governor prediction
fails and govenor possibly choose the shallower C-state than it should. How to
quickly notice and find the failure becomes important for power saving.

cpuidle menu governor has a method to predict the repeat pattern if there are 8
C-states residency which are continuous and the same or very close, so it will
predict the next C-states residency will keep same residency time.

This patchset adds a timer when menu governor choose a non-deepest C-state in
order to wake up quickly from shallow C-state to avoid staying too long at
shallow C-state for prediction failure. The timer is set to a time out value
that is greater than predicted time and if the timer with the value is triggered
, we can confidently conclude prediction is failure. When prediction
succeeds, CPU is waken up from C-states in predicted time and the timer is not
triggered and will be cancelled right after CPU waken up. When prediction fails,
the timer is triggered to wake up CPU from shallow C-states, so menu governor
will quickly notice that prediction fails and then re-evaluates deeper C-states
possibility. This patchset can improves cpuidle prediction process for both
repeat mode and general mode.

The patchset integrates one patch from Rik van Riel <[email protected]>, which try
to find a typical interval along with cut the upside outliers depends on
historical sleep intervals. The patch tends to choose a shallow C-state to
achieve better performance and ehancement of prediction failure will advise it
if the deepest C-state should be chosen.

Testing result:

The whole patchset achieve good result after bunch of testing/tuning.
Testing on two sockets Sandybridge server, SPECPower2008 get 2%~5% increase
ssj_ops/watt; Running benchmark in phoronix-test-suite: compress-7zip,
build-linux-kernel, apache, fio etc, it also proves to increase the
performance/power; What's more, it not only boosts the performance but also
saves power.

There are also 2 cases will clear show this patchset benefit.

One case is turbostat utility (tools/power/x86/turbostat) at kernel 3.3 or early
. turbostat utility will read 10 registers one by one at Sandybridge, so it will
generate 10 IPIs to wake up idle CPUs. So cpuidle menu governor will predict it
is repeat mode and there is another IPI wake up idle CPU soon, so it keeps idle
CPU stay at C1 state even though CPU is totally idle. However, in the turbostat
, following 10 registers reading is sleep 5 seconds by default, so the idle CPU
will keep at C1 for a long time though it is idle until break event occurs.
In a idle Sandybridge system, run "./turbostat -v", we will notice that deep
C-state dangles between "70% ~ 99%". After patched the kernel, we will notice
deep C-state stays at >99.98%.

Below is another case which will clearly show the patch much benefit:

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/time.h>
#include <time.h>
#include <pthread.h>

volatile int * shutdown;
volatile long * count;
int delay = 20;
int loop = 8;

void usage(void)
{
fprintf(stderr,
"Usage: idle_predict [options]\n"
" --help -h Print this help\n"
" --thread -n Thread number\n"
" --loop -l Loop times in shallow Cstate\n"
" --delay -t Sleep time (uS)in shallow Cstate\n");
}

void *simple_loop() {
int idle_num = 1;
while (!(*shutdown)) {
*count = *count + 1;

if (idle_num % loop)
usleep(delay);
else {
/* sleep 1 second */
usleep(1000000);
idle_num = 0;
}
idle_num++;
}

}

static void sighand(int sig)
{
*shutdown = 1;
}

int main(int argc, char *argv[])
{
sigset_t sigset;
int signum = SIGALRM;
int i, c, er = 0, thread_num = 8;
pthread_t pt[1024];

static char optstr[] = "n:l:t:h:";

while ((c = getopt(argc, argv, optstr)) != EOF)
switch (c) {
case 'n':
thread_num = atoi(optarg);
break;
case 'l':
loop = atoi(optarg);
break;
case 't':
delay = atoi(optarg);
break;
case 'h':
default:
usage();
exit(1);
}

printf("thread=%d,loop=%d,delay=%d\n",thread_num,loop,delay);
count = malloc(sizeof(long));
shutdown = malloc(sizeof(int));
*count = 0;
*shutdown = 0;

sigemptyset(&sigset);
sigaddset(&sigset, signum);
sigprocmask (SIG_BLOCK, &sigset, NULL);
signal(SIGINT, sighand);
signal(SIGTERM, sighand);

for(i = 0; i < thread_num ; i++)
pthread_create(&pt[i], NULL, simple_loop, NULL);

for (i = 0; i < thread_num; i++)
pthread_join(pt[i], NULL);

exit(0);
}

Get powertop v2 from git://github.com/fenrus75/powertop, build powertop.
After build the above test application, then run it.
Test plaform can be Intel Sandybridge or other recent platforms.
#./idle_predict -l 10 &
#./powertop

We will find that deep C-state will dangle between 40%~100% and much time spent
on C1 state. It is because menu governor wrongly predict that repeat mode
is kept, so it will choose the C1 shallow C-state even though it has chance to
sleep 1 second in deep C-state.

While after patched the kernel, we find that deep C-state will keep >99.6%.

Thanks for help from Arjan, Len Brown and Rik!

Thanks
-Youquan


2012-10-16 13:01:33

by Youquan Song

[permalink] [raw]
Subject: [PATCH 1/5] x86,idle: Quickly notice prediction failure for repeat mode

The prediction for future is difficult and when the cpuidle governor prediction
fails and govenor possibly choose the shallower C-state than it should. How to
quickly notice and find the failure becomes important for power saving.

cpuidle menu governor has a method to predict the repeat pattern if there are 8
C-states residency which are continuous and the same or very close, so it will
predict the next C-states residency will keep same residency time.

There is a real case that turbostat utility (tools/power/x86/turbostat)
at kernel 3.3 or early. turbostat utility will read 10 registers one by one at
Sandybridge, so it will generate 10 IPIs to wake up idle CPUs. So cpuidle menu
governor will predict it is repeat mode and there is another IPI wake up idle
CPU soon, so it keeps idle CPU stay at C1 state even though CPU is totally
idle. However, in the turbostat, following 10 registers reading is sleep 5
seconds by default, so the idle CPU will keep at C1 for a long time though it is
idle until break event occurs.
In a idle Sandybridge system, run "./turbostat -v", we will notice that deep
C-state dangles between "70% ~ 99%". After patched the kernel, we will notice
deep C-state stays at >99.98%.

In the patch, a timer is added when menu governor detects a repeat mode and
choose a shallow C-state. The timer is set to a time out value that greater
than predicted time, and we conclude repeat mode prediction failure if timer is
triggered. When repeat mode happens as expected, the timer is not triggered
and CPU waken up from C-states and it will cancel the timer initiatively.
When repeat mode does not happen, the timer will be time out and menu governor
will quickly notice that the repeat mode prediction fails and then re-evaluates
deeper C-states possibility.

Below is another case which will clearly show the patch much benefit:

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/time.h>
#include <time.h>
#include <pthread.h>

volatile int * shutdown;
volatile long * count;
int delay = 20;
int loop = 8;

void usage(void)
{
fprintf(stderr,
"Usage: idle_predict [options]\n"
" --help -h Print this help\n"
" --thread -n Thread number\n"
" --loop -l Loop times in shallow Cstate\n"
" --delay -t Sleep time (uS)in shallow Cstate\n");
}

void *simple_loop() {
int idle_num = 1;
while (!(*shutdown)) {
*count = *count + 1;

if (idle_num % loop)
usleep(delay);
else {
/* sleep 1 second */
usleep(1000000);
idle_num = 0;
}
idle_num++;
}

}

static void sighand(int sig)
{
*shutdown = 1;
}

int main(int argc, char *argv[])
{
sigset_t sigset;
int signum = SIGALRM;
int i, c, er = 0, thread_num = 8;
pthread_t pt[1024];

static char optstr[] = "n:l:t:h:";

while ((c = getopt(argc, argv, optstr)) != EOF)
switch (c) {
case 'n':
thread_num = atoi(optarg);
break;
case 'l':
loop = atoi(optarg);
break;
case 't':
delay = atoi(optarg);
break;
case 'h':
default:
usage();
exit(1);
}

printf("thread=%d,loop=%d,delay=%d\n",thread_num,loop,delay);
count = malloc(sizeof(long));
shutdown = malloc(sizeof(int));
*count = 0;
*shutdown = 0;

sigemptyset(&sigset);
sigaddset(&sigset, signum);
sigprocmask (SIG_BLOCK, &sigset, NULL);
signal(SIGINT, sighand);
signal(SIGTERM, sighand);

for(i = 0; i < thread_num ; i++)
pthread_create(&pt[i], NULL, simple_loop, NULL);

for (i = 0; i < thread_num; i++)
pthread_join(pt[i], NULL);

exit(0);
}

Get powertop V2 from git://github.com/fenrus75/powertop, build powertop.
After build the above test application, then run it.
Test plaform can be Intel Sandybridge or other recent platforms.
#./idle_predict -l 10 &
#./powertop

We will find that deep C-state will dangle between 40%~100% and much time spent
on C1 state. It is because menu governor wrongly predict that repeat mode
is kept, so it will choose the C1 shallow C-state even though it has chance to
sleep 1 second in deep C-state.

While after patched the kernel, we find that deep C-state will keep >99.6%.

Signed-off-by: Youquan Song <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
drivers/cpuidle/governors/menu.c | 73 +++++++++++++++++++++++++++++++++++---
include/linux/tick.h | 6 +++
kernel/time/tick-sched.c | 4 ++
3 files changed, 78 insertions(+), 5 deletions(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 5b1f2c3..beeab6a 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -28,6 +28,11 @@
#define MAX_INTERESTING 50000
#define STDDEV_THRESH 400

+/* 60 * 60 > STDDEV_THRESH * INTERVALS = 400 * 8 */
+#define MAX_DEVIATION 60
+
+static DEFINE_PER_CPU(struct hrtimer, menu_hrtimer);
+static DEFINE_PER_CPU(int, hrtimer_started);

/*
* Concepts and ideas behind the menu governor
@@ -191,17 +196,42 @@ static u64 div_round64(u64 dividend, u32 divisor)
return div_u64(dividend + (divisor / 2), divisor);
}

+/* Cancel the hrtimer if it is not triggered yet */
+void menu_hrtimer_cancel(void)
+{
+ int cpu = smp_processor_id();
+ struct hrtimer *hrtmr = &per_cpu(menu_hrtimer, cpu);
+
+ /* The timer is still not time out*/
+ if (per_cpu(hrtimer_started, cpu)) {
+ hrtimer_cancel(hrtmr);
+ per_cpu(hrtimer_started, cpu) = 0;
+ }
+}
+EXPORT_SYMBOL_GPL(menu_hrtimer_cancel);
+
+/* Call back for hrtimer is triggered */
+static enum hrtimer_restart menu_hrtimer_notify(struct hrtimer *hrtimer)
+{
+ int cpu = smp_processor_id();
+
+ per_cpu(hrtimer_started, cpu) = 0;
+
+ return HRTIMER_NORESTART;
+}
+
/*
* Try detecting repeating patterns by keeping track of the last 8
* intervals, and checking if the standard deviation of that set
* of points is below a threshold. If it is... then use the
* average of these 8 points as the estimated value.
*/
-static void detect_repeating_patterns(struct menu_device *data)
+static int detect_repeating_patterns(struct menu_device *data)
{
int i;
uint64_t avg = 0;
uint64_t stddev = 0; /* contains the square of the std deviation */
+ int ret = 0;

/* first calculate average and standard deviation of the past */
for (i = 0; i < INTERVALS; i++)
@@ -210,7 +240,7 @@ static void detect_repeating_patterns(struct menu_device *data)

/* if the avg is beyond the known next tick, it's worthless */
if (avg > data->expected_us)
- return;
+ return 0;

for (i = 0; i < INTERVALS; i++)
stddev += (data->intervals[i] - avg) *
@@ -223,8 +253,12 @@ static void detect_repeating_patterns(struct menu_device *data)
* repeating pattern and predict we keep doing this.
*/

- if (avg && stddev < STDDEV_THRESH)
+ if (avg && stddev < STDDEV_THRESH) {
data->predicted_us = avg;
+ ret = 1;
+ }
+
+ return ret;
}

/**
@@ -240,6 +274,9 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
int i;
int multiplier;
struct timespec t;
+ int repeat = 0, low_predicted = 0;
+ int cpu = smp_processor_id();
+ struct hrtimer *hrtmr = &per_cpu(menu_hrtimer, cpu);

if (data->needs_update) {
menu_update(drv, dev);
@@ -274,7 +311,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
data->predicted_us = div_round64(data->expected_us * data->correction_factor[data->bucket],
RESOLUTION * DECAY);

- detect_repeating_patterns(data);
+ repeat = detect_repeating_patterns(data);

/*
* We want to default to C1 (hlt), not to busy polling
@@ -295,8 +332,10 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)

if (s->disabled || su->disable)
continue;
- if (s->target_residency > data->predicted_us)
+ if (s->target_residency > data->predicted_us) {
+ low_predicted = 1;
continue;
+ }
if (s->exit_latency > latency_req)
continue;
if (s->exit_latency * multiplier > data->predicted_us)
@@ -309,6 +348,27 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
}
}

+ /* not deepest C-state chosen for low predicted residency */
+ if (low_predicted) {
+ unsigned int timer_us = 0;
+
+ /*
+ * Set a timer to detect whether this sleep is much
+ * longer than repeat mode predicted. If the timer
+ * triggers, the code will evaluate whether to put
+ * the CPU into a deeper C-state.
+ * The timer is cancelled on CPU wakeup.
+ */
+ timer_us = 2 * (data->predicted_us + MAX_DEVIATION);
+
+ if (repeat && (4 * timer_us < data->expected_us)) {
+ hrtimer_start(hrtmr, ns_to_ktime(1000 * timer_us),
+ HRTIMER_MODE_REL_PINNED);
+ /* menu hrtimer is started */
+ per_cpu(hrtimer_started, cpu) = 1;
+ }
+ }
+
return data->last_state_idx;
}

@@ -399,6 +459,9 @@ static int menu_enable_device(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
{
struct menu_device *data = &per_cpu(menu_devices, dev->cpu);
+ struct hrtimer *t = &per_cpu(menu_hrtimer, dev->cpu);
+ hrtimer_init(t, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ t->function = menu_hrtimer_notify;

memset(data, 0, sizeof(struct menu_device));

diff --git a/include/linux/tick.h b/include/linux/tick.h
index f37fceb..8867424 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -142,4 +142,10 @@ static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
# endif /* !NO_HZ */

+# ifdef CONFIG_CPU_IDLE_GOV_MENU
+extern void menu_hrtimer_cancel(void);
+# else
+static inline void menu_hrtimer_cancel(void) { return -1; }
+# endif /* CONFIG_CPU_IDLE_GOV_MENU */
+
#endif
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3a9e5d5..dff6d13 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -525,6 +525,8 @@ void tick_nohz_irq_exit(void)
if (!ts->inidle)
return;

+ /* Cancel the timer because CPU already waken up from the C-states*/
+ menu_hrtimer_cancel();
__tick_nohz_idle_enter(ts);
}

@@ -621,6 +623,8 @@ void tick_nohz_idle_exit(void)

ts->inidle = 0;

+ /* Cancel the timer because CPU already waken up from the C-states*/
+ menu_hrtimer_cancel();
if (ts->idle_active || ts->tick_stopped)
now = ktime_get();

--
1.7.7.4

2012-10-16 13:01:41

by Youquan Song

[permalink] [raw]
Subject: [PATCH 3/5] x86,idle: Reset correction factor

In general case, the expected residency is much larger than deepest C-state
target residency, but prediction logic still predicts the small predicted
residency, so the prediction history is totally broken. In this situation,
reset the correction factor is the only choice.

Signed-off-by: Youquan Song <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
drivers/cpuidle/governors/menu.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index b34bf11..7dbac97 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -221,6 +221,10 @@ EXPORT_SYMBOL_GPL(menu_hrtimer_cancel);
static enum hrtimer_restart menu_hrtimer_notify(struct hrtimer *hrtimer)
{
int cpu = smp_processor_id();
+ struct menu_device *data = &per_cpu(menu_devices, cpu);
+
+ if (per_cpu(hrtimer_started, cpu) == 2)
+ data->correction_factor[data->bucket] = RESOLUTION * DECAY;

per_cpu(hrtimer_started, cpu) = 0;

@@ -386,7 +390,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
hrtimer_start(hrtmr, ns_to_ktime(1000 * timer_us),
HRTIMER_MODE_REL_PINNED);
/* menu hrtimer is started */
- per_cpu(hrtimer_started, cpu) = 1;
+ per_cpu(hrtimer_started, cpu) = 2;
}

}
--
1.7.7.4

2012-10-16 13:01:40

by Youquan Song

[permalink] [raw]
Subject: [PATCH 4/5] x86,idle: Set residency to 0 if target Cstate not enter

When cpuidle governor choose a C-state to enter for idle CPU, but it notice that
there is tasks request to be executed. So the idle CPU will not really enter
the target C-state and go to run task.

In this situation, it will use the residency of previous really entered target
C-states. Obviously, it is not reasonable.

So, this patch fix it by set the target C-state residency to 0.

Signed-off-by: Youquan Song <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
drivers/cpuidle/cpuidle.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index e28f6ea..01dca54 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -144,6 +144,10 @@ int cpuidle_idle_call(void)
/* ask the governor for the next state */
next_state = cpuidle_curr_governor->select(drv, dev);
if (need_resched()) {
+ dev->last_residency = 0;
+ /* give the governor an opportunity to reflect on the outcome */
+ if (cpuidle_curr_governor->reflect)
+ cpuidle_curr_governor->reflect(dev, next_state);
local_irq_enable();
return 0;
}
--
1.7.7.4

2012-10-16 13:02:01

by Youquan Song

[permalink] [raw]
Subject: [PATCH 5/5] x86,idle: Get typical recent sleep interval

The function detect_repeating_patterns was not very useful for
workloads with alternating long and short pauses, for example
virtual machines handling network requests for each other (say
a web and database server).

Instead, try to find a recent sleep interval that is somewhere
between the median and the mode sleep time, by discarding outliers
to the up side and recalculating the average and standard deviation
until that is no longer required.

This should do something sane with a sleep interval series like:

200 180 210 10000 30 1000 170 200

The current code would simply discard such a series, while the
new code will guess a typical sleep interval just shy of 200.

The original patch come from Rik van Riel <[email protected]>.

Signed-off-by: Youquan Song <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
drivers/cpuidle/governors/menu.c | 69 +++++++++++++++++++++++++------------
1 files changed, 46 insertions(+), 23 deletions(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 7dbac97..dbb9e1c 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -237,36 +237,59 @@ static enum hrtimer_restart menu_hrtimer_notify(struct hrtimer *hrtimer)
* of points is below a threshold. If it is... then use the
* average of these 8 points as the estimated value.
*/
-static int detect_repeating_patterns(struct menu_device *data)
+static u32 get_typical_interval(struct menu_device *data)
{
- int i;
- uint64_t avg = 0;
- uint64_t stddev = 0; /* contains the square of the std deviation */
- int ret = 0;
-
- /* first calculate average and standard deviation of the past */
- for (i = 0; i < INTERVALS; i++)
- avg += data->intervals[i];
- avg = avg / INTERVALS;
+ int i = 0, divisor = 0;
+ int64_t max = 0, avg = 0, stddev = 0;
+ int64_t thresh = LLONG_MAX; /* Discard outliers above this value. */
+ unsigned int ret = 0;

- /* if the avg is beyond the known next tick, it's worthless */
- if (avg > data->expected_us)
- return 0;
-
- for (i = 0; i < INTERVALS; i++)
- stddev += (data->intervals[i] - avg) *
- (data->intervals[i] - avg);
+again:

- stddev = stddev / INTERVALS;
+ /* first calculate average and standard deviation of the past */
+ max = avg = divisor = stddev = 0;
+ for (i = 0; i < INTERVALS; i++) {
+ int64_t value = data->intervals[i];
+ if (value <= thresh) {
+ avg += value;
+ divisor++;
+ if (value > max)
+ max = value;
+ }
+ }
+ do_div(avg, divisor);

+ for (i = 0; i < INTERVALS; i++) {
+ int64_t value = data->intervals[i];
+ if (value <= thresh) {
+ int64_t diff = value - avg;
+ stddev += diff * diff;
+ }
+ }
+ do_div(stddev, divisor);
+ stddev = int_sqrt(stddev);
/*
- * now.. if stddev is small.. then assume we have a
- * repeating pattern and predict we keep doing this.
+ * If we have outliers to the upside in our distribution, discard
+ * those by setting the threshold to exclude these outliers, then
+ * calculate the average and standard deviation again. Once we get
+ * down to the bottom 3/4 of our samples, stop excluding samples.
+ *
+ * This can deal with workloads that have long pauses interspersed
+ * with sporadic activity with a bunch of short pauses.
+ *
+ * The typical interval is obtained when standard deviation is small
+ * or standard deviation is small compared to the average interval.
*/
-
- if (avg && stddev < STDDEV_THRESH) {
+ if (((avg > stddev * 6) && (divisor * 4 >= INTERVALS * 3))
+ || stddev <= 20) {
data->predicted_us = avg;
ret = 1;
+ return ret;
+
+ } else if ((divisor * 4) > INTERVALS * 3) {
+ /* Exclude the max interval */
+ thresh = max - 1;
+ goto again;
}

return ret;
@@ -322,7 +345,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
data->predicted_us = div_round64(data->expected_us * data->correction_factor[data->bucket],
RESOLUTION * DECAY);

- repeat = detect_repeating_patterns(data);
+ repeat = get_typical_interval(data);

/*
* We want to default to C1 (hlt), not to busy polling
--
1.7.7.4

2012-10-16 13:02:34

by Youquan Song

[permalink] [raw]
Subject: [PATCH 2/5] x86,idle: Quickly notice prediction failure in general case

The prediction for future is difficult and when the cpuidle governor prediction
fails and govenor possibly choose the shallower C-state than it should. How to
quickly notice and find the failure becomes important for power saving.

The patch extends to general case that prediction logic get a small predicted
residency, so it choose a shallow C-state though the expected residency is large
. Once the prediction will be fail, the CPU will keep staying at shallow C-state
for a long time. Acutally, the CPU has change enter into deep C-state.
So when the expected residency is long enough but governor choose a shallow
C-state, an timer will be added in order to monitor if the prediction failure.

When C-state is waken up prior to the adding timer, the timer will be cancelled
initiatively. When the timer is triggered and menu governor will quickly notice
prediction failure and re-evaluates deeper C-states possibility.

Signed-off-by: Youquan Song <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
drivers/cpuidle/governors/menu.c | 22 ++++++++++++++++++++++
1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index beeab6a..b34bf11 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -114,6 +114,13 @@ static DEFINE_PER_CPU(int, hrtimer_started);
*
*/

+/*
+ * The C-state residency is so long that is is worthwhile to exit
+ * from the shallow C-state and re-enter into a deeper C-state.
+ */
+static unsigned int perfect_cstate_ms __read_mostly = 30;
+module_param(perfect_cstate_ms, uint, 0000);
+
struct menu_device {
int last_state_idx;
int needs_update;
@@ -351,6 +358,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
/* not deepest C-state chosen for low predicted residency */
if (low_predicted) {
unsigned int timer_us = 0;
+ unsigned int perfect_us = 0;

/*
* Set a timer to detect whether this sleep is much
@@ -361,12 +369,26 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
*/
timer_us = 2 * (data->predicted_us + MAX_DEVIATION);

+ perfect_us = perfect_cstate_ms * 1000;
+
if (repeat && (4 * timer_us < data->expected_us)) {
hrtimer_start(hrtmr, ns_to_ktime(1000 * timer_us),
HRTIMER_MODE_REL_PINNED);
/* menu hrtimer is started */
per_cpu(hrtimer_started, cpu) = 1;
+ } else if (perfect_us < data->expected_us) {
+ /*
+ * The next timer is long. This could be because
+ * we did not make a useful prediction.
+ * In that case, it makes sense to re-enter
+ * into a deeper C-state after some time.
+ */
+ hrtimer_start(hrtmr, ns_to_ktime(1000 * timer_us),
+ HRTIMER_MODE_REL_PINNED);
+ /* menu hrtimer is started */
+ per_cpu(hrtimer_started, cpu) = 1;
}
+
}

return data->last_state_idx;
--
1.7.7.4

2012-10-18 06:43:03

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 0/5] x86,idle: Enhance menu governor C-state prediction

Hi,

On Tuesday 16 of October 2012 21:04:35 Youquan Song wrote:
>
> The prediction for future is difficult and when the cpuidle governor prediction
> fails and govenor possibly choose the shallower C-state than it should. How to
> quickly notice and find the failure becomes important for power saving.
>
> cpuidle menu governor has a method to predict the repeat pattern if there are 8
> C-states residency which are continuous and the same or very close, so it will
> predict the next C-states residency will keep same residency time.
>
> This patchset adds a timer when menu governor choose a non-deepest C-state in
> order to wake up quickly from shallow C-state to avoid staying too long at
> shallow C-state for prediction failure. The timer is set to a time out value
> that is greater than predicted time and if the timer with the value is triggered
> , we can confidently conclude prediction is failure. When prediction
> succeeds, CPU is waken up from C-states in predicted time and the timer is not
> triggered and will be cancelled right after CPU waken up. When prediction fails,
> the timer is triggered to wake up CPU from shallow C-states, so menu governor
> will quickly notice that prediction fails and then re-evaluates deeper C-states
> possibility. This patchset can improves cpuidle prediction process for both
> repeat mode and general mode.
>
> The patchset integrates one patch from Rik van Riel <[email protected]>, which try
> to find a typical interval along with cut the upside outliers depends on
> historical sleep intervals. The patch tends to choose a shallow C-state to
> achieve better performance and ehancement of prediction failure will advise it
> if the deepest C-state should be chosen.
>
> Testing result:
>
> The whole patchset achieve good result after bunch of testing/tuning.
> Testing on two sockets Sandybridge server, SPECPower2008 get 2%~5% increase
> ssj_ops/watt; Running benchmark in phoronix-test-suite: compress-7zip,
> build-linux-kernel, apache, fio etc, it also proves to increase the
> performance/power; What's more, it not only boosts the performance but also
> saves power.
>
> There are also 2 cases will clear show this patchset benefit.
>
> One case is turbostat utility (tools/power/x86/turbostat) at kernel 3.3 or early
> . turbostat utility will read 10 registers one by one at Sandybridge, so it will
> generate 10 IPIs to wake up idle CPUs. So cpuidle menu governor will predict it
> is repeat mode and there is another IPI wake up idle CPU soon, so it keeps idle
> CPU stay at C1 state even though CPU is totally idle. However, in the turbostat
> , following 10 registers reading is sleep 5 seconds by default, so the idle CPU
> will keep at C1 for a long time though it is idle until break event occurs.
> In a idle Sandybridge system, run "./turbostat -v", we will notice that deep
> C-state dangles between "70% ~ 99%". After patched the kernel, we will notice
> deep C-state stays at >99.98%.
>
> Below is another case which will clearly show the patch much benefit:
>
> #include <stdlib.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <signal.h>
> #include <sys/time.h>
> #include <time.h>
> #include <pthread.h>
>
> volatile int * shutdown;
> volatile long * count;
> int delay = 20;
> int loop = 8;
>
> void usage(void)
> {
> fprintf(stderr,
> "Usage: idle_predict [options]\n"
> " --help -h Print this help\n"
> " --thread -n Thread number\n"
> " --loop -l Loop times in shallow Cstate\n"
> " --delay -t Sleep time (uS)in shallow Cstate\n");
> }
>
> void *simple_loop() {
> int idle_num = 1;
> while (!(*shutdown)) {
> *count = *count + 1;
>
> if (idle_num % loop)
> usleep(delay);
> else {
> /* sleep 1 second */
> usleep(1000000);
> idle_num = 0;
> }
> idle_num++;
> }
>
> }
>
> static void sighand(int sig)
> {
> *shutdown = 1;
> }
>
> int main(int argc, char *argv[])
> {
> sigset_t sigset;
> int signum = SIGALRM;
> int i, c, er = 0, thread_num = 8;
> pthread_t pt[1024];
>
> static char optstr[] = "n:l:t:h:";
>
> while ((c = getopt(argc, argv, optstr)) != EOF)
> switch (c) {
> case 'n':
> thread_num = atoi(optarg);
> break;
> case 'l':
> loop = atoi(optarg);
> break;
> case 't':
> delay = atoi(optarg);
> break;
> case 'h':
> default:
> usage();
> exit(1);
> }
>
> printf("thread=%d,loop=%d,delay=%d\n",thread_num,loop,delay);
> count = malloc(sizeof(long));
> shutdown = malloc(sizeof(int));
> *count = 0;
> *shutdown = 0;
>
> sigemptyset(&sigset);
> sigaddset(&sigset, signum);
> sigprocmask (SIG_BLOCK, &sigset, NULL);
> signal(SIGINT, sighand);
> signal(SIGTERM, sighand);
>
> for(i = 0; i < thread_num ; i++)
> pthread_create(&pt[i], NULL, simple_loop, NULL);
>
> for (i = 0; i < thread_num; i++)
> pthread_join(pt[i], NULL);
>
> exit(0);
> }
>
> Get powertop v2 from git://github.com/fenrus75/powertop, build powertop.
> After build the above test application, then run it.
> Test plaform can be Intel Sandybridge or other recent platforms.
> #./idle_predict -l 10 &
> #./powertop
>
> We will find that deep C-state will dangle between 40%~100% and much time spent
> on C1 state. It is because menu governor wrongly predict that repeat mode
> is kept, so it will choose the C1 shallow C-state even though it has chance to
> sleep 1 second in deep C-state.
>
> While after patched the kernel, we find that deep C-state will keep >99.6%.
>
> Thanks for help from Arjan, Len Brown and Rik!

The whole series looks good to me, but I think it would be better to fold
patch [3/5] into [2/5] and use #defined symbols or enums instead of "magic"
numbers 1 and 2 as values for hrtimer_started.

Moreover, patch [4/5] seems to be a bug fix that should go into -stable
regardless of the other patches in the series.

Thanks,
Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.