2017-09-30 07:21:32

by Aubrey Li

Subject: [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality

We found that under some latency-intensive workloads, short idle periods
occur very frequently, and then the idle entry and exit paths start to
dominate, so it is important to optimize them. To detect the short-idle
pattern, we need to figure out how long the coming idle period will be
and what the threshold for a short idle interval is.

This proposal introduces a cpu idle prediction functionality to catch
the short-idle pattern.

Firstly, we check the IRQ timings subsystem to see whether an event is
coming soon.
-- https://lwn.net/Articles/691297/

Secondly, we check the scheduler's idle statistics to see whether we are
likely to go into a short idle.
-- https://patchwork.kernel.org/patch/2839221/

Thirdly, we predict the next idle interval by using the prediction
functionality in the idle governor, if it has one.

For the threshold of the short idle interval, we record the timestamps
of idle entry to measure its overhead, and multiply by a tunable
parameter exposed here:
-- /proc/sys/kernel/fast_idle_ratio

In this proposal, we use the output of the idle prediction to skip
turning the tick off when a short idle is predicted. Reprogramming the
hardware timer twice (off and on) is expensive for a very short idle.
Some further optimizations could be done based on the same indicator.

I observed that when the system is idle, the idle predictor reports 20/s
long idles and ZERO fast idles on one CPU. When the workload is running,
the idle predictor reports 72899/s fast idles and ZERO long idles on the
same CPU.
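The decision flow described above can be sketched in plain userspace C. This is an illustrative model only: the struct, function names, and the way the three predictors are combined are simplifications of the proposal, not the kernel interfaces.

```c
/* Hypothetical sketch of the fast-idle decision: the threshold is the
 * measured idle entry/exit overhead scaled by the fast_idle_ratio knob,
 * and each predictor is consulted in turn. All times in microseconds. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct idle_predictor {
	uint64_t entry_overhead_us;	/* measured idle entry/exit overhead */
	unsigned int fast_idle_ratio;	/* /proc/sys/kernel/fast_idle_ratio */
};

/* Return true if the coming idle is predicted to be a fast (short) idle. */
static bool predict_fast_idle(const struct idle_predictor *p,
			      uint64_t next_irq_us,	/* IRQ timings estimate */
			      uint64_t rq_avg_idle_us,	/* scheduler statistic */
			      uint64_t governor_us)	/* governor prediction */
{
	uint64_t threshold = p->entry_overhead_us * p->fast_idle_ratio;

	if (next_irq_us < threshold)		/* 1) IRQ event coming soon */
		return true;
	if (rq_avg_idle_us < threshold)		/* 2) recent idles were short */
		return true;
	return governor_us < threshold;		/* 3) governor's own prediction */
}
```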

Aubrey Li (8):
cpuidle: menu: extract prediction functionality
cpuidle: record the overhead of idle entry
cpuidle: add a new predict interface
tick/nohz: keep tick on for a fast idle
timers: keep sleep length updated as needed
cpuidle: make fast idle threshold tunable
cpuidle: introduce irq timing to make idle prediction
cpuidle: introduce run queue average idle to make idle prediction

drivers/cpuidle/Kconfig | 1 +
drivers/cpuidle/cpuidle.c | 109 +++++++++++++++++++++++++++++++++++++++
drivers/cpuidle/governors/menu.c | 69 ++++++++++++++++---------
include/linux/cpuidle.h | 21 ++++++++
kernel/sched/idle.c | 14 ++++-
kernel/sysctl.c | 12 +++++
kernel/time/tick-sched.c | 7 +++
7 files changed, 209 insertions(+), 24 deletions(-)

--
2.7.4


2017-09-30 07:21:37

by Aubrey Li

Subject: [RFC PATCH v2 2/8] cpuidle: record the overhead of idle entry

Record the overhead of idle entry in microseconds.

Signed-off-by: Aubrey Li <[email protected]>
---
drivers/cpuidle/cpuidle.c | 33 +++++++++++++++++++++++++++++++++
include/linux/cpuidle.h | 14 ++++++++++++++
kernel/sched/idle.c | 8 +++++++-
3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 60bb64f..4066308 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -302,6 +302,39 @@ void cpuidle_reflect(struct cpuidle_device *dev, int index)
cpuidle_curr_governor->reflect(dev, index);
}

+/* cpuidle_entry_start - record idle entry start */
+void cpuidle_entry_start(void)
+{
+ struct cpuidle_device *dev = cpuidle_get_device();
+
+ if (dev)
+ dev->idle_stat.entry_start = local_clock();
+}
+
+/*
+ * cpuidle_entry_end - record idle entry end, and maintain
+ * the entry overhead average in micro-second
+ */
+void cpuidle_entry_end(void)
+{
+ struct cpuidle_device *dev = cpuidle_get_device();
+ u64 overhead;
+ s64 diff;
+
+ if (dev) {
+ dev->idle_stat.entry_end = local_clock();
+ overhead = div_u64(dev->idle_stat.entry_end -
+ dev->idle_stat.entry_start, NSEC_PER_USEC);
+ diff = overhead - dev->idle_stat.overhead;
+ dev->idle_stat.overhead += diff >> 3;
+ /*
+ * enforce a minimum overhead of 1us
+ */
+ if (dev->idle_stat.overhead == 0)
+ dev->idle_stat.overhead = 1;
+ }
+}
+
/**
* cpuidle_install_idle_handler - installs the cpuidle idle loop handler
*/
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index fc1e5d7..cad9b71 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -72,6 +72,15 @@ struct cpuidle_device_kobj;
struct cpuidle_state_kobj;
struct cpuidle_driver_kobj;

+struct cpuidle_stat {
+ u64 entry_start; /* nanosecond */
+ u64 entry_end; /* nanosecond */
+ u64 overhead; /* microsecond */
+ unsigned int predicted_us; /* microsecond */
+ bool predicted; /* ever predicted? */
+ bool fast_idle; /* fast idle? */
+};
+
struct cpuidle_device {
unsigned int registered:1;
unsigned int enabled:1;
@@ -89,6 +98,7 @@ struct cpuidle_device {
cpumask_t coupled_cpus;
struct cpuidle_coupled *coupled;
#endif
+ struct cpuidle_stat idle_stat;
};

DECLARE_PER_CPU(struct cpuidle_device *, cpuidle_devices);
@@ -131,6 +141,8 @@ extern bool cpuidle_not_available(struct cpuidle_driver *drv,

extern int cpuidle_select(struct cpuidle_driver *drv,
struct cpuidle_device *dev);
+extern void cpuidle_entry_start(void);
+extern void cpuidle_entry_end(void);
extern int cpuidle_enter(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int index);
extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
@@ -164,6 +176,8 @@ static inline bool cpuidle_not_available(struct cpuidle_driver *drv,
static inline int cpuidle_select(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
{return -ENODEV; }
+static inline void cpuidle_entry_start(void) { }
+static inline void cpuidle_entry_end(void) { }
static inline int cpuidle_enter(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int index)
{return -ENODEV; }
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 6c23e30..0951dac 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -210,6 +210,12 @@ static void cpuidle_idle_call(void)
static void do_idle(void)
{
/*
+ * We record idle entry overhead now, so any deferrable items
+ * on the idle entry path need to be placed between cpuidle_entry_start()
+ * and cpuidle_entry_end()
+ */
+ cpuidle_entry_start();
+ /*
* If the arch has a polling bit, we maintain an invariant:
*
* Our polling bit is clear if we're not scheduled (i.e. if rq->curr !=
@@ -217,10 +223,10 @@ static void do_idle(void)
* then setting need_resched is guaranteed to cause the CPU to
* reschedule.
*/
-
__current_set_polling();
quiet_vmstat();
tick_nohz_idle_enter();
+ cpuidle_entry_end();

while (!need_resched()) {
check_pgt_cache();
--
2.7.4
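The running average maintained in cpuidle_entry_end() above is an exponentially weighted moving average with a 1/8 weight, computed with a shift instead of a division. A minimal userspace sketch of that update (illustrative; the function name is a stand-in for the kernel code):

```c
/* new_avg = avg + (sample - avg) / 8, plus the patch's 1us floor. */
#include <assert.h>
#include <stdint.h>

static uint64_t ewma_update(uint64_t avg_us, uint64_t sample_us)
{
	int64_t diff = (int64_t)sample_us - (int64_t)avg_us;

	avg_us += diff >> 3;	/* weight the new sample by 1/8 */
	if (avg_us == 0)
		avg_us = 1;	/* enforce a minimum overhead of 1us */
	return avg_us;
}
```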

2017-09-30 07:21:41

by Aubrey Li

Subject: [RFC PATCH v2 3/8] cpuidle: add a new predict interface

For governors that have prediction functionality, add a new predict
interface to the cpuidle framework so it can be called and used.
---
drivers/cpuidle/cpuidle.c | 34 ++++++++++++++++++++++++++++++++++
drivers/cpuidle/governors/menu.c | 7 +++++++
include/linux/cpuidle.h | 3 +++
kernel/sched/idle.c | 1 +
4 files changed, 45 insertions(+)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 4066308..ef6f7dd 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -336,6 +336,40 @@ void cpuidle_entry_end(void)
}

/**
+ * cpuidle_predict - predict whether the coming idle is a fast idle or not
+ */
+void cpuidle_predict(void)
+{
+ struct cpuidle_device *dev = cpuidle_get_device();
+ unsigned int overhead_threshold;
+
+ if (!dev)
+ return;
+
+ overhead_threshold = dev->idle_stat.overhead;
+
+ if (cpuidle_curr_governor->predict) {
+ dev->idle_stat.predicted_us = cpuidle_curr_governor->predict();
+ /*
+ * notify the idle governor so it can avoid duplicate
+ * prediction computation
+ */
+ dev->idle_stat.predicted = true;
+ if (dev->idle_stat.predicted_us < overhead_threshold) {
+ /*
+ * notify tick subsystem to keep ticking
+ * for the coming idle
+ */
+ dev->idle_stat.fast_idle = true;
+ } else
+ dev->idle_stat.fast_idle = false;
+ } else {
+ dev->idle_stat.predicted = false;
+ dev->idle_stat.fast_idle = false;
+ }
+}
+
+/**
* cpuidle_install_idle_handler - installs the cpuidle idle loop handler
*/
void cpuidle_install_idle_handler(void)
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 6bed197..90b2a10 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -344,6 +344,12 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
if (unlikely(latency_req == 0))
return 0;

+ /* don't predict again if the idle framework already did it */
+ if (!dev->idle_stat.predicted)
+ menu_predict();
+ else
+ dev->idle_stat.predicted = false;
+
if (CPUIDLE_DRIVER_STATE_START > 0) {
struct cpuidle_state *s = &drv->states[CPUIDLE_DRIVER_STATE_START];
unsigned int polling_threshold;
@@ -518,6 +524,7 @@ static struct cpuidle_governor menu_governor = {
.enable = menu_enable_device,
.select = menu_select,
.reflect = menu_reflect,
+ .predict = menu_predict,
};

/**
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index cad9b71..9ca0288 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -143,6 +143,7 @@ extern int cpuidle_select(struct cpuidle_driver *drv,
struct cpuidle_device *dev);
extern void cpuidle_entry_start(void);
extern void cpuidle_entry_end(void);
+extern void cpuidle_predict(void);
extern int cpuidle_enter(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int index);
extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
@@ -178,6 +179,7 @@ static inline int cpuidle_select(struct cpuidle_driver *drv,
{return -ENODEV; }
static inline void cpuidle_entry_start(void) { }
static inline void cpuidle_entry_end(void) { }
+static inline void cpuidle_predict(void) { }
static inline int cpuidle_enter(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int index)
{return -ENODEV; }
@@ -255,6 +257,7 @@ struct cpuidle_governor {
int (*select) (struct cpuidle_driver *drv,
struct cpuidle_device *dev);
void (*reflect) (struct cpuidle_device *dev, int index);
+ unsigned int (*predict)(void);
};

#ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 0951dac..8704f3c 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -225,6 +225,7 @@ static void do_idle(void)
*/
__current_set_polling();
quiet_vmstat();
+ cpuidle_predict();
tick_nohz_idle_enter();
cpuidle_entry_end();

--
2.7.4
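The dispatch added in cpuidle_predict() above, including the `predicted` flag that lets the governor's ->select() skip recomputing, can be modeled in userspace like this. The structures are simplified stand-ins for the kernel's cpuidle_governor and idle_stat, and fake_predict() is a made-up governor hook for illustration.

```c
/* Sketch: call the governor's ->predict() hook if present, cache the
 * result, and flag it so ->select() does not predict a second time. */
#include <assert.h>
#include <stdbool.h>

struct governor {
	unsigned int (*predict)(void);
};

struct idle_stat {
	unsigned int predicted_us;
	bool predicted;		/* tells ->select() the work is already done */
	bool fast_idle;
};

static void do_predict(struct idle_stat *st, const struct governor *gov,
		       unsigned int threshold_us)
{
	if (gov->predict) {
		st->predicted_us = gov->predict();
		st->predicted = true;		/* avoid duplicate prediction */
		st->fast_idle = st->predicted_us < threshold_us;
	} else {
		st->predicted = false;
		st->fast_idle = false;
	}
}

/* A hypothetical governor hook used only for demonstration. */
static unsigned int fake_predict(void)
{
	return 5;	/* predicted idle interval in microseconds */
}
```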

2017-09-30 07:21:45

by Aubrey Li

Subject: [RFC PATCH v2 5/8] timers: keep sleep length updated as needed

The sleep length indicates how long we will be idle. Currently, it is
updated only when tick nohz enters. This patch series introduces a new
requirement on the tick, so we should keep the sleep length updated as
needed.
---
kernel/time/tick-sched.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index d663fab..94fb9b8 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1008,8 +1008,11 @@ void tick_nohz_irq_exit(void)
*/
ktime_t tick_nohz_get_sleep_length(void)
{
+ struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);

+ ts->sleep_length = ktime_sub(dev->next_event, ktime_get());
+
return ts->sleep_length;
}

--
2.7.4
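The point of the change above is that the sleep length is recomputed at query time as the distance from now to the clock-event device's next programmed event, rather than returning a value cached at nohz entry. As a sketch, with plain signed nanosecond integers standing in for ktime_t:

```c
/* Illustrative model of ktime_sub(dev->next_event, ktime_get()). */
#include <assert.h>
#include <stdint.h>

static int64_t sleep_length_ns(int64_t next_event_ns, int64_t now_ns)
{
	/* May be negative if the event is already due. */
	return next_event_ns - now_ns;
}
```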

2017-09-30 07:21:51

by Aubrey Li

Subject: [RFC PATCH v2 7/8] cpuidle: introduce irq timing to make idle prediction

Introduce the IRQ timings output as a factor in predicting the duration
of the coming idle period.

Signed-off-by: Aubrey Li <[email protected]>
---
drivers/cpuidle/Kconfig | 1 +
drivers/cpuidle/cpuidle.c | 17 ++++++++++++++++-
2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index 7e48eb5..8b07e1c 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -5,6 +5,7 @@ config CPU_IDLE
default y if ACPI || PPC_PSERIES
select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE)
select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE)
+ select IRQ_TIMINGS
help
CPU idle is a generic framework for supporting software-controlled
idle processor power management. It includes modular cross-platform
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 5d4f0b6..be56cea 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -22,6 +22,7 @@
#include <linux/module.h>
#include <linux/suspend.h>
#include <linux/tick.h>
+#include <linux/interrupt.h>
#include <trace/events/power.h>

#include "cpuidle.h"
@@ -342,13 +343,27 @@ void cpuidle_entry_end(void)
void cpuidle_predict(void)
{
struct cpuidle_device *dev = cpuidle_get_device();
- unsigned int overhead_threshold;
+ unsigned int idle_interval, overhead_threshold;
+ u64 now, next_evt;

if (!dev)
return;

overhead_threshold = dev->idle_stat.overhead * sysctl_fast_idle_ratio;

+ /*
+ * check irq timings if the next event is coming soon
+ */
+ now = local_clock();
+ local_irq_disable();
+ next_evt = irq_timings_next_event(now);
+ local_irq_enable();
+ idle_interval = div_u64(next_evt - now, NSEC_PER_USEC);
+ if (idle_interval < overhead_threshold) {
+ dev->idle_stat.fast_idle = true;
+ return;
+ }
+
if (cpuidle_curr_governor->predict) {
dev->idle_stat.predicted_us = cpuidle_curr_governor->predict();
/*
--
2.7.4
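The check added above converts the distance to the next predicted interrupt into microseconds and flags a fast idle when it falls below the threshold. A userspace sketch (irq_timings_next_event() is a kernel interface; plain integers stand in for it here):

```c
/* Sketch of the IRQ-timings fast-idle check in cpuidle_predict(). */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL

static bool irq_predicts_fast_idle(uint64_t now_ns, uint64_t next_evt_ns,
				   uint64_t threshold_us)
{
	uint64_t idle_interval_us = (next_evt_ns - now_ns) / NSEC_PER_USEC;

	return idle_interval_us < threshold_us;
}
```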

2017-09-30 07:22:01

by Aubrey Li

Subject: [RFC PATCH v2 8/8] cpuidle: introduce run queue average idle to make idle prediction

Introduce the scheduler's run queue average idle as a factor in making
the idle prediction.

Signed-off-by: Aubrey Li <[email protected]>
---
drivers/cpuidle/cpuidle.c | 12 ++++++++++++
include/linux/cpuidle.h | 1 +
kernel/sched/idle.c | 5 +++++
3 files changed, 18 insertions(+)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index be56cea..9424a2d 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -364,6 +364,18 @@ void cpuidle_predict(void)
return;
}

+ /*
+ * check scheduler if the coming idle is likely a fast idle
+ */
+ idle_interval = div_u64(sched_idle_avg(), NSEC_PER_USEC);
+ if (idle_interval < overhead_threshold) {
+ dev->idle_stat.fast_idle = true;
+ return;
+ }
+
+ /*
+ * check the idle governor if the coming idle is likely a fast idle
+ */
if (cpuidle_curr_governor->predict) {
dev->idle_stat.predicted_us = cpuidle_curr_governor->predict();
/*
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 45b8264..387d72b 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -234,6 +234,7 @@ static inline void cpuidle_use_deepest_state(bool enable)
/* kernel/sched/idle.c */
extern void sched_idle_set_state(struct cpuidle_state *idle_state);
extern void default_idle_call(void);
+extern u64 sched_idle_avg(void);

#ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
void cpuidle_coupled_parallel_barrier(struct cpuidle_device *dev, atomic_t *a);
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 8704f3c..d23b472 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -30,6 +30,11 @@ void sched_idle_set_state(struct cpuidle_state *idle_state)
idle_set_state(this_rq(), idle_state);
}

+u64 sched_idle_avg(void)
+{
+ return this_rq()->avg_idle;
+}
+
static int __read_mostly cpu_idle_force_poll;

void cpu_idle_poll_ctrl(bool enable)
--
2.7.4
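The scheduler statistic consulted above, rq->avg_idle, is itself a running average of observed idle durations (the scheduler uses a similar 1/8-weight update). A hedged userspace sketch of maintaining and consulting such a figure; the struct and helper names are illustrative, not the kernel's:

```c
/* Maintain an EWMA of idle durations per "runqueue" and compare it,
 * converted to microseconds, against the fast-idle threshold. */
#include <assert.h>
#include <stdint.h>

struct rq_stat {
	uint64_t avg_idle_ns;
};

static void record_idle_duration(struct rq_stat *rq, uint64_t idle_ns)
{
	int64_t diff = (int64_t)idle_ns - (int64_t)rq->avg_idle_ns;

	rq->avg_idle_ns += diff / 8;	/* 1/8 weight for the new sample */
}

static int likely_fast_idle(const struct rq_stat *rq, uint64_t threshold_us)
{
	return rq->avg_idle_ns / 1000 < threshold_us;
}
```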

2017-09-30 07:21:48

by Aubrey Li

Subject: [RFC PATCH v2 6/8] cpuidle: make fast idle threshold tunable

Add a knob to make the fast idle threshold tunable.

Signed-off-by: Aubrey Li <[email protected]>
---
drivers/cpuidle/cpuidle.c | 3 ++-
include/linux/cpuidle.h | 1 +
kernel/sysctl.c | 12 ++++++++++++
3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 6cb7e17..5d4f0b6 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -35,6 +35,7 @@ LIST_HEAD(cpuidle_detected_devices);
static int enabled_devices;
static int off __read_mostly;
static int initialized __read_mostly;
+int sysctl_fast_idle_ratio = 10;

int cpuidle_disabled(void)
{
@@ -346,7 +347,7 @@ void cpuidle_predict(void)
if (!dev)
return;

- overhead_threshold = dev->idle_stat.overhead;
+ overhead_threshold = dev->idle_stat.overhead * sysctl_fast_idle_ratio;

if (cpuidle_curr_governor->predict) {
dev->idle_stat.predicted_us = cpuidle_curr_governor->predict();
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 791db15..45b8264 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -24,6 +24,7 @@ struct module;
struct cpuidle_device;
struct cpuidle_driver;

+extern int sysctl_fast_idle_ratio;

/****************************
* CPUIDLE DEVICE INTERFACE *
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..97f7e8af 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -67,6 +67,7 @@
#include <linux/kexec.h>
#include <linux/bpf.h>
#include <linux/mount.h>
+#include <linux/cpuidle.h>

#include <linux/uaccess.h>
#include <asm/processor.h>
@@ -1229,6 +1230,17 @@ static struct ctl_table kern_table[] = {
.extra2 = &one,
},
#endif
+#ifdef CONFIG_CPU_IDLE
+ {
+ .procname = "fast_idle_ratio",
+ .data = &sysctl_fast_idle_ratio,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &one,
+ .extra2 = &one_hundred,
+ },
+#endif
{ }
};

--
2.7.4
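With the knob above, the threshold becomes the measured entry overhead scaled by the ratio, which the sysctl clamps to 1..100. A small illustrative helper (the clamping mirrors the ctl_table's extra1/extra2 bounds; the function itself is not kernel code):

```c
/* threshold = entry overhead * fast_idle_ratio, ratio clamped to 1..100. */
#include <assert.h>

static unsigned int fast_idle_threshold_us(unsigned int overhead_us, int ratio)
{
	if (ratio < 1)
		ratio = 1;	/* sysctl lower bound (extra1 = &one) */
	if (ratio > 100)
		ratio = 100;	/* sysctl upper bound (extra2 = &one_hundred) */
	return overhead_us * (unsigned int)ratio;
}
```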

2017-09-30 07:22:46

by Aubrey Li

Subject: [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle

If the next idle period is expected to be a fast idle, we should keep
the tick on before going into idle.

Signed-off-by: Aubrey Li <[email protected]>
---
drivers/cpuidle/cpuidle.c | 14 ++++++++++++++
include/linux/cpuidle.h | 2 ++
kernel/time/tick-sched.c | 4 ++++
3 files changed, 20 insertions(+)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index ef6f7dd..6cb7e17 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -370,6 +370,20 @@ void cpuidle_predict(void)
}

/**
+ * cpuidle_fast_idle - report whether the coming idle is predicted to be fast
+ * This function can be called in the IRQ exit path, so keep it as fast as possible
+ */
+bool cpuidle_fast_idle(void)
+{
+ struct cpuidle_device *dev = cpuidle_get_device();
+
+ if (!dev)
+ return false;
+
+ return dev->idle_stat.fast_idle;
+}
+
+/**
* cpuidle_install_idle_handler - installs the cpuidle idle loop handler
*/
void cpuidle_install_idle_handler(void)
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 9ca0288..791db15 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -144,6 +144,7 @@ extern int cpuidle_select(struct cpuidle_driver *drv,
extern void cpuidle_entry_start(void);
extern void cpuidle_entry_end(void);
extern void cpuidle_predict(void);
+extern bool cpuidle_fast_idle(void);
extern int cpuidle_enter(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int index);
extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
@@ -180,6 +181,7 @@ static inline int cpuidle_select(struct cpuidle_driver *drv,
static inline void cpuidle_entry_start(void) { }
static inline void cpuidle_entry_end(void) { }
static inline void cpuidle_predict(void) { }
+static inline bool cpuidle_fast_idle(void) { return false; }
static inline int cpuidle_enter(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int index)
{return -ENODEV; }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c7a899c..d663fab 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -27,6 +27,7 @@
#include <linux/irq_work.h>
#include <linux/posix-timers.h>
#include <linux/context_tracking.h>
+#include <linux/cpuidle.h>

#include <asm/irq_regs.h>

@@ -916,6 +917,9 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
return false;
}

+ if (cpuidle_fast_idle())
+ return false;
+
return true;
}

--
2.7.4
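The gate added to can_stop_idle_tick() above can be modeled as follows: the tick is kept running whenever the predictor has flagged the coming idle as fast, on the theory that reprogramming the timer twice costs more than taking one extra tick. The names are simplified stand-ins for the kernel functions.

```c
/* Sketch of the fast-idle gate in the tick-stop decision. */
#include <assert.h>
#include <stdbool.h>

static bool fast_idle_flag;	/* stands in for dev->idle_stat.fast_idle */

static bool cpuidle_fast_idle_stub(void)
{
	return fast_idle_flag;
}

static bool can_stop_tick(bool other_conditions_ok)
{
	if (!other_conditions_ok)	/* the existing checks */
		return false;
	if (cpuidle_fast_idle_stub())	/* new check: short idle expected */
		return false;
	return true;
}
```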

2017-09-30 07:21:35

by Aubrey Li

Subject: [RFC PATCH v2 1/8] cpuidle: menu: extract prediction functionality

There are several factors in the menu governor to predict the next
idle interval:
- the next timer
- the recent idle interval history
- the corrected idle interval pattern
These factors are common enough to be extracted into one function.

Signed-off-by: Aubrey Li <[email protected]>
---
drivers/cpuidle/governors/menu.c | 64 +++++++++++++++++++++++++---------------
1 file changed, 40 insertions(+), 24 deletions(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 61b64c2..6bed197 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -276,6 +276,45 @@ static unsigned int get_typical_interval(struct menu_device *data)
}

/**
+ * menu_predict - predict the coming idle interval
+ *
+ * Return the predicted coming idle interval in microseconds
+ */
+static unsigned int menu_predict(void)
+{
+ struct menu_device *data = this_cpu_ptr(&menu_devices);
+ unsigned int expected_interval;
+ int cpu = smp_processor_id();
+
+ if (!data)
+ return UINT_MAX;
+
+ /* determine the expected residency time, round up */
+ data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
+
+ data->bucket = which_bucket(data->next_timer_us, nr_iowait_cpu(cpu));
+
+ /*
+ * Force the result of multiplication to be 64 bits even if both
+ * operands are 32 bits.
+ * Make sure to round up for half microseconds.
+ */
+ data->predicted_us = DIV_ROUND_CLOSEST_ULL((uint64_t)
+ data->next_timer_us * data->correction_factor[data->bucket],
+ RESOLUTION * DECAY);
+
+ expected_interval = get_typical_interval(data);
+ expected_interval = min(expected_interval, data->next_timer_us);
+
+ /*
+ * Use the lowest expected idle interval to pick the idle state.
+ */
+ data->predicted_us = min(data->predicted_us, expected_interval);
+
+ return data->predicted_us;
+}
+
+/**
* menu_select - selects the next idle state to enter
* @drv: cpuidle driver containing state data
* @dev: the CPU
@@ -289,7 +328,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
int first_idx;
int idx;
unsigned int interactivity_req;
- unsigned int expected_interval;
unsigned long nr_iowaiters, cpu_load;
int resume_latency = dev_pm_qos_raw_read_value(device);

@@ -306,24 +344,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
if (unlikely(latency_req == 0))
return 0;

- /* determine the expected residency time, round up */
- data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
-
- get_iowait_load(&nr_iowaiters, &cpu_load);
- data->bucket = which_bucket(data->next_timer_us, nr_iowaiters);
-
- /*
- * Force the result of multiplication to be 64 bits even if both
- * operands are 32 bits.
- * Make sure to round up for half microseconds.
- */
- data->predicted_us = DIV_ROUND_CLOSEST_ULL((uint64_t)data->next_timer_us *
- data->correction_factor[data->bucket],
- RESOLUTION * DECAY);
-
- expected_interval = get_typical_interval(data);
- expected_interval = min(expected_interval, data->next_timer_us);
-
if (CPUIDLE_DRIVER_STATE_START > 0) {
struct cpuidle_state *s = &drv->states[CPUIDLE_DRIVER_STATE_START];
unsigned int polling_threshold;
@@ -345,14 +365,10 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
}

/*
- * Use the lowest expected idle interval to pick the idle state.
- */
- data->predicted_us = min(data->predicted_us, expected_interval);
-
- /*
* Use the performance multiplier and the user-configurable
* latency_req to determine the maximum exit latency.
*/
+ get_iowait_load(&nr_iowaiters, &cpu_load);
interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
if (latency_req > interactivity_req)
latency_req = interactivity_req;
--
2.7.4
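The math extracted into menu_predict() above scales the next-timer estimate by a per-bucket correction factor in fixed point, then takes the minimum with the typical-interval estimate. A userspace sketch of that combination; RESOLUTION and DECAY use the menu governor's values, but the function is illustrative, not the kernel code:

```c
/* predicted = round(next_timer * correction / (RESOLUTION * DECAY)),
 * result    = min(predicted, min(typical_interval, next_timer)). */
#include <assert.h>
#include <stdint.h>

#define RESOLUTION 1024
#define DECAY 8

static unsigned int menu_predict_us(unsigned int next_timer_us,
				    uint64_t correction_factor,
				    unsigned int typical_interval_us)
{
	/* 64-bit multiply, rounded to nearest, as in the patch. */
	unsigned int predicted = (unsigned int)
		(((uint64_t)next_timer_us * correction_factor +
		  (RESOLUTION * DECAY) / 2) / (RESOLUTION * DECAY));
	unsigned int expected = typical_interval_us < next_timer_us ?
				typical_interval_us : next_timer_us;

	/* Use the lowest expected idle interval. */
	return predicted < expected ? predicted : expected;
}
```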

2017-11-30 01:38:22

by Rafael J. Wysocki

Subject: Re: [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality

On Thu, Nov 30, 2017 at 2:00 AM, Li, Aubrey <[email protected]> wrote:
> Hi,
>
> Thanks Rafael's comments against V2. I'd like to ping here to see which
> direction this proposal should go and what fundamental change this proposal
> should make.
>
> I'm also open to any suggestions if my proposal is not on the right way.

To me (and I discussed that with Peter briefly last month in Prague)
the way to go would be to avoid stopping the scheduler tick until we
are entirely sure that it should be stopped. Which is when the
governor (any governor BTW, not just menu) chooses the target idle
state for the CPU and the target residency of that state is long
enough for the tick to be stopped ("long enough" here means that if
the tick is not stopped, the time spent by the CPU in the target idle
state will be shorter than the target residency). Note that this is
totally independent on what governor is used and it doesn't involve
any "refinements" or "improvements" of the prediction whatever.

Of course, for the governor to make the state selection, it generally
will have to know when the next non-tick timer is going to expire and
getting that information to it will require some effort.

Thanks,
Rafael
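The policy Rafael describes can be sketched in a few lines: stop the tick only once the governor has chosen a state, and only if the next tick would otherwise cut the state's residency short. This is an illustrative userspace model of the described direction, not code from any posted patch.

```c
/* Stop the tick only when the selected state's target residency would
 * not fit before the next tick anyway. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool should_stop_tick(uint64_t target_residency_us,
			     uint64_t time_to_next_tick_us)
{
	/*
	 * If the tick stays on, the idle period ends at the next tick;
	 * stopping it pays off only when the target residency is longer
	 * than the time remaining until that tick.
	 */
	return time_to_next_tick_us < target_residency_us;
}
```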


2017-11-30 01:01:31

by Li, Aubrey

Subject: Re: [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality

Hi,

Thanks for Rafael's comments on V2. I'd like to ping here to see which
direction this proposal should go in and what fundamental changes it
should make.

I'm also open to any suggestions if my proposal is not on the right track.

Thanks,
-Aubrey

On 2017/9/30 15:20, Aubrey Li wrote:
> We found under some latency intensive workloads, short idle periods occurs
> very common, then idle entry and exit path starts to dominate, so it's
> important to optimize them. To determine the short idle pattern, we need
> to figure out how long of the coming idle and the threshold of the short
> idle interval.
>
> A cpu idle prediction functionality is introduced in this proposal to catch
> the short idle pattern.
>
> Firstly, we check the IRQ timings subsystem, if there is an event
> coming soon.
> -- https://lwn.net/Articles/691297/
>
> Secondly, we check the idle statistics of scheduler, if it's likely we'll
> go into a short idle.
> -- https://patchwork.kernel.org/patch/2839221/
>
> Thirdly, we predict the next idle interval by using the prediction
> fucntionality in the idle governor if it has.
>
> For the threshold of the short idle interval, we record the timestamps of
> the idle entry, and multiply by a tunable parameter at here:
> -- /proc/sys/kernel/fast_idle_ratio
>
> We use the output of the idle prediction to skip turning tick off if a
> short idle is determined in this proposal. Reprogramming hardware timer
> twice(off and on) is expensive for a very short idle. There are some
> potential optimizations can be done according to the same indicator.
>
> I observed when system is idle, the idle predictor reports 20/s long idle
> and ZERO fast idle on one CPU. And when the workload is running, the idle
> predictor reports 72899/s fast idle and ZERO long idle on the same CPU.
>
> Aubrey Li (8):
> cpuidle: menu: extract prediction functionality
> cpuidle: record the overhead of idle entry
> cpuidle: add a new predict interface
> tick/nohz: keep tick on for a fast idle
> timers: keep sleep length updated as needed
> cpuidle: make fast idle threshold tunable
> cpuidle: introduce irq timing to make idle prediction
> cpuidle: introduce run queue average idle to make idle prediction
>
> drivers/cpuidle/Kconfig | 1 +
> drivers/cpuidle/cpuidle.c | 109 +++++++++++++++++++++++++++++++++++++++
> drivers/cpuidle/governors/menu.c | 69 ++++++++++++++++---------
> include/linux/cpuidle.h | 21 ++++++++
> kernel/sched/idle.c | 14 ++++-
> kernel/sysctl.c | 12 +++++
> kernel/time/tick-sched.c | 7 +++
> 7 files changed, 209 insertions(+), 24 deletions(-)
>



2017-10-17 15:04:38

by Li, Aubrey

Subject: Re: [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality

On 2017/10/17 8:07, Rafael J. Wysocki wrote:
> On Monday, October 16, 2017 9:44:41 AM CEST Li, Aubrey wrote:
>>
>> Or you concern why the threshold can't simply be tick interval?
>
> That I guess.
>
>> For the latter, if the threshold is close/equal to the tick, it's quite possible
>> the next event is the tick and no other else event.
>
> Well, I don't quite get that.
>
> What's the reasoning here?

If we set the threshold to the tick interval, then once the system keeps
the tick on, it will never get a chance to turn it off again.

Because we keep the tick on and count in the time of idle entry and idle
exit, the idle interval is always less than the tick interval, which
means every idle is classified as a short idle.

Thanks,
-Aubrey



2017-10-17 08:10:23

by Rafael J. Wysocki

Subject: Re: [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality

On Monday, October 16, 2017 9:44:41 AM CEST Li, Aubrey wrote:
> On 2017/10/14 9:14, Rafael J. Wysocki wrote:
> > On Saturday, September 30, 2017 9:20:26 AM CEST Aubrey Li wrote:
> >> We found under some latency intensive workloads, short idle periods occurs
> >> very common, then idle entry and exit path starts to dominate, so it's
> >> important to optimize them. To determine the short idle pattern, we need
> >> to figure out how long of the coming idle and the threshold of the short
> >> idle interval.
> >>
> >> A cpu idle prediction functionality is introduced in this proposal to catch
> >> the short idle pattern.
> >>
> >> Firstly, we check the IRQ timings subsystem, if there is an event
> >> coming soon.
> >> -- https://lwn.net/Articles/691297/
> >>
> >> Secondly, we check the idle statistics of scheduler, if it's likely we'll
> >> go into a short idle.
> >> -- https://patchwork.kernel.org/patch/2839221/
> >>
> >> Thirdly, we predict the next idle interval by using the prediction
> >> fucntionality in the idle governor if it has.
> >>
> >> For the threshold of the short idle interval, we record the timestamps of
> >> the idle entry, and multiply by a tunable parameter at here:
> >> -- /proc/sys/kernel/fast_idle_ratio
> >>
> >> We use the output of the idle prediction to skip turning tick off if a
> >> short idle is determined in this proposal. Reprogramming hardware timer
> >> twice(off and on) is expensive for a very short idle. There are some
> >> potential optimizations can be done according to the same indicator.
> >>
> >> I observed when system is idle, the idle predictor reports 20/s long idle
> >> and ZERO fast idle on one CPU. And when the workload is running, the idle
> >> predictor reports 72899/s fast idle and ZERO long idle on the same CPU.
> >>
> >> Aubrey Li (8):
> >> cpuidle: menu: extract prediction functionality
> >> cpuidle: record the overhead of idle entry
> >> cpuidle: add a new predict interface
> >> tick/nohz: keep tick on for a fast idle
> >> timers: keep sleep length updated as needed
> >> cpuidle: make fast idle threshold tunable
> >> cpuidle: introduce irq timing to make idle prediction
> >> cpuidle: introduce run queue average idle to make idle prediction
> >>
> >> drivers/cpuidle/Kconfig | 1 +
> >> drivers/cpuidle/cpuidle.c | 109 +++++++++++++++++++++++++++++++++++++++
> >> drivers/cpuidle/governors/menu.c | 69 ++++++++++++++++---------
> >> include/linux/cpuidle.h | 21 ++++++++
> >> kernel/sched/idle.c | 14 ++++-
> >> kernel/sysctl.c | 12 +++++
> >> kernel/time/tick-sched.c | 7 +++
> >> 7 files changed, 209 insertions(+), 24 deletions(-)
> >>
> >
> > Overall, it looks like you could avoid stopping the tick every time the
> > predicted idle duration is not longer than the tick interval in the first
> > place.
> >
> > Why don't you do that?
>
> I didn't catch this.
>
> Are you suggesting?
>
> if(!cpu_stat.fast_idle)
> tick_nohz_idle_enter()
>
> Or you concern why the threshold can't simply be tick interval?

That I guess.

> For the first, can_stop_idle_tick() is a better place to skip tick-off IMHO.
> For the latter, if the threshold is close/equal to the tick, it's quite possible
> that the next event is the tick and no other event.

Well, I don't quite get that.

What's the reasoning here?

Thanks,
Rafael



2017-10-16 07:45:21

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality

On 2017/10/14 9:14, Rafael J. Wysocki wrote:
> On Saturday, September 30, 2017 9:20:26 AM CEST Aubrey Li wrote:
>> We found that under some latency intensive workloads, short idle periods
>> occur very commonly, so the idle entry and exit paths start to dominate and
>> it's important to optimize them. To determine the short idle pattern, we need
>> to figure out the length of the coming idle and the threshold of the short
>> idle interval.
>>
>> A cpu idle prediction functionality is introduced in this proposal to catch
>> the short idle pattern.
>>
>> Firstly, we check the IRQ timings subsystem to see if there is an event
>> coming soon.
>> -- https://lwn.net/Articles/691297/
>>
>> Secondly, we check the scheduler's idle statistics to see if it's likely
>> we'll go into a short idle.
>> -- https://patchwork.kernel.org/patch/2839221/
>>
>> Thirdly, we predict the next idle interval by using the prediction
>> functionality in the idle governor, if it has one.
>>
>> For the threshold of the short idle interval, we record the timestamps of
>> the idle entry, and multiply by a tunable parameter here:
>> -- /proc/sys/kernel/fast_idle_ratio
>>
>> We use the output of the idle prediction to skip turning the tick off if a
>> short idle is determined in this proposal. Reprogramming the hardware timer
>> twice (off and on) is expensive for a very short idle. There are some
>> potential optimizations that can be done based on the same indicator.
>>
>> I observed that when the system is idle, the idle predictor reports 20/s long
>> idle and ZERO fast idle on one CPU. And when the workload is running, the
>> idle predictor reports 72899/s fast idle and ZERO long idle on the same CPU.
>>
>> Aubrey Li (8):
>> cpuidle: menu: extract prediction functionality
>> cpuidle: record the overhead of idle entry
>> cpuidle: add a new predict interface
>> tick/nohz: keep tick on for a fast idle
>> timers: keep sleep length updated as needed
>> cpuidle: make fast idle threshold tunable
>> cpuidle: introduce irq timing to make idle prediction
>> cpuidle: introduce run queue average idle to make idle prediction
>>
>> drivers/cpuidle/Kconfig | 1 +
>> drivers/cpuidle/cpuidle.c | 109 +++++++++++++++++++++++++++++++++++++++
>> drivers/cpuidle/governors/menu.c | 69 ++++++++++++++++---------
>> include/linux/cpuidle.h | 21 ++++++++
>> kernel/sched/idle.c | 14 ++++-
>> kernel/sysctl.c | 12 +++++
>> kernel/time/tick-sched.c | 7 +++
>> 7 files changed, 209 insertions(+), 24 deletions(-)
>>
>
> Overall, it looks like you could avoid stopping the tick every time the
> predicted idle duration is not longer than the tick interval in the first
> place.
>
> Why don't you do that?

I didn't catch this.

Are you suggesting:

if (!cpu_stat.fast_idle)
        tick_nohz_idle_enter();

Or is your concern why the threshold can't simply be the tick interval?

For the first, can_stop_idle_tick() is a better place to skip tick-off IMHO.
For the latter, if the threshold is close/equal to the tick, it's quite possible
that the next event is the tick and no other event.

Thanks,
-Aubrey


2017-10-14 01:24:40

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality

On Saturday, September 30, 2017 9:20:26 AM CEST Aubrey Li wrote:
> We found that under some latency intensive workloads, short idle periods
> occur very commonly, so the idle entry and exit paths start to dominate and
> it's important to optimize them. To determine the short idle pattern, we need
> to figure out the length of the coming idle and the threshold of the short
> idle interval.
>
> A cpu idle prediction functionality is introduced in this proposal to catch
> the short idle pattern.
>
> Firstly, we check the IRQ timings subsystem to see if there is an event
> coming soon.
> -- https://lwn.net/Articles/691297/
>
> Secondly, we check the scheduler's idle statistics to see if it's likely
> we'll go into a short idle.
> -- https://patchwork.kernel.org/patch/2839221/
>
> Thirdly, we predict the next idle interval by using the prediction
> functionality in the idle governor, if it has one.
>
> For the threshold of the short idle interval, we record the timestamps of
> the idle entry, and multiply by a tunable parameter here:
> -- /proc/sys/kernel/fast_idle_ratio
>
> We use the output of the idle prediction to skip turning the tick off if a
> short idle is determined in this proposal. Reprogramming the hardware timer
> twice (off and on) is expensive for a very short idle. There are some
> potential optimizations that can be done based on the same indicator.
>
> I observed that when the system is idle, the idle predictor reports 20/s long
> idle and ZERO fast idle on one CPU. And when the workload is running, the
> idle predictor reports 72899/s fast idle and ZERO long idle on the same CPU.
>
> Aubrey Li (8):
> cpuidle: menu: extract prediction functionality
> cpuidle: record the overhead of idle entry
> cpuidle: add a new predict interface
> tick/nohz: keep tick on for a fast idle
> timers: keep sleep length updated as needed
> cpuidle: make fast idle threshold tunable
> cpuidle: introduce irq timing to make idle prediction
> cpuidle: introduce run queue average idle to make idle prediction
>
> drivers/cpuidle/Kconfig | 1 +
> drivers/cpuidle/cpuidle.c | 109 +++++++++++++++++++++++++++++++++++++++
> drivers/cpuidle/governors/menu.c | 69 ++++++++++++++++---------
> include/linux/cpuidle.h | 21 ++++++++
> kernel/sched/idle.c | 14 ++++-
> kernel/sysctl.c | 12 +++++
> kernel/time/tick-sched.c | 7 +++
> 7 files changed, 209 insertions(+), 24 deletions(-)
>

Overall, it looks like you could avoid stopping the tick every time the
predicted idle duration is not longer than the tick interval in the first
place.

Why don't you do that?

