We are entering an increasingly power and thermal constrained
environment. Often we have more horsepower than we can use due to
these limits, yet we still demand peak performance when it is needed.
That reserved performance headroom does not come for free. To conserve
energy, more and more SoC blocks can be power gated at runtime.
However, randomly scheduled idle time on individual CPUs may not
result in good power savings, because shared circuits such as the
memory controller can only enter a low power state when all cores are
idle at the same time. Frequency-voltage scaling is a good solution,
but it is only efficient within a certain range.
In general, only synchronized idle allows the SoC to enter its
deepest power states. On most modern processors, deep idle power is
nearly negligible compared to peak running power. This implies that if
we duty cycle the CPUs between running and the deepest idle state, we
can scale performance and power almost linearly. Combined with the
most efficient frequency point, idle injection presents a way to cap
power efficiently.
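As a rough, made-up illustration of that linear scaling (numbers are
not measurements): with deep idle power close to zero, the average
power simply tracks the running fraction of the duty cycle, e.g.:

/*
 * Hypothetical sketch, not part of the series: with deep idle power
 * assumed to be ~0, average power follows the running fraction of the
 * duty cycle, and throughput scales roughly the same way.
 */
static unsigned int avg_power_mw(unsigned int run_power_mw,
				 unsigned int idle_pct)
{
	/* e.g. 10000 mW running power at 40% idle -> ~6000 mW average */
	return run_power_mw * (100 - idle_pct) / 100;
}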
The Intel powerclamp driver was introduced a while ago to address
this problem, but it is broken in that it cannot turn off the idle
ticks during the forced idle period. https://lkml.org/lkml/2014/12/18/369
It was suggested to replace the current kthread play-idle loop with a
timer based runqueue throttling scheme. I finally got around to
implementing this, and the code is much simpler and more effective.
Test results were presented at LinuxCon; the data and graphs can be
seen in slides #18 and later.
http://events.linuxfoundation.org/sites/events/files/slides/LinuxCon_Japan_2015_idle_injection1_0.pdf
RFC discussions are here.
https://lkml.org/lkml/2015/11/2/756
Thanks,
Jacob
Jacob Pan (4):
ktime: add a roundup function
timer: relax tick stop in idle entry
sched: introduce synchronized idle injection
sched: add trace event for idle injection
include/linux/ktime.h | 10 ++
include/linux/sched.h | 16 ++
include/linux/sched/sysctl.h | 5 +
include/trace/events/sched.h | 25 +++
init/Kconfig | 10 ++
kernel/sched/fair.c | 356 ++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 54 ++++++-
kernel/sysctl.c | 21 +++
kernel/time/tick-sched.c | 2 +-
9 files changed, 493 insertions(+), 6 deletions(-)
--
1.9.1
The ktime roundup function can be used to keep timers aligned and
prevent drift for recurring timeouts.
Signed-off-by: Jacob Pan <[email protected]>
---
include/linux/ktime.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/include/linux/ktime.h b/include/linux/ktime.h
index 2b6a204..2e293fa 100644
--- a/include/linux/ktime.h
+++ b/include/linux/ktime.h
@@ -233,6 +233,16 @@ static inline ktime_t ktime_sub_us(const ktime_t kt, const u64 usec)
extern ktime_t ktime_add_safe(const ktime_t lhs, const ktime_t rhs);
+static inline ktime_t ktime_roundup(ktime_t x, ktime_t y)
+{
+ u64 temp_tv64;
+
+ temp_tv64 = x.tv64 + y.tv64 - 1;
+ temp_tv64 = div64_u64(temp_tv64, y.tv64);
+ x.tv64 = temp_tv64 * y.tv64;
+
+ return x;
+}
/**
* ktime_to_timespec_cond - convert a ktime_t variable to timespec
* format only if the variable contains data
--
1.9.1
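As an illustration of the intended use (a hypothetical caller, not part
of this patch), the helper can re-arm a recurring hrtimer on a fixed
grid so that handler latency does not accumulate into drift, mirroring
how the idle injection timer in patch 3 re-arms itself:

/*
 * Hypothetical usage sketch: keep a periodic hrtimer aligned to a grid
 * anchored at start_time.
 */
static void rearm_aligned(struct hrtimer *hrt, ktime_t start_time,
			  ktime_t period)
{
	ktime_t delta = ktime_sub(hrtimer_cb_get_time(hrt), start_time);

	/* round the elapsed time up to the next period boundary */
	delta = ktime_roundup(delta, period);
	hrtimer_set_expires(hrt, ktime_add(start_time, delta));
}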
Upon entering idle, we can turn off the tick if the next timeout is
exactly one tick away. Otherwise, we could enter the inner idle loop
with the tick still enabled; without resched set, the tick keeps
firing during idle, which is less optimal in terms of energy savings.
Signed-off-by: Jacob Pan <[email protected]>
---
kernel/time/tick-sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 7c7ec45..bcadaab 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -606,7 +606,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
* restart it proper.
*/
delta = next_tick - basemono;
- if (delta <= (u64)TICK_NSEC) {
+ if (delta < (u64)TICK_NSEC) {
tick.tv64 = 0;
if (!ts->tick_stopped)
goto out;
--
1.9.1
With an increasingly constrained power and thermal budget, it is often
necessary to cap power via throttling. Throttling individual CPUs
or devices at random times can help with power capping, but may not be
optimal in terms of energy efficiency. Frequency scaling is likewise
only efficient within a certain range.
In general, the optimal solution in terms of energy efficiency is
to align idle periods so that more shared circuits can be power
gated and enter lower power states. Combined with an energy efficient
frequency point, idle injection provides a way to scale power and
performance efficiently.
This patch introduces a scheduler based idle injection method. It
works by blocking the CFS runqueues synchronously and periodically;
the actions on all online CPUs are orchestrated by per-CPU hrtimers.
Two sysctl knobs are exposed to userspace for selecting the
percentage of idle time as well as the forced idle duration of each
injected idle period.
Since only the CFS class is targeted, higher priority activity is not
affected: deadline (EDF) and RT tasks as well as softirqs and
interrupts keep running.
The hot path in CFS pick_next_task is optimized by Peter Zijlstra: a
new runnable flag is introduced to combine forced idle and
nr_running.
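For reference, the relationship between the two knobs can be sketched
as follows (illustration only, mirroring the interval computation done
later in the patch; the helper name is made up):

/*
 * The forced idle duration is fixed per period and the run interval
 * is derived from the requested idle percentage.
 * e.g. duration = 5 ticks, idle_pct = 25 -> run for 15 ticks,
 * 20 ticks per injection period.
 */
static unsigned int run_interval_ticks(unsigned int duration_ticks,
				       unsigned int idle_pct)
{
	return idle_pct ? duration_ticks * (100 - idle_pct) / idle_pct : 0;
}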
Signed-off-by: Jacob Pan <[email protected]>
---
include/linux/sched.h | 11 ++
include/linux/sched/sysctl.h | 5 +
init/Kconfig | 10 ++
kernel/sched/fair.c | 353 ++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 54 ++++++-
kernel/sysctl.c | 21 +++
6 files changed, 449 insertions(+), 5 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b7b9501..ff551a3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3180,4 +3180,15 @@ static inline unsigned long rlimit_max(unsigned int limit)
return task_rlimit_max(current, limit);
}
+#ifdef CONFIG_CFS_IDLE_INJECT
+extern int proc_sched_cfs_idle_inject_pct_handler(struct ctl_table *table,
+ int write,
+ void __user *buffer,
+ size_t *length, loff_t *ppos);
+extern int proc_sched_cfs_idle_inject_duration_handler(struct ctl_table *table,
+ int write,
+ void __user *buffer,
+ size_t *length, loff_t *ppos);
+#endif
+
#endif
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index c9e4731..d32da45 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -81,6 +81,11 @@ extern unsigned int sysctl_sched_cfs_bandwidth_slice;
extern unsigned int sysctl_sched_autogroup_enabled;
#endif
+#ifdef CONFIG_CFS_IDLE_INJECT
+extern unsigned int sysctl_sched_cfs_idle_inject_pct;
+extern unsigned int sysctl_sched_cfs_idle_inject_duration;
+#endif
+
extern int sched_rr_timeslice;
extern int sched_rr_handler(struct ctl_table *table, int write,
diff --git a/init/Kconfig b/init/Kconfig
index c24b6f7..4041c94 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1086,6 +1086,16 @@ menuconfig CGROUP_SCHED
bandwidth allocation to such task groups. It uses cgroups to group
tasks.
+config CFS_IDLE_INJECT
+ bool "Synchronized CFS idle injection"
+ depends on NO_HZ_IDLE && HIGH_RES_TIMERS
+ default n
+ help
+ This feature lets the scheduler inject synchronized idle time across all online
+ CPUs. Idle injection affects normal tasks only and yields to RT tasks and
+ interrupts. Effectively, CPUs can be duty cycled between running at the most
+ power efficient performance state and deep idle states.
+
if CGROUP_SCHED
config FAIR_GROUP_SCHED
bool "Group scheduling for SCHED_OTHER"
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a5e60f..a0cd777 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -30,6 +30,7 @@
#include <linux/mempolicy.h>
#include <linux/migrate.h>
#include <linux/task_work.h>
+#include <linux/suspend.h>
#include <trace/events/sched.h>
@@ -114,6 +115,13 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
#endif
+#ifdef CONFIG_CFS_IDLE_INJECT
+/* Percentage of forced idle time for all online CPUs */
+unsigned int sysctl_sched_cfs_idle_inject_pct;
+/* Duration of idle time in ticks of each injection period */
+unsigned int sysctl_sched_cfs_idle_inject_duration = 5UL;
+#endif
+
static inline void update_load_add(struct load_weight *lw, unsigned long inc)
{
lw->weight += inc;
@@ -2334,7 +2342,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
list_add(&se->group_node, &rq->cfs_tasks);
}
#endif
- cfs_rq->nr_running++;
+ cfs_rq_nr_running_inc(cfs_rq);
}
static void
@@ -2347,7 +2355,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
account_numa_dequeue(rq_of(cfs_rq), task_of(se));
list_del_init(&se->group_node);
}
- cfs_rq->nr_running--;
+ cfs_rq_nr_running_dec(cfs_rq);
}
#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -5139,7 +5147,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
again:
#ifdef CONFIG_FAIR_GROUP_SCHED
- if (!cfs_rq->nr_running)
+ if (!cfs_rq_runnable(cfs_rq))
goto idle;
if (prev->sched_class != &fair_sched_class)
@@ -5218,7 +5226,7 @@ simple:
cfs_rq = &rq->cfs;
#endif
- if (!cfs_rq->nr_running)
+ if (!cfs_rq_runnable(cfs_rq))
goto idle;
put_prev_task(rq, prev);
@@ -5237,6 +5245,13 @@ simple:
return p;
idle:
+ if (in_forced_idle(cfs_rq)) {
+ if (unlikely(local_softirq_pending())) {
+ __unthrottle_cfs_rq(cfs_rq);
+ goto again;
+ }
+ return NULL;
+ }
/*
* This is OK, because current is on_cpu, which avoids it being picked
* for load-balance and preemption/IRQs are still disabled avoiding
@@ -8318,3 +8333,333 @@ __init void init_sched_fair_class(void)
#endif /* SMP */
}
+
+#ifdef CONFIG_CFS_IDLE_INJECT
+static DEFINE_PER_CPU(struct hrtimer, idle_inject_timer);
+static DEFINE_PER_CPU(int, idle_injected);
+/* protect injection parameters from runtime changes */
+static DEFINE_SPINLOCK(idle_inject_lock);
+
+/* idle injection duration in ticks, so that it scales across different HZ values */
+static unsigned int duration;
+static ktime_t duration_ktime;
+static ktime_t inject_interval_ktime;
+static ktime_t inject_period_ktime;
+static unsigned int idle_pct; /* percentage of time idle is forced */
+/* starting reference time for all CPUs to align idle period */
+static ktime_t inject_start_time;
+static int prepare_idle_inject(void);
+
+static void throttle_rq(int cpu)
+{
+ unsigned int resched = 0;
+ unsigned long flags;
+ struct rq *rq = cpu_rq(cpu);
+
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ rq->cfs.forced_idle = true;
+ resched = rq->cfs.runnable;
+ rq->cfs.runnable = false;
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+ if (resched)
+ resched_cpu(cpu);
+}
+
+static void unthrottle_rq(int cpu)
+{
+ unsigned int resched = 0;
+ unsigned long flags;
+ struct rq *rq = cpu_rq(cpu);
+
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ rq->cfs.forced_idle = false;
+ resched = rq->cfs.runnable = !!rq->cfs.nr_running;
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+ if (resched)
+ resched_cpu(cpu);
+}
+
+static enum hrtimer_restart idle_inject_timer_fn(struct hrtimer *hrtimer)
+{
+ int cpu = smp_processor_id();
+ struct hrtimer *hrt = this_cpu_ptr(&idle_inject_timer);
+ ktime_t now, delta;
+ int status;
+
+ status = raw_cpu_read(idle_injected);
+
+ now = hrtimer_cb_get_time(hrt);
+
+ if (status) {
+ /*
+ * We were injecting idle in the last phase, let's forward the
+ * timer to the next period
+ *
+ * status: 1 0 1 0
+ * ____ ____________________ _______
+ * |________| |_________|
+ *
+ * |duration| interval |
+ *
+ * ^ we are here
+ * forward to here: ^
+ *
+ * first compute the distance from the common starting point,
+ * then round up to the next period, this will automatically
+ * pick up any runtime changes done by sysctl interface or
+ * when a new cpu goes online.
+ */
+
+ delta = ktime_sub(now, inject_start_time);
+ delta = ktime_roundup(delta, inject_period_ktime);
+ hrtimer_set_expires(hrt, ktime_add(delta, inject_start_time));
+ unthrottle_rq(cpu);
+
+ } else {
+ /*
+ * We were not injecting idle in the last phase, let's forward
+ * timer after forced idle duration
+ * ____ ____________________ _______
+ * |________| |_________|
+ *
+ * |duration| interval |
+ *
+ * ^ we are here
+ * ^ forward timer to here
+ */
+ hrtimer_set_expires(hrt, ktime_add(duration_ktime, now));
+ throttle_rq(cpu);
+ }
+ raw_cpu_write(idle_injected, !status);
+
+ return HRTIMER_RESTART;
+}
+
+static void idle_inject_timer_start(void *info)
+{
+ struct hrtimer *hrt = this_cpu_ptr(&idle_inject_timer);
+
+ this_cpu_write(idle_injected, 1);
+ hrtimer_set_expires(hrt, *(ktime_t *)info);
+ hrtimer_start_expires(hrt, HRTIMER_MODE_ABS_PINNED);
+}
+
+static int start_idle_inject(void)
+{
+ ktime_t now = ktime_get();
+
+ /* prevent cpu hotplug */
+ get_online_cpus();
+ /* set a future time to let all per-cpu timers expire at the same time */
+ now = ktime_roundup(now, duration_ktime);
+
+ /* start one timer per online cpu */
+ inject_start_time = now;
+ on_each_cpu(idle_inject_timer_start, &now, 1);
+
+ put_online_cpus();
+
+ return 0;
+}
+
+static void stop_idle_inject(void)
+{
+ struct hrtimer *hrt;
+ unsigned int cpu;
+
+ get_online_cpus();
+ for_each_online_cpu(cpu) {
+ hrt = &per_cpu(idle_inject_timer, cpu);
+ hrtimer_cancel(hrt);
+ unthrottle_rq(cpu);
+ }
+ put_online_cpus();
+}
+
+static int idle_inject_cpu_callback(struct notifier_block *nfb,
+ unsigned long action, void *hcpu)
+{
+ unsigned long cpu = (unsigned long)hcpu;
+ struct hrtimer *hrt = &per_cpu(idle_inject_timer, cpu);
+ ktime_t now, delta;
+
+ switch (action) {
+ case CPU_STARTING:
+ raw_cpu_write(idle_injected, 1);
+
+ hrtimer_init(hrt, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+ hrt->function = idle_inject_timer_fn;
+ now = hrtimer_cb_get_time(hrt);
+ /*
+ * When a new CPU comes online, we need to make sure it aligns
+ * its phase with the rest of the CPUs. So we set the
+ * timer to the next period based on the common starting time,
+ * then start injecting idle time.
+ */
+ delta = ktime_sub(now, inject_start_time);
+ delta = ktime_roundup(delta, inject_period_ktime);
+ hrtimer_set_expires(hrt, ktime_add(delta, inject_start_time));
+ hrtimer_start_expires(hrt, HRTIMER_MODE_ABS_PINNED);
+ break;
+ case CPU_DYING:
+ hrtimer_cancel(hrt);
+ raw_cpu_write(idle_injected, 0);
+ unthrottle_rq(cpu);
+ break;
+ default:
+ return NOTIFY_DONE;
+ }
+
+ return NOTIFY_OK;
+}
+
+static int idle_inject_pm_callback(struct notifier_block *self,
+ unsigned long action, void *hcpu)
+{
+ switch (action) {
+ case PM_HIBERNATION_PREPARE:
+ case PM_SUSPEND_PREPARE:
+ stop_idle_inject();
+ break;
+ case PM_POST_HIBERNATION:
+ case PM_POST_SUSPEND:
+ /* resume from suspend */
+ start_idle_inject();
+ break;
+ default:
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block idle_inject_pm_notifier = {
+ .notifier_call = idle_inject_pm_callback,
+};
+
+static struct notifier_block idle_inject_cpu_notifier = {
+ .notifier_call = idle_inject_cpu_callback,
+};
+
+static void end_idle_inject(void)
+{
+ unregister_hotcpu_notifier(&idle_inject_cpu_notifier);
+ unregister_pm_notifier(&idle_inject_pm_notifier);
+}
+
+static int prepare_idle_inject(void)
+{
+ int retval = 0;
+ int cpu;
+ struct hrtimer *hrt;
+
+ retval = register_pm_notifier(&idle_inject_pm_notifier);
+ if (retval)
+ goto exit;
+ retval = register_hotcpu_notifier(&idle_inject_cpu_notifier);
+ if (retval)
+ goto exit_unregister_pm;
+ get_online_cpus();
+ for_each_online_cpu(cpu) {
+ hrt = &per_cpu(idle_inject_timer, cpu);
+ hrtimer_init(hrt, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+ hrt->function = idle_inject_timer_fn;
+ }
+ put_online_cpus();
+
+ return 0;
+exit_unregister_pm:
+ unregister_pm_notifier(&idle_inject_pm_notifier);
+exit:
+ return retval;
+}
+
+/*
+ * Must be called before idle injection starts and every time injection
+ * parameters change.
+ */
+static void compute_idle_inject_params(void)
+{
+ unsigned int inject_interval_msec, duration_msec;
+
+ /*
+ * duration is fixed for each injection period, we adjust
+ * non idle interval to satisfy the idle percentage set
+ * by the user. e.g. if duration is 10 and we want 33% idle
+ * then interval is 20. interval becomes 10 if 50% idle is set.
+ *
+ * e.g 33% and 50% idle
+ * ____ ___________________ _________
+ * |________| |________| 33% idle
+ * ____ ________ _______
+ * |________| |________| 50% idle
+ *
+ * |duration|interval|
+ */
+ spin_lock(&idle_inject_lock);
+ duration = sysctl_sched_cfs_idle_inject_duration;
+ idle_pct = sysctl_sched_cfs_idle_inject_pct;
+
+ duration_msec = jiffies_to_msecs(sysctl_sched_cfs_idle_inject_duration);
+ duration_ktime = ms_to_ktime(duration_msec);
+
+ if (idle_pct) {
+ inject_interval_msec = (duration_msec * (100 - idle_pct)) / idle_pct;
+ inject_interval_ktime = ms_to_ktime(inject_interval_msec);
+ /*
+ * precompute period here so we don't have to do that for every
+ * timer interrupt.
+ */
+ inject_period_ktime = ktime_add(inject_interval_ktime, duration_ktime);
+ }
+
+ spin_unlock(&idle_inject_lock);
+
+}
+
+int proc_sched_cfs_idle_inject_pct_handler(struct ctl_table *table,
+ int write,
+ void __user *buffer,
+ size_t *length, loff_t *ppos)
+{
+ int ret, need_to_start;
+
+ ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ if (ret)
+ goto out;
+
+ if (idle_pct != sysctl_sched_cfs_idle_inject_pct) {
+ need_to_start = !idle_pct;
+ compute_idle_inject_params();
+ if (need_to_start) {
+ ret = prepare_idle_inject();
+ if (ret)
+ goto out;
+ start_idle_inject();
+ } else if (!sysctl_sched_cfs_idle_inject_pct) {
+ stop_idle_inject();
+ end_idle_inject();
+ }
+ }
+out:
+ return ret;
+}
+
+int proc_sched_cfs_idle_inject_duration_handler(struct ctl_table *table,
+ int write,
+ void __user *buffer,
+ size_t *length, loff_t *ppos)
+{
+ int ret;
+
+ ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ if (ret)
+ goto out;
+
+ if (duration != sysctl_sched_cfs_idle_inject_duration)
+ compute_idle_inject_params();
+out:
+ return ret;
+}
+
+#endif /* CONFIG_CFS_IDLE_INJECT */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6d2a119..d50beb8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -343,7 +343,9 @@ struct cfs_bandwidth { };
struct cfs_rq {
struct load_weight load;
unsigned int nr_running, h_nr_running;
-
+#ifdef CONFIG_CFS_IDLE_INJECT
+ unsigned int runnable, forced_idle;
+#endif
u64 exec_clock;
u64 min_vruntime;
#ifndef CONFIG_64BIT
@@ -419,6 +421,56 @@ struct cfs_rq {
#endif /* CONFIG_FAIR_GROUP_SCHED */
};
+#ifdef CONFIG_CFS_IDLE_INJECT
+static inline void cfs_rq_nr_running_inc(struct cfs_rq *cfs_rq)
+{
+ if (!cfs_rq->nr_running++ && !cfs_rq->forced_idle)
+ cfs_rq->runnable = true;
+}
+
+static inline void cfs_rq_nr_running_dec(struct cfs_rq *cfs_rq)
+{
+ if (!--cfs_rq->nr_running && !cfs_rq->forced_idle)
+ cfs_rq->runnable = false;
+}
+
+static inline bool cfs_rq_runnable(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->runnable;
+}
+
+static inline void __unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+ cfs_rq->forced_idle = false;
+ cfs_rq->runnable = cfs_rq->nr_running;
+}
+static inline bool in_forced_idle(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->forced_idle;
+}
+#else
+
+static inline void cfs_rq_nr_running_inc(struct cfs_rq *cfs_rq)
+{
+ cfs_rq->nr_running++;
+}
+
+static inline void cfs_rq_nr_running_dec(struct cfs_rq *cfs_rq)
+{
+ cfs_rq->nr_running--;
+
+}
+
+static inline bool cfs_rq_runnable(struct cfs_rq *cfs_rq)
+{
+ return !!cfs_rq->nr_running;
+}
+
+static inline void __unthrottle_cfs_rq(struct cfs_rq *cfs_rq) {}
+static inline bool in_forced_idle(struct cfs_rq *cfs_rq) { return false; }
+
+#endif /* CONFIG_CFS_IDLE_INJECT */
+
static inline int rt_bandwidth_enabled(void)
{
return sysctl_sched_rt_runtime >= 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e69201d..e7d39f7 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -124,6 +124,7 @@ static int __maybe_unused one = 1;
static int __maybe_unused two = 2;
static int __maybe_unused four = 4;
static unsigned long one_ul = 1;
+static int fifty = 50;
static int one_hundred = 100;
#ifdef CONFIG_PRINTK
static int ten_thousand = 10000;
@@ -433,6 +434,26 @@ static struct ctl_table kern_table[] = {
.extra1 = &one,
},
#endif
+#ifdef CONFIG_CFS_IDLE_INJECT
+ {
+ .procname = "sched_cfs_idle_inject_pct",
+ .data = &sysctl_sched_cfs_idle_inject_pct,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_sched_cfs_idle_inject_pct_handler,
+ .extra1 = &zero,
+ .extra2 = &fifty,
+ },
+ {
+ .procname = "sched_cfs_idle_inject_duration",
+ .data = &sysctl_sched_cfs_idle_inject_duration,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_sched_cfs_idle_inject_duration_handler,
+ .extra1 = &four,
+ .extra2 = &one_hundred,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",
--
1.9.1
Trace events for idle injection can be used to examine timer activity
and check synchronization. In addition, they help determine when the
runqueue is throttled.
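For illustration, a hedged sketch of an in-kernel consumer (not part of
this patch; the probe and init function names are hypothetical) that
attaches to the new tracepoint to watch throttle transitions:

#include <linux/init.h>
#include <linux/printk.h>
#include <trace/events/sched.h>

/* probe signature follows the TP_PROTO above, plus the data pointer */
static void idle_inject_probe(void *data, enum cfs_idle_inject_action action,
			      int throttled)
{
	pr_debug("idle inject action=%d throttled=%d\n", action, throttled);
}

static int __init idle_inject_watch_init(void)
{
	/* register_trace_<event> is generated by the TRACE_EVENT macro */
	return register_trace_sched_cfs_idle_inject(idle_inject_probe, NULL);
}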
Signed-off-by: Jacob Pan <[email protected]>
---
include/linux/sched.h | 5 +++++
include/trace/events/sched.h | 25 +++++++++++++++++++++++++
kernel/sched/fair.c | 3 +++
3 files changed, 33 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ff551a3..99c79bf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3189,6 +3189,11 @@ extern int proc_sched_cfs_idle_inject_duration_handler(struct ctl_table *table,
int write,
void __user *buffer,
size_t *length, loff_t *ppos);
+enum cfs_idle_inject_action {
+ CFS_IDLE_INJECT_TIMER, /* timer sync point */
+ CFS_IDLE_INJECT_FORCED, /* idle forced in rq */
+ CFS_IDLE_INJECT_YIELD_SOFTIRQ /* yield to pending softirq */
+};
#endif
#endif
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 539d6bc..52c11c1 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -566,6 +566,31 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
TP_printk("cpu=%d", __entry->cpu)
);
+
+#ifdef CONFIG_CFS_IDLE_INJECT
+/*
+ * Tracepoint for idle injection
+ */
+TRACE_EVENT(sched_cfs_idle_inject,
+
+ TP_PROTO(enum cfs_idle_inject_action action, int throttled),
+
+ TP_ARGS(action, throttled),
+
+ TP_STRUCT__entry(
+ __field(enum cfs_idle_inject_action, action)
+ __field(int, throttled)
+ ),
+
+ TP_fast_assign(
+ __entry->action = action;
+ __entry->throttled = throttled;
+ ),
+
+ TP_printk("action:%d throttled:%d", __entry->action, __entry->throttled)
+);
+#endif
+
#endif /* _TRACE_SCHED_H */
/* This part must be outside protection */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a0cd777..20027eb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5248,8 +5248,10 @@ idle:
if (in_forced_idle(cfs_rq)) {
if (unlikely(local_softirq_pending())) {
__unthrottle_cfs_rq(cfs_rq);
+ trace_sched_cfs_idle_inject(CFS_IDLE_INJECT_YIELD_SOFTIRQ, 1);
goto again;
}
+ trace_sched_cfs_idle_inject(CFS_IDLE_INJECT_FORCED, 1);
return NULL;
}
/*
@@ -8432,6 +8434,7 @@ static enum hrtimer_restart idle_inject_timer_fn(struct hrtimer *hrtimer)
throttle_rq(cpu);
}
raw_cpu_write(idle_injected, !status);
+ trace_sched_cfs_idle_inject(CFS_IDLE_INJECT_TIMER, !status);
return HRTIMER_RESTART;
}
--
1.9.1
Hi Jacob,
[auto build test ERROR on: tip/sched/core]
[also build test ERROR on: v4.3 next-20151113]
url: https://github.com/0day-ci/linux/commits/Jacob-Pan/CFS-idle-injection/20151114-035732
config: cris-etrax-100lx_v2_defconfig (attached as .config)
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=cris
All errors (new ones prefixed by >>):
kernel/sched/fair.c: In function 'pick_next_task_fair':
>> kernel/sched/fair.c:5316:4: error: implicit declaration of function 'trace_sched_cfs_idle_inject' [-Werror=implicit-function-declaration]
>> kernel/sched/fair.c:5316:32: error: 'CFS_IDLE_INJECT_YIELD_SOFTIRQ' undeclared (first use in this function)
kernel/sched/fair.c:5316:32: note: each undeclared identifier is reported only once for each function it appears in
>> kernel/sched/fair.c:5319:31: error: 'CFS_IDLE_INJECT_FORCED' undeclared (first use in this function)
cc1: some warnings being treated as errors
vim +/trace_sched_cfs_idle_inject +5316 kernel/sched/fair.c
5310 return p;
5311
5312 idle:
5313 if (in_forced_idle(cfs_rq)) {
5314 if (unlikely(local_softirq_pending())) {
5315 __unthrottle_cfs_rq(cfs_rq);
> 5316 trace_sched_cfs_idle_inject(CFS_IDLE_INJECT_YIELD_SOFTIRQ, 1);
5317 goto again;
5318 }
> 5319 trace_sched_cfs_idle_inject(CFS_IDLE_INJECT_FORCED, 1);
5320 return NULL;
5321 }
5322 /*
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On Fri, Nov 13, 2015 at 11:53 AM, Jacob Pan
<[email protected]> wrote:
> ktime roundup function can be used to keep timer aligned and
> prevent drift for recurring timeouts.
>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> include/linux/ktime.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/include/linux/ktime.h b/include/linux/ktime.h
> index 2b6a204..2e293fa 100644
> --- a/include/linux/ktime.h
> +++ b/include/linux/ktime.h
> @@ -233,6 +233,16 @@ static inline ktime_t ktime_sub_us(const ktime_t kt, const u64 usec)
>
> extern ktime_t ktime_add_safe(const ktime_t lhs, const ktime_t rhs);
>
> +static inline ktime_t ktime_roundup(ktime_t x, ktime_t y)
> +{
> + u64 temp_tv64;
> +
> + temp_tv64 = x.tv64 + y.tv64 - 1;
> + temp_tv64 = div64_u64(temp_tv64, y.tv64);
> + x.tv64 = temp_tv64 * y.tv64;
> +
> + return x;
> +}
Could you add a comment as to what the function does, and use some
better variable names here to make it more immediately obvious what is
being done here?
Something like:
/**
* ktime_roundup - Rounds value up to interval chunk
* @ value: Value to be rounded up
* @ interval: interval size to round up to
*
* Rounds a value up to the next higher multiple of an interval size
*/
static inline ktime ktime_roundup(ktime_t value, ktime_t interval)
thanks
-john
On Fri, 13 Nov 2015, Jacob Pan wrote:
> ktime roundup function can be used to keep timer aligned and
> prevent drift for recurring timeouts.
That tells me a use case, but not WHAT the function does and WHY we
need it.
>
> +static inline ktime_t ktime_roundup(ktime_t x, ktime_t y)
Kerneldoc comment of this function would be appreciated.
Thanks,
tglx
On Fri, 13 Nov 2015, Jacob Pan wrote:
> Upon entering idle, we can turn off tick if the next timeout
> is exactly one tick away. Otherwise, we could enter inner idle loop
> with tick still enabled, without resched set, the tick will continue
> during idle therefore less optimal in terms of energy savings.
This does not make any sense at all.
next_tick is the next required tick event. If it's exactly ONE tick
away why should we go through the hassle of stopping it? Just to
cancel the timer and then set it to the same value again? Oh well.
Thanks,
tglx
Hi Jacob,
[auto build test WARNING on: tip/sched/core]
[also build test WARNING on: v4.3 next-20151113]
url: https://github.com/0day-ci/linux/commits/Jacob-Pan/CFS-idle-injection/20151114-035732
config: i386-randconfig-s1-201545 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All warnings (new ones prefixed by >>):
>> kernel/sysctl.c:127:12: warning: 'fifty' defined but not used [-Wunused-variable]
static int fifty = 50;
^
vim +/fifty +127 kernel/sysctl.c
111 #ifndef CONFIG_MMU
112 extern int sysctl_nr_trim_pages;
113 #endif
114
115 /* Constants used for minimum and maximum */
116 #ifdef CONFIG_LOCKUP_DETECTOR
117 static int sixty = 60;
118 #endif
119
120 static int __maybe_unused neg_one = -1;
121
122 static int zero;
123 static int __maybe_unused one = 1;
124 static int __maybe_unused two = 2;
125 static int __maybe_unused four = 4;
126 static unsigned long one_ul = 1;
> 127 static int fifty = 50;
128 static int one_hundred = 100;
129 #ifdef CONFIG_PRINTK
130 static int ten_thousand = 10000;
131 #endif
132
133 /* this is needed for the proc_doulongvec_minmax of vm_dirty_bytes */
134 static unsigned long dirty_bytes_min = 2 * PAGE_SIZE;
135
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On Fri, 13 Nov 2015 15:22:16 -0500 (EST)
Thomas Gleixner <[email protected]> wrote:
>
>
> On Fri, 13 Nov 2015, Jacob Pan wrote:
>
> > Upon entering idle, we can turn off tick if the next timeout
> > is exactly one tick away. Otherwise, we could enter inner idle loop
> > with tick still enabled, without resched set, the tick will continue
> > during idle therefore less optimal in terms of energy savings.
>
> This does not make any sense at all.
>
> next_tick is the next required tick event. If it's exactly ONE tick
> away why should we go through the hassle of stopping it? Just to
> cancel the timer and then set it to the same value again? Oh well.
>
I have been trying to understand this code, please help. Here is my theory
and the ftrace of an injection period where tick did not stop.
(sorry about the long lines). My comments are after [JP]
cat-1993 [000] 30.093405: sched_cfs_idle_inject: action:0 throttled:1
[JP] injection timer expired, set forced idle flag, call scheduler
cat-1993 [000] 30.093406: hrtimer_expire_exit: hrtimer=0xffff88003de0cc20
cat-1993 [000] 30.093406: hrtimer_start: hrtimer=0xffff88003de0cc20 function=idle_inject_timer_fn/0x0 expires=29993055250 softexpires=29993055250
cat-1993 [000] 30.093407: hrtimer_cancel: hrtimer=0xffff88003dfce400
cat-1993 [000] 30.093407: hrtimer_expire_entry: hrtimer=0xffff88003dfce400 now=29988042960 function=tick_sched_timer/0x0
cat-1993 [000] 30.093407: function: tick_sched_timer
cat-1993 [000] 30.093422: function: tick_sched_do_timer
cat-1993 [000] 30.093422: function: tick_do_update_jiffies64
cat-1993 [000] 30.093433: function: tick_sched_handle.isra.15
cat-1993 [000] 30.093447: sched_stat_runtime: comm=cat pid=1993 runtime=1058498 [ns] vruntime=6695549826 [ns]
cat-1993 [000] 30.093449: hrtimer_expire_exit: hrtimer=0xffff88003dfce400
cat-1993 [000] 30.093449: hrtimer_start: hrtimer=0xffff88003dfce400 function=tick_sched_timer/0x0 expires=29989000000 softexpires=29989000000
cat-1993 [000] 30.093450: function: tick_program_event
cat-1993 [000] 30.093460: sched_waking: comm=rcu_preempt pid=7 prio=120 target_cpu=002
cat-1993 [000] 30.093461: sched_wake_idle_without_ipi: cpu=2
cat-1993 [000] 30.093463: sched_cfs_idle_inject: action:1 throttled:1
[JP] CFS pick_next_task_fair sees forced idle, picks no task to run.
cat-1993 [000] 30.093463: sched_stat_runtime: comm=cat pid=1993 runtime=16122 [ns] vruntime=6695565948 [ns]
cat-1993 [000] 30.093464: sched_switch: cat:1993 [120] R ==> swapper/0:0 [120]
<idle>-0 [000] 30.093465: function: tick_nohz_idle_enter
<idle>-0 [000] 30.093473: bprint: __tick_nohz_idle_enter: JPAN: __tick_nohz_idle_enter 803
<idle>-0 [000] 30.093473: bprint: __tick_nohz_idle_enter: JPAN: can_stop_idle_tick 743
[JP] can_stop_idle_tick() checks ok to stop tick
<idle>-0 [000] 30.093474: bprint: __tick_nohz_idle_enter: JPAN: tick_nohz_stop_sched_tick 609 delta 1000000
[JP] but sees delta is exactly 1 tick away. didn't stop tick.
<idle>-0 [000] 30.093475: function: tick_check_broadcast_expired
<idle>-0 [000] 30.094366: function: tick_irq_enter
<idle>-0 [000] 30.094367: function: tick_check_oneshot_broadcast_this_cpu
<idle>-0 [000] 30.094372: function: tick_nohz_stop_idle
<idle>-0 [000] 30.094387: hrtimer_cancel: hrtimer=0xffff88003dfce400
[JP] enter repeated tick sched in inner idle loop since !need_resched()
> Thanks,
>
> tglx
>
>
[Jacob Pan]
On Fri, 13 Nov 2015 12:11:01 -0800
John Stultz <[email protected]> wrote:
> Could you add a comment as to what the function does, and use some
> better variable names here to make it more immediately obvious what is
> being done here?
>
> Something like:
> /**
> * ktime_roundup - Rounds value up to interval chunk
> * @ value: Value to be rounded up
> * @ interval: interval size to round up to
> *
> * Rounds a value up to the next higher multiple of an interval size
> */
> static inline ktime ktime_roundup(ktime_t value, ktime_t interval)
will do. thank you for taking the time.
On Fri, 13 Nov 2015 15:13:45 -0500 (EST)
Thomas Gleixner <[email protected]> wrote:
> >
> > +static inline ktime_t ktime_roundup(ktime_t x, ktime_t y)
>
> Kerneldoc comment of this function would be appreciated.
will do. Plan to reuse John's comment.
Thanks,
Jacob
On Fri, 13 Nov 2015, Jacob Pan wrote:
> <idle>-0 [000] 30.093473: bprint: __tick_nohz_idle_enter: JPAN: __tick_nohz_idle_enter 803
> <idle>-0 [000] 30.093473: bprint: __tick_nohz_idle_enter: JPAN: can_stop_idle_tick 743
> [JP] can_stop_idle_tick() checks ok to stop tick
>
> <idle>-0 [000] 30.093474: bprint: __tick_nohz_idle_enter: JPAN: tick_nohz_stop_sched_tick 609 delta 1000000
> [JP] but sees delta is exactly 1 tick away. didn't stop tick.
If the delta is 1 tick then it is not supposed to stop it. Did you
ever try to figure out WHY it is 1 tick?
There are two code pathes which can set it to basemono + TICK_NSEC:
if (rcu_needs_cpu(basemono, &next_rcu) ||
arch_needs_cpu() || irq_work_needs_cpu()) {
next_tick = basemono + TICK_NSEC;
} else {
next_tmr = get_next_timer_interrupt(basejiff, basemono);
ts->next_timer = next_tmr;
/* Take the next rcu event into account */
next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
}
Can you please figure out WHY the tick is requested to continue
instead of blindly wreckaging the logic in that code?
Thanks,
tglx
On Mon, 16 Nov 2015 16:06:57 +0100 (CET)
Thomas Gleixner <[email protected]> wrote:
> > <idle>-0 [000] 30.093474: bprint:
> > __tick_nohz_idle_enter: JPAN: tick_nohz_stop_sched_tick 609 delta
> > 1000000 [JP] but sees delta is exactly 1 tick away. didn't stop
> > tick.
>
> If the delta is 1 tick then it is not supposed to stop it. Did you
> ever try to figure out WHY it is 1 tick?
>
> There are two code pathes which can set it to basemono + TICK_NSEC:
>
> if (rcu_needs_cpu(basemono, &next_rcu) ||
> arch_needs_cpu() || irq_work_needs_cpu()) {
> next_tick = basemono + TICK_NSEC;
> } else {
> next_tmr = get_next_timer_interrupt(basejiff,
> basemono); ts->next_timer = next_tmr;
> /* Take the next rcu event into account */
> next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
> }
>
> Can you please figure out WHY the tick is requested to continue
> instead of blindly wreckaging the logic in that code?
Looks like it hits both cases during forced idle.
+ Josh
+ Paul
For the first case, it is always related to RCU. I found there are two
CONFIG options that avoid this undesired tick in the idle loop:
1. enable CONFIG_RCU_NOCB_CPU_ALL (offload callbacks to the rcuo kthreads)
2. or enable CONFIG_RCU_FAST_NO_HZ (enter dyntick idle with rcu callbacks pending)
Either one works, but my concern is that users may not realize how these
intricate CONFIG_ options translate into energy savings.
Having consulted with Josh, it seems we could add a check here to recognize
the forced idle state and relax rcu_needs_cpu() to return false even when it
has callbacks. Since we are blocking everybody for a short time (5 ticks by
default), it should not impact synchronize and kfree rcu.
For the second case, which is much rarer, I think we really do have the
next timer exactly one tick away. I just don't know why the tick continues
into the idle loop.
On Mon, 16 Nov 2015, Jacob Pan wrote:
> For the second case, which is much more rare, I think we do have next
> timer exactly one tick away. Just don't know why tick will continue into
> idle loop.
Well, it should not be hard to figure that out. There are not so many
checks involved when tick_nohz_irq_exit() is called.
Thanks,
tglx
On Mon, Nov 16, 2015 at 01:51:26PM -0800, Jacob Pan wrote:
> On Mon, 16 Nov 2015 16:06:57 +0100 (CET)
> Thomas Gleixner <[email protected]> wrote:
>
> > > <idle>-0 [000] 30.093474: bprint:
> > > __tick_nohz_idle_enter: JPAN: tick_nohz_stop_sched_tick 609 delta
> > > 1000000 [JP] but sees delta is exactly 1 tick away. didn't stop
> > > tick.
> >
> > If the delta is 1 tick then it is not supposed to stop it. Did you
> > ever try to figure out WHY it is 1 tick?
> >
> > There are two code pathes which can set it to basemono + TICK_NSEC:
> >
> > if (rcu_needs_cpu(basemono, &next_rcu) ||
> > arch_needs_cpu() || irq_work_needs_cpu()) {
> > next_tick = basemono + TICK_NSEC;
> > } else {
> > next_tmr = get_next_timer_interrupt(basejiff,
> > basemono); ts->next_timer = next_tmr;
> > /* Take the next rcu event into account */
> > next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
> > }
> >
> > Can you please figure out WHY the tick is requested to continue
> > instead of blindly wreckaging the logic in that code?
>
> Looks like the it hits in both cases during forced idle.
> + Josh
> + Paul
>
> For the first case, it is always related to RCU. I found there are two
> CONFIG options to avoid this undesired tick in idle loop.
> 1. enable CONFIG_RCU_NOCB_CPU_ALL, offload to orcu kthreads
> 2. or enable CONFIG_RCU_FAST_NO_HZ (enter dytick idle w/ rcu callback)
>
> Either one works but my concern is that users may not realize the
> intricate CONFIG_ options and how they translate into energy savings.
> Consulted with Josh, it seems we could add a check here to recognize
> the forced idle state and relax rcu_needs_cpu() to return false even it
> has callbacks. Since we are blocking everybody for a short time (5 ticks
> default). It should not impact synchronize and kfree rcu.
Or we could just set things up so that whatever Kconfig you are using
to enable this state causes CONFIG_RCU_NOCB_CPU_ALL to also be enabled.
Or that causes CONFIG_RCU_FAST_NO_HZ to also be enabled, if that works
better for you.
Just out of curiosity, what is the purpose of the forced idle state?
Thermal control or some such?
Thanx, Paul
> For the second case, which is much more rare, I think we do have next
> timer exactly one tick away. Just don't know why tick will continue into
> idle loop.
>
On Mon, Nov 16, 2015 at 01:51:26PM -0800, Jacob Pan wrote:
> On Mon, 16 Nov 2015 16:06:57 +0100 (CET)
> Thomas Gleixner <[email protected]> wrote:
>
> > > <idle>-0 [000] 30.093474: bprint:
> > > __tick_nohz_idle_enter: JPAN: tick_nohz_stop_sched_tick 609 delta
> > > 1000000 [JP] but sees delta is exactly 1 tick away. didn't stop
> > > tick.
> >
> > If the delta is 1 tick then it is not supposed to stop it. Did you
> > ever try to figure out WHY it is 1 tick?
> >
> > There are two code pathes which can set it to basemono + TICK_NSEC:
> >
> > if (rcu_needs_cpu(basemono, &next_rcu) ||
> > arch_needs_cpu() || irq_work_needs_cpu()) {
> > next_tick = basemono + TICK_NSEC;
> > } else {
> > next_tmr = get_next_timer_interrupt(basejiff,
> > basemono); ts->next_timer = next_tmr;
> > /* Take the next rcu event into account */
> > next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
> > }
> >
> > Can you please figure out WHY the tick is requested to continue
> > instead of blindly wreckaging the logic in that code?
>
> Looks like the it hits in both cases during forced idle.
> + Josh
> + Paul
>
> For the first case, it is always related to RCU. I found there are two
> CONFIG options to avoid this undesired tick in idle loop.
> 1. enable CONFIG_RCU_NOCB_CPU_ALL, offload to orcu kthreads
> 2. or enable CONFIG_RCU_FAST_NO_HZ (enter dytick idle w/ rcu callback)
>
> Either one works but my concern is that users may not realize the
> intricate CONFIG_ options and how they translate into energy savings.
> Consulted with Josh, it seems we could add a check here to recognize
> the forced idle state and relax rcu_needs_cpu() to return false even it
> has callbacks. Since we are blocking everybody for a short time (5 ticks
> default). It should not impact synchronize and kfree rcu.
Right; as long as you're blocking *everybody*, and RCU priority boosting
doesn't come into play (meaning a real-time task is waiting on RCU
callbacks), then I don't see any harm in blocking RCU callbacks for a
while. You'd block completion of synchronize_rcu() and similar, as well
as memory reclamation, but since you've blocked *every* CPU systemwide
then that doesn't cause a problem.
On Mon, 16 Nov 2015 14:31:17 -0800
"Paul E. McKenney" <[email protected]> wrote:
> Just out of curiosity, what is the purpose of the forced idle state?
> Thermal control or some such?
Yes, for thermal control or power capping. I have some description in
the cover letter.
https://lwn.net/Articles/664405/
Thanks,
Jacob
On Mon, 16 Nov 2015 14:31:17 -0800
"Paul E. McKenney" <[email protected]> wrote:
> > Either one works but my concern is that users may not realize the
> > intricate CONFIG_ options and how they translate into energy
> > savings. Consulted with Josh, it seems we could add a check here to
> > recognize the forced idle state and relax rcu_needs_cpu() to return
> > false even it has callbacks. Since we are blocking everybody for a
> > short time (5 ticks default). It should not impact synchronize and
> > kfree rcu.
>
> Or we could just set things up so that whatever Kconfig you are using
> to enable this state causes CONFIG_RCU_NOCB_CPU_ALL to also be
> enabled. Or that causes CONFIG_RCU_FAST_NO_HZ to also be enabled, if
> that works better for you.
That would be great, we can work this out once the patch is
finalized. This is not a hard dependency in that it only affects the
efficiency of idle injection.
On Mon, Nov 16, 2015 at 02:32:11PM -0800, Josh Triplett wrote:
> On Mon, Nov 16, 2015 at 01:51:26PM -0800, Jacob Pan wrote:
> > On Mon, 16 Nov 2015 16:06:57 +0100 (CET)
> > Thomas Gleixner <[email protected]> wrote:
> >
> > > > <idle>-0 [000] 30.093474: bprint:
> > > > __tick_nohz_idle_enter: JPAN: tick_nohz_stop_sched_tick 609 delta
> > > > 1000000 [JP] but sees delta is exactly 1 tick away. didn't stop
> > > > tick.
> > >
> > > If the delta is 1 tick then it is not supposed to stop it. Did you
> > > ever try to figure out WHY it is 1 tick?
> > >
> > > There are two code pathes which can set it to basemono + TICK_NSEC:
> > >
> > > if (rcu_needs_cpu(basemono, &next_rcu) ||
> > > arch_needs_cpu() || irq_work_needs_cpu()) {
> > > next_tick = basemono + TICK_NSEC;
> > > } else {
> > > next_tmr = get_next_timer_interrupt(basejiff,
> > > basemono); ts->next_timer = next_tmr;
> > > /* Take the next rcu event into account */
> > > next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
> > > }
> > >
> > > Can you please figure out WHY the tick is requested to continue
> > > instead of blindly wreckaging the logic in that code?
> >
> > Looks like the it hits in both cases during forced idle.
> > + Josh
> > + Paul
> >
> > For the first case, it is always related to RCU. I found there are two
> > CONFIG options to avoid this undesired tick in idle loop.
> > 1. enable CONFIG_RCU_NOCB_CPU_ALL, offload to orcu kthreads
> > 2. or enable CONFIG_RCU_FAST_NO_HZ (enter dytick idle w/ rcu callback)
> >
> > Either one works but my concern is that users may not realize the
> > intricate CONFIG_ options and how they translate into energy savings.
> > Consulted with Josh, it seems we could add a check here to recognize
> > the forced idle state and relax rcu_needs_cpu() to return false even it
> > has callbacks. Since we are blocking everybody for a short time (5 ticks
> > default). It should not impact synchronize and kfree rcu.
>
> Right; as long as you're blocking *everybody*, and RCU priority boosting
> doesn't come into play (meaning a real-time task is waiting on RCU
> callbacks), then I don't see any harm in blocking RCU callbacks for a
> while. You'd block completion of synchronize_rcu() and similar, as well
> as memory reclamation, but since you've blocked *every* CPU systemwide
> then that doesn't cause a problem.
True enough. But how does RCU distinguish between this being a
normal idle cycle that might last indefinitely on the one hand and the
five-jiffy system-wide throttling on the other? OK, maybe there is a
global variable that says that the just-now-starting idle period is
system-wide throttling. But then what about the CPU that just went
idle 10 microseconds ago, and therefore left its timer tick running?
Fine and well, we could IPI it to wake it up and let it see that we
are now doing thermal throttling. But then we presumably also have to
IPI it at the end of the thermal-throttling interval in order for it to
re-evaluate whether or not it should have the tick going. :-/
On the one hand, I am sure that all of this can be made to work,
but simply having systems using thermal throttling enable either
CONFIG_RCU_NOCB_CPU_ALL or CONFIG_RCU_FAST_NO_HZ seems -way- simpler.
CONFIG_RCU_FAST_NO_HZ is probably the better choice for generic workloads,
but CONFIG_RCU_NOCB_CPU_ALL is the better choice for embedded workloads
where it is less likely that RCU callbacks will be posted with continuous
wild abandon.
Or am I missing something subtle here?
Thanx, Paul
On Mon, Nov 16, 2015 at 03:15:03PM -0800, Jacob Pan wrote:
> On Mon, 16 Nov 2015 14:31:17 -0800
> "Paul E. McKenney" <[email protected]> wrote:
>
> > > Either one works but my concern is that users may not realize the
> > > intricate CONFIG_ options and how they translate into energy
> > > savings. Consulted with Josh, it seems we could add a check here to
> > > recognize the forced idle state and relax rcu_needs_cpu() to return
> > > false even it has callbacks. Since we are blocking everybody for a
> > > short time (5 ticks default). It should not impact synchronize and
> > > kfree rcu.
> >
> > Or we could just set things up so that whatever Kconfig you are using
> > to enable this state causes CONFIG_RCU_NOCB_CPU_ALL to also be
> > enabled. Or that causes CONFIG_RCU_FAST_NO_HZ to also be enabled, if
> > that works better for you.
>
> That would be great, we can work this out once the patch is
> finalized. This is not a hard dependency in that it only affects the
> efficiency of idle injection.
Is this mostly a special-purpose embedded thing, or do you expect distros
to be enabling this? If the former, I suggest CONFIG_RCU_NOCB_CPU_ALL,
but if distros are doing this for general-purpose workloads, I instead
suggest CONFIG_RCU_FAST_NO_HZ.
But as you say, we can work this out later. Figured I should ask now,
though, just to get people thinking about it.
Thanx, Paul
On 11/16/2015 3:28 PM, Paul E. McKenney wrote:
> Is this mostly an special-purpose embedded thing, or do you expect distros
> to be enabling this? If the former, I suggest CONFIG_RCU_NOCB_CPU_ALL,
> but if distros are doing this for general-purpose workloads, I instead
> suggest CONFIG_RCU_FAST_NO_HZ.
thermal overload happens a lot on small devices, but sadly also in big datacenters
where it is not uncommon to underprovision cooling capacity by a bit
(it's one of those "99% of the time you only need THIS much, the 1% you need 30% more"
and that more is expensive or even impractical)
On Mon, 16 Nov 2015 15:32:38 -0800
Arjan van de Ven <[email protected]> wrote:
> On 11/16/2015 3:28 PM, Paul E. McKenney wrote:
>
> > Is this mostly an special-purpose embedded thing, or do you expect
> > distros to be enabling this? If the former, I suggest
> > > CONFIG_RCU_NOCB_CPU_ALL, but if distros are doing this for
> > general-purpose workloads, I instead suggest CONFIG_RCU_FAST_NO_HZ.
>
> thermal overload happens a lot on small devices, but sadly also in
> big datacenters where it is not uncommon to underprovision cooling
> capacity by a bit (it's one of those "99% of the time you only need
> THIS much, the 1% you need 30% more" and that more is expensive or
> even impractical)
CONFIG_RCU_FAST_NO_HZ is more in line with idle injection in that both
target energy efficiency.
Checking Ubuntu and Fedora, it seems both have CONFIG_RCU_NOCB_CPU_ALL=y.
On Mon, Nov 16, 2015 at 03:40:46PM -0800, Jacob Pan wrote:
> On Mon, 16 Nov 2015 15:32:38 -0800
> Arjan van de Ven <[email protected]> wrote:
>
> > On 11/16/2015 3:28 PM, Paul E. McKenney wrote:
> >
> > > Is this mostly an special-purpose embedded thing, or do you expect
> > > distros to be enabling this? If the former, I suggest
> > > CONFIG_RCU_NOCB_CPU_ALL, but if distros are doing this for
> > > general-purpose workloads, I instead suggest CONFIG_RCU_FAST_NO_HZ.
> >
> > thermal overload happens a lot on small devices, but sadly also in
> > big datacenters where it is not uncommon to underprovision cooling
> > capacity by a bit (it's one of those "99% of the time you only need
> > THIS much, the 1% you need 30% more" and that more is expensive or
> > even impractical)
Then CONFIG_RCU_FAST_NO_HZ it is.
> CONFIG_RCU_FAST_NO_HZ is more inline with idle injection in that both
> are targeted energy efficiency.
> Checking ubuntu and fedora, seems both have CONFIG_RCU_NOCB_CPU_ALL=y.
Ah, that is right -- they both do CONFIG_NO_HZ_FULL=y, which does imply
CONFIG_RCU_NOCB_CPU_ALL=y. But SUSE does not, last I knew.
But it should be easy to set up Kconfig for this. Make your new Kconfig
option select CONFIG_RCU_FAST_NO_HZ unless CONFIG_RCU_NOCB_CPU_ALL is
already set. Alternatively, CONFIG_RCU_FAST_NO_HZ could be set up
something like the following:
depends on (NO_HZ_COMMON && SMP && RCU_EXPERT) || CONFIG_THERM_THROT
default CONFIG_THERM_THROT
But the "select RCU_FAST_NO_HZ if !RCU_NOCB_CPU_ALL" is probably cleaner.
Anyway, again, the details can be settled later.
Thanx, Paul
On Mon, 16 Nov 2015 23:01:12 +0100 (CET)
Thomas Gleixner <[email protected]> wrote:
> > For the second case, which is much more rare, I think we do have
> > next timer exactly one tick away. Just don't know why tick will
> > continue into idle loop.
>
> Well, it should not be hard to figure that out. There are not so many
> checks involved when tick_nohz_irq_exit() is called.
Thanks for the tip, I found the cause is in:

int idle_cpu(int cpu)
{
	if (rq->nr_running)
		return 0;

Since we only account for cfs_rq->runnable taking over from
cfs_rq->nr_running when forced_idle is set, rq->nr_running is still
non-zero here during forced idle.
Jacob
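A possible adjustment along these lines (a sketch only, not a patch
posted in this thread; the helper name is hypothetical) would be to let
the nr_running check treat a forced-idle runqueue as idle:

/*
 * Sketch: consider a CPU idle for nohz purposes when its CFS runqueue
 * is in forced idle, even though rq->nr_running is non-zero, so the
 * tick can be stopped during injected idle.
 */
static inline bool rq_counts_as_idle(struct rq *rq)
{
	return !rq->nr_running || in_forced_idle(&rq->cfs);
}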
On Mon, Nov 16, 2015 at 03:26:40PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 16, 2015 at 02:32:11PM -0800, Josh Triplett wrote:
> > On Mon, Nov 16, 2015 at 01:51:26PM -0800, Jacob Pan wrote:
> > > On Mon, 16 Nov 2015 16:06:57 +0100 (CET)
> > > Thomas Gleixner <[email protected]> wrote:
> > >
> > > > > <idle>-0 [000] 30.093474: bprint:
> > > > > __tick_nohz_idle_enter: JPAN: tick_nohz_stop_sched_tick 609 delta
> > > > > 1000000 [JP] but sees delta is exactly 1 tick away. didn't stop
> > > > > tick.
> > > >
> > > > If the delta is 1 tick then it is not supposed to stop it. Did you
> > > > ever try to figure out WHY it is 1 tick?
> > > >
> > > > There are two code pathes which can set it to basemono + TICK_NSEC:
> > > >
> > > > if (rcu_needs_cpu(basemono, &next_rcu) ||
> > > > arch_needs_cpu() || irq_work_needs_cpu()) {
> > > > next_tick = basemono + TICK_NSEC;
> > > > } else {
> > > > next_tmr = get_next_timer_interrupt(basejiff,
> > > > basemono); ts->next_timer = next_tmr;
> > > > /* Take the next rcu event into account */
> > > > next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
> > > > }
> > > >
> > > > Can you please figure out WHY the tick is requested to continue
> > > > instead of blindly wreckaging the logic in that code?
> > >
> > > Looks like the it hits in both cases during forced idle.
> > > + Josh
> > > + Paul
> > >
> > > For the first case, it is always related to RCU. I found there are two
> > > CONFIG options to avoid this undesired tick in idle loop.
> > > 1. enable CONFIG_RCU_NOCB_CPU_ALL, offload to orcu kthreads
> > > 2. or enable CONFIG_RCU_FAST_NO_HZ (enter dytick idle w/ rcu callback)
> > >
> > > Either one works but my concern is that users may not realize the
> > > intricate CONFIG_ options and how they translate into energy savings.
> > > Consulted with Josh, it seems we could add a check here to recognize
> > > the forced idle state and relax rcu_needs_cpu() to return false even it
> > > has callbacks. Since we are blocking everybody for a short time (5 ticks
> > > default). It should not impact synchronize and kfree rcu.
> >
> > Right; as long as you're blocking *everybody*, and RCU priority boosting
> > doesn't come into play (meaning a real-time task is waiting on RCU
> > callbacks), then I don't see any harm in blocking RCU callbacks for a
> > while. You'd block completion of synchronize_rcu() and similar, as well
> > as memory reclamation, but since you've blocked *every* CPU systemwide
> > then that doesn't cause a problem.
>
> True enough. But how does RCU distinguish between this being a
> normal idle cycle that might last indefinitely on the one hand and the
> five-jiffy system-wide throttling on the other? OK, maybe there is a
> global variable that says that the just-now-starting idle period is
> system-wide throttling. But then what about the CPU that just went
> idle 10 microseconds ago, and therefore left its timer tick running?
> Fine and well, we could IPI it to wake it up and let it see that we
> are now doing thermal throttling. But then we presumably also have to
> IPI it at the end of the thermal-throttling interval in order for it to
> re-evaluate whether or not it should have the tick going. :-/
>
> On the one hand, I am sure that all of this can be made to work,
> but simply having systems using thermal throttling enable either
> CONFIG_RCU_NOCB_CPU_ALL or CONFIG_RCU_FAST_NO_HZ seems -way- simpler.
> CONFIG_RCU_FAST_NO_HZ is probably the better choice for generic workloads,
> but CONFIG_RCU_NOCB_CPU_ALL is the better choice for embedded workloads
> where it is less likely that RCU callbacks will be posted with continuous
> wild abandon.
>
> Or am I missing something subtle here?
I agree that it seems preferable to make this require an existing RCU
solution rather than adding more complexity to the RCU idle path. One
possible thing that may affect the choice of solution: this needs to
idle *every* CPU, without leaving any CPU awake to handle callbacks or
similar.
- Josh Triplett
On Mon, Nov 16, 2015 at 05:41:03PM -0800, Josh Triplett wrote:
> On Mon, Nov 16, 2015 at 03:26:40PM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 16, 2015 at 02:32:11PM -0800, Josh Triplett wrote:
> > > On Mon, Nov 16, 2015 at 01:51:26PM -0800, Jacob Pan wrote:
> > > > On Mon, 16 Nov 2015 16:06:57 +0100 (CET)
> > > > Thomas Gleixner <[email protected]> wrote:
> > > >
> > > > > > <idle>-0 [000] 30.093474: bprint:
> > > > > > __tick_nohz_idle_enter: JPAN: tick_nohz_stop_sched_tick 609 delta
> > > > > > 1000000 [JP] but sees delta is exactly 1 tick away. didn't stop
> > > > > > tick.
> > > > >
> > > > > If the delta is 1 tick then it is not supposed to stop it. Did you
> > > > > ever try to figure out WHY it is 1 tick?
> > > > >
> > > > > There are two code pathes which can set it to basemono + TICK_NSEC:
> > > > >
> > > > > if (rcu_needs_cpu(basemono, &next_rcu) ||
> > > > > arch_needs_cpu() || irq_work_needs_cpu()) {
> > > > > next_tick = basemono + TICK_NSEC;
> > > > > } else {
> > > > > next_tmr = get_next_timer_interrupt(basejiff,
> > > > > basemono); ts->next_timer = next_tmr;
> > > > > /* Take the next rcu event into account */
> > > > > next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
> > > > > }
> > > > >
> > > > > Can you please figure out WHY the tick is requested to continue
> > > > > instead of blindly wreckaging the logic in that code?
> > > >
> > > > Looks like the it hits in both cases during forced idle.
> > > > + Josh
> > > > + Paul
> > > >
> > > > For the first case, it is always related to RCU. I found there are two
> > > > CONFIG options to avoid this undesired tick in idle loop.
> > > > 1. enable CONFIG_RCU_NOCB_CPU_ALL, offload to orcu kthreads
> > > > 2. or enable CONFIG_RCU_FAST_NO_HZ (enter dytick idle w/ rcu callback)
> > > >
> > > > Either one works but my concern is that users may not realize the
> > > > intricate CONFIG_ options and how they translate into energy savings.
> > > > Consulted with Josh, it seems we could add a check here to recognize
> > > > the forced idle state and relax rcu_needs_cpu() to return false even it
> > > > has callbacks. Since we are blocking everybody for a short time (5 ticks
> > > > default). It should not impact synchronize and kfree rcu.
> > >
> > > Right; as long as you're blocking *everybody*, and RCU priority boosting
> > > doesn't come into play (meaning a real-time task is waiting on RCU
> > > callbacks), then I don't see any harm in blocking RCU callbacks for a
> > > while. You'd block completion of synchronize_rcu() and similar, as well
> > > as memory reclamation, but since you've blocked *every* CPU systemwide
> > > then that doesn't cause a problem.
> >
> > True enough. But how does RCU distinguish between this being a
> > normal idle cycle that might last indefinitely on the one hand and the
> > five-jiffy system-wide throttling on the other? OK, maybe there is a
> > global variable that says that the just-now-starting idle period is
> > system-wide throttling. But then what about the CPU that just went
> > idle 10 microseconds ago, and therefore left its timer tick running?
> > Fine and well, we could IPI it to wake it up and let it see that we
> > are now doing thermal throttling. But then we presumably also have to
> > IPI it at the end of the thermal-throttling interval in order for it to
> > re-evaluate whether or not it should have the tick going. :-/
> >
> > On the one hand, I am sure that all of this can be made to work,
> > but simply having systems using thermal throttling enable either
> > CONFIG_RCU_NOCB_CPU_ALL or CONFIG_RCU_FAST_NO_HZ seems -way- simpler.
> > CONFIG_RCU_FAST_NO_HZ is probably the better choice for generic workloads,
> > but CONFIG_RCU_NOCB_CPU_ALL is the better choice for embedded workloads
> > where it is less likely that RCU callbacks will be posted with continuous
> > wild abandon.
> >
> > Or am I missing something subtle here?
>
> I agree that it seems preferable to make this require an existing RCU
> solution rather than adding more complexity to the RCU idle path. One
> possible thing that may affect the choice of solution: this needs to
> idle *every* CPU, without leaving any CPU awake to handle callbacks or
> similar.
Fair point. When in the five-jiffy throttling state, what can wake up
a CPU? In an earlier version of this proposal, the answer was "nothing",
but maybe that has changed.
Thanx, Paul
On 11/16/2015 6:53 PM, Paul E. McKenney wrote:
> Fair point. When in the five-jiffy throttling state, what can wake up
> a CPU? In an earlier version of this proposal, the answer was "nothing",
> but maybe that has changed.
device interrupts are likely to wake the cpus.
On Mon, Nov 16, 2015 at 06:57:14PM -0800, Arjan van de Ven wrote:
> On 11/16/2015 6:53 PM, Paul E. McKenney wrote:
> >Fair point. When in the five-jiffy throttling state, what can wake up
> >a CPU? In an earlier version of this proposal, the answer was "nothing",
> >but maybe that has changed.
>
> device interrupts are likely to wake the cpus.
OK, that I cannot help you with. But presumably if the interrupt handler
does a wakeup (or similar), that is deferred to the end of the throttling
interval? Timers are also deferred, including hrtimers?
Thanx, Paul
On Mon, Nov 16, 2015 at 09:04:03PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 16, 2015 at 06:57:14PM -0800, Arjan van de Ven wrote:
> > On 11/16/2015 6:53 PM, Paul E. McKenney wrote:
> > >Fair point. When in the five-jiffy throttling state, what can wake up
> > >a CPU? In an earlier version of this proposal, the answer was "nothing",
> > >but maybe that has changed.
> >
> > device interrupts are likely to wake the cpus.
>
> OK, that I cannot help you with. But presumably if the interrupt handler
> does a wakeup (or similar), that is deferred to the end of the throttling
> interval? Timers are also deferred, including hrtimers?
This throttling thing only throttles 'normal' tasks, real-time tasks
will still run.
On Tue, 17 Nov 2015 11:24:49 +0100
Peter Zijlstra <[email protected]> wrote:
> On Mon, Nov 16, 2015 at 09:04:03PM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 16, 2015 at 06:57:14PM -0800, Arjan van de Ven wrote:
> > > On 11/16/2015 6:53 PM, Paul E. McKenney wrote:
> > > >Fair point. When in the five-jiffy throttling state, what can
> > > >wake up a CPU? In an earlier version of this proposal, the
> > > >answer was "nothing", but maybe that has changed.
> > >
> > > device interrupts are likely to wake the cpus.
> >
> > OK, that I cannot help you with. But presumably if the interrupt
> > handler does a wakeup (or similar), that is deferred to the end of
> > the throttling interval? Timers are also deferred, including
> > hrtimers?
>
> This throttling thing only throttles 'normal' tasks, real-time tasks
> will still run.
As an optimization or option, it might be useful to further defer the
next timer interrupt if it falls within the idle injection period. But
I guess we don't know if that timer belongs to a normal task or rt.
Also, there could be more than one 'next' timer interrupt falling into
that injected idle period.
On Tue, Nov 17, 2015 at 04:57:21AM -0800, Jacob Pan wrote:
> On Tue, 17 Nov 2015 11:24:49 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Mon, Nov 16, 2015 at 09:04:03PM -0800, Paul E. McKenney wrote:
> > > On Mon, Nov 16, 2015 at 06:57:14PM -0800, Arjan van de Ven wrote:
> > > > On 11/16/2015 6:53 PM, Paul E. McKenney wrote:
> > > > >Fair point. When in the five-jiffy throttling state, what can
> > > > >wake up a CPU? In an earlier version of this proposal, the
> > > > >answer was "nothing", but maybe that has changed.
> > > >
> > > > device interrupts are likely to wake the cpus.
> > >
> > > OK, that I cannot help you with. But presumably if the interrupt
> > > handler does a wakeup (or similar), that is deferred to the end of
> > > the throttling interval? Timers are also deferred, including
> > > hrtimers?
> >
> > This throttling thing only throttles 'normal' tasks, real-time tasks
> > will still run.
Heh! Then RCU will be delayed or not based on the priority of the
grace-period kthreads and softirqd. ;-)
In addition, this does sound like an excellent test for priority-inversion
situations that might otherwise go unnoticed on overprovisioned systems.
That said, I would expect many types of real-time systems to configure
voltage, frequency, and cooling so as to avoid thermal throttling.
> As an optimization or option, it might be useful to further defer the
> next timer interrupt if it falls within the idle injection period. But
> I guess we don't know if that timer belongs to a normal task or rt.
> Also we there could be more than one 'next' timer interrupts fall into
> that injection idle period.
Understood. This brings me back to my recommendation that throttling
select RCU_FAST_NO_HZ unless RCU_NOCB_CPU_ALL is already set.
Thanx, Paul
* Jacob Pan <[email protected]> wrote:
> With increasingly constrained power and thermal budget, it's often necessary to
> cap power via throttling. Throttling individual CPUs or devices at random times
> can help power capping but may not be optimal in terms of energy efficiency.
> Frequency scaling is also limited by certain range before losing energy
> efficiency.
>
> In general, the optimal solution in terms of energy efficiency is to align idle
> periods such that more shared circuits can be power gated to enter lower power
> states. Combined with energy efficient frequency point, idle injection provides
> a way to scale power and performance efficiently.
>
> This patch introduces a scheduler based idle injection method, it works by
> blocking CFS runqueue synchronously and periodically. The actions on all online
> CPUs are orchestrated by per CPU hrtimers.
>
> Two sysctl knobs are given to the userspace for selecting the
> percentage of idle time as well as the forced idle duration for each
> idle period injected.
What's the purpose of these knobs? Just testing, or will some user-space daemon
set them dynamically?
I.e. what mechanism will drive the throttling in the typical case?
> Since only CFS class is targeted, other high priority tasks are not affected,
> such as EDF and RT tasks as well as softirq and interrupts.
>
> Hotpath in CFS pick_next_task is optimized by Peter Zijlstra, where a new
> runnable flag is introduced to combine forced idle and nr_running.
> +config CFS_IDLE_INJECT
> + bool "Synchronized CFS idle injection"
> + depends on NO_HZ_IDLE && HIGH_RES_TIMERS
> + default n
> + help
> + This feature let scheduler inject synchronized idle time across all online
> + CPUs. Idle injection affects normal tasks only, yeilds to RT and interrupts.
> + Effecitvely, CPUs can be duty cycled between running at the most power
> + efficient performance state and deep idle states.
So there are 3 typos in this single paragraph alone ...
I also think that naming it 'idle injection' is pretty euphemistic: this is forced
idling, right? So why not name it CFS_FORCED_IDLE?
What will such throttling do to latencies, as observed by user-space tasks? What's
the typical expected frequency of the throttling frequency that you are targeting?
Thanks,
Ingo
On Wed, Nov 18, 2015 at 09:36:22AM +0100, Ingo Molnar wrote:
> What will such throttling do to latencies, as observed by user-space tasks? What's
> the typical expected frequency of the throttling frequency that you are targeting?
The default has 5ms (iirc) of forced idle, so depending on what you do,
noticeable to outright painful.
On Wed, Nov 18, 2015 at 11:35:41AM +0100, Peter Zijlstra wrote:
> On Wed, Nov 18, 2015 at 09:36:22AM +0100, Ingo Molnar wrote:
> > What will such throttling do to latencies, as observed by user-space tasks? What's
> > the typical expected frequency of the throttling frequency that you are targeting?
>
> The default has 5ms (iirc) of forced idle, so depending on what you do,
> noticeable to outright painful.
IIUC, it is 5 ticks, not ms.
Which raises the question, doesn't that mean that we get disturbed four
times on each cpu during the forced idle period? So idle injection only
makes sense if the platform has package states with a target residency
less than a jiffy. Or, do we enter NOHZ idle? I haven't looked closely
enough to figure out yet.
On Wed, Nov 18, 2015 at 12:27:04PM +0000, Morten Rasmussen wrote:
> On Wed, Nov 18, 2015 at 11:35:41AM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 18, 2015 at 09:36:22AM +0100, Ingo Molnar wrote:
> > > What will such throttling do to latencies, as observed by user-space tasks? What's
> > > the typical expected frequency of the throttling frequency that you are targeting?
> >
> > The default has 5ms (iirc) of forced idle, so depending on what you do,
> > noticeable to outright painful.
>
> IIUC, it is 5 ticks, not ms.
The code uses hrtimers (badly), this means there _should_ not be a tick
dependency.
> Which raises the question, doesn't that mean that we get disturbed four
> times on each cpu during the forced idle period? So idle injection only
> makes sense if the platform has package states with a target residency
> less than a jiffy. Or, do we enter NOHZ idle? I haven't looked closely
> enough to figure out yet.
The idea is to hit NOHZ.
On Wed, Nov 18, 2015 at 01:49:46PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 18, 2015 at 12:27:04PM +0000, Morten Rasmussen wrote:
> > On Wed, Nov 18, 2015 at 11:35:41AM +0100, Peter Zijlstra wrote:
> > > On Wed, Nov 18, 2015 at 09:36:22AM +0100, Ingo Molnar wrote:
> > > > What will such throttling do to latencies, as observed by user-space tasks? What's
> > > > the typical expected frequency of the throttling frequency that you are targeting?
> > >
> > > The default has 5ms (iirc) of forced idle, so depending on what you do,
> > > noticeable to outright painful.
> >
> > IIUC, it is 5 ticks, not ms.
>
> The code uses hrtimers (badly), this means there _should_ not be a tick
> dependency.
Then I'm confused :-/
I see the hrtimers, but the actual idle duration appears to be in ticks
rather than ms and then converted later.
+/* Duration of idle time in ticks of each injection period */
+unsigned int sysctl_sched_cfs_idle_inject_duration = 5UL;
...and further down we have:
+ duration_msec = jiffies_to_msecs(sysctl_sched_cfs_idle_inject_duration);
I will go back and look harder.
>
> > Which raises the question, doesn't that mean that we get disturbed four
> > times on each cpu during the forced idle period? So idle injection only
> > makes sense if the platform has package states with a target residency
> > less than a jiffy. Or, do we enter NOHZ idle? I haven't looked closely
> > enough to figure out yet.
>
> The idea is to hit NOHZ.
Nice!
On Wed, 18 Nov 2015 09:36:22 +0100
Ingo Molnar <[email protected]> wrote:
> >
> > This patch introduces a scheduler based idle injection method, it
> > works by blocking CFS runqueue synchronously and periodically. The
> > actions on all online CPUs are orchestrated by per CPU hrtimers.
> >
> > Two sysctl knobs are given to the userspace for selecting the
> > percentage of idle time as well as the forced idle duration for each
> > idle period injected.
>
> What's the purpose of these knobs? Just testing, or will some
> user-space daemon set them dynamically?
>
yes, it is to be used by a userspace daemon such as the thermal daemon.
There is also interest from an in-kernel thermal governor, but that is
another story.
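A rough sketch of how such a daemon could drive the two knobs; the
/proc/sys paths and the values below are purely illustrative, not a
final ABI:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write one value to a sysctl file. */
static int write_sysctl(const char *path, unsigned int val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%u\n", val);
	return fclose(f);
}

int main(void)
{
	/* e.g. inject idle 25% of the time, 5 ticks per injection period */
	if (write_sysctl("/proc/sys/kernel/sched_cfs_idle_inject_pct", 25) ||
	    write_sysctl("/proc/sys/kernel/sched_cfs_idle_inject_duration", 5)) {
		perror("sched_cfs_idle_inject");
		return EXIT_FAILURE;
	}
	return 0;
}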
> I also think that naming it 'idle injection' is pretty euphemistic:
> this is forced idling, right? So why not name it CFS_FORCED_IDLE?
yes, it is forced idle. sounds good.
On 11/18/2015 12:36 AM, Ingo Molnar wrote:
>
> What will such throttling do to latencies, as observed by user-space tasks? What's
> the typical expected frequency of the throttling frequency that you are targeting?
For this to meaningfully reduce power consumption, deep system power states need to be reached,
and for those to pay off, several milliseconds of idle are required.
For hard realtime stuff that is obviously insane, but I would assume that in those cases
your system is thermally fine. This only kicks in at the end of a "we're in thermal
problems" path, which can happen both on clients (thin devices) as well as servers
(airconditioning issues). The objective is for this to kick in before the
hardware's built-in protections kick in (which are power off/reboot, depending on a bunch of things).
How often these 5 msec get injected depends on how deep the system
is in trouble, and is zero if the system is not in trouble.
The idea is that for the user it is better to inject several 5 msec intervals than it is
to inject one longer period.
You can compare this method to other ways of reducing thermal issues (like lowering cpu frequency),
and in a typical setup, this is done after the more benign of those methods are exhausted.
Lowering frequency even further is usually of low efficiency (you need to lower the frequency a LOT
to gain a little bit of power in the bottom parts of the frequency range), while this idle will
not only put the CPU in low power, but will also put the system memory in low power and usually
a big chunk of the rest of the SOC. In many client systems, memory power consumption is higher than CPU
power consumption (and in big servers, it's also quite sizable), so there is a pretty hard
limit on how much you can do on thermals if you're not also kicking in some of the memory power savings.
This means that to achieve a certain amount of reduction, performance is impacted a lot less than
by the more drastic methods you would need on the cpu side, if possible at all.
(Stepping your 2 GHz cpu down to 50 MHz may sound less evil than injecting 5 msec of idle time, but in
reality it impacts user tasks a heck of a lot more than 5 msec of not being scheduled.)
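To put rough numbers on that comparison (the 10 msec task length is an
assumption; the 2 GHz, 50 MHz and 5 msec figures are the ones above; the
model deliberately ignores everything else):

#include <stdio.h>

int main(void)
{
	double work_ms_at_2ghz = 10.0;		/* assumed task length at full speed */
	double slow_factor = 2000.0 / 50.0;	/* 2 GHz stepped down to 50 MHz */

	double t_freq_scaled = work_ms_at_2ghz * slow_factor;
	double t_idle_inject = work_ms_at_2ghz + 5.0;	/* one 5 msec forced-idle period */

	printf("run at 50 MHz:        %.0f ms\n", t_freq_scaled);	/* 400 ms */
	printf("5 msec idle injected: %.0f ms\n", t_idle_inject);	/* 15 ms */
	return 0;
}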
On Wed, 18 Nov 2015 14:04:41 +0000
Morten Rasmussen <[email protected]> wrote:
> Then I'm confused :-/
>
> I see the hrtimers, but the actual idle duration appears to be in
> ticks rather than ms and then converted later.
>
> +/* Duration of idle time in ticks of each injection period
> */ +unsigned int sysctl_sched_cfs_idle_inject_duration = 5UL;
>
> ...and futher down we have:
>
> + duration_msec =
> jiffies_to_msecs(sysctl_sched_cfs_idle_inject_duration);
I used hrtimers but still round to tick boundaries of ktime.
It is more reliable than jiffies in terms of getting timely updates.
Jacob
On Wed, Nov 18, 2015 at 06:52:33AM -0800, Jacob Pan wrote:
> On Wed, 18 Nov 2015 14:04:41 +0000
> Morten Rasmussen <[email protected]> wrote:
>
> > Then I'm confused :-/
> >
> > I see the hrtimers, but the actual idle duration appears to be in
> > ticks rather than ms and then converted later.
> >
> > +/* Duration of idle time in ticks of each injection period
> > */ +unsigned int sysctl_sched_cfs_idle_inject_duration = 5UL;
> >
> > ...and futher down we have:
> >
> > + duration_msec =
> > jiffies_to_msecs(sysctl_sched_cfs_idle_inject_duration);
> I used hrtimers but still round around tick boundaries of ktime.
> It is more reliable than jiffies in terms of getting timely updates.
Okay. But it does mean that the default idle period is 5 ticks (50ms @
HZ=100) and not 5 ms.
On Wed, 18 Nov 2015 15:09:44 +0000
Morten Rasmussen <[email protected]> wrote:
> Okay. But it does mean that the defeault idle period is 5 ticks (50ms
> @ HZ=100) and not 5 ms
correct. my reason is to scale with various HZ values.
On Wed, 18 Nov 2015, Jacob Pan wrote:
> On Wed, 18 Nov 2015 15:09:44 +0000
> Morten Rasmussen <[email protected]> wrote:
>
> > Okay. But it does mean that the defeault idle period is 5 ticks (50ms
> > @ HZ=100) and not 5 ms
> correct. my reason is to scale with various HZ values.
So for smaller HZ values we get longer disruption. That's well thought
out scaling.
Thanks,
tglx
On Fri, Nov 13, 2015 at 11:53:06AM -0800, Jacob Pan wrote:
> With increasingly constrained power and thermal budget, it's often
> necessary to cap power via throttling. Throttling individual CPUs
> or devices at random times can help power capping but may not be
> optimal in terms of energy efficiency. Frequency scaling is also
> limited by certain range before losing energy efficiency.
>
> In general, the optimal solution in terms of energy efficiency is
> to align idle periods such that more shared circuits can be power
> gated to enter lower power states. Combined with energy efficient
> frequency point, idle injection provides a way to scale power and
> performance efficiently.
>
> This patch introduces a scheduler based idle injection method, it
> works by blocking CFS runqueue synchronously and periodically. The
> actions on all online CPUs are orchestrated by per CPU hrtimers.
I fully agree with the idea of synchronous duty cycling of cpus for
thermal management to avoid those inefficient low frequencies where we
loose a lot of energy to static power consumption and get very little
work done. However, I would not necessarily want to punish all cpus
system-wide if we have local overheating in one corner. If would rather
have it apply to only the overheating socket in a multi-socket machine
and only the big cores in a big.LITTLE system.
Several people have already brought up that we should look into
integrating this with the thermal framework. It already has the concepts
of thermal zones and cooling devices which could be used to control
idle-injection separately for different sockets/clusters.
Also, letting user-space control idle-injection opens up for having two
entities (user-space daemon and the in-kernel thermal governor) stepping
on each others toes. Wouldn't the control of idle-injection naturally
belong to thermal governors so we have everything controlled from a
single place?
On 11/18/2015 7:44 AM, Morten Rasmussen wrote:
> I would not necessarily want to punish all cpus
> system-wide if we have local overheating in one corner. If would rather
> have it apply to only the overheating socket in a multi-socket machine
> and only the big cores in a big.LITTLE system.
most of the time thermal issues aren't inside the SOC, but on a system level
due to cheap heat spreaders or outright lack of space due to thinness. But
even if you have one part of the die too hot:
For core level idle injection, no need to synchronize that; the reason to synchronize
is generally that when ALL cores are idle, additional power savings kick in
(like memory going to self refresh, fabrics power gating etc); those additional
power savings are what makes this more efficient than just voltage/frequency
scaling at the bottom of that range... not so much the fact that things are just idle.
On Wed, Nov 18, 2015 at 07:11:31AM -0800, Jacob Pan wrote:
> On Wed, 18 Nov 2015 15:09:44 +0000
> Morten Rasmussen <[email protected]> wrote:
>
> > Okay. But it does mean that the defeault idle period is 5 ticks (50ms
> > @ HZ=100) and not 5 ms
> correct. my reason is to scale with various HZ values.
50ms seems quite long for embedded applications. OTOH, it has to be long
enough to enter a package idle state and stay there for a while. You
would surely notice it when it kicks in on an Android device.
Do you intend to use it only as a last option? One might want to keep
using inefficient frequencies (P-states) if those don't cause UI lag
then.
On Wed, 18 Nov 2015 16:21:27 +0100 (CET)
Thomas Gleixner <[email protected]> wrote:
> >
> > > Okay. But it does mean that the defeault idle period is 5 ticks
> > > (50ms @ HZ=100) and not 5 ms
> > correct. my reason is to scale with various HZ values.
>
> So for smaller HZ values we get longer disruption. That's well thought
> out scaling.
well, it might be too long for embedded systems that use 100HZ. Is there
a better way to scale at sub-tick granularity?
My original thought was that for smaller HZ values I assume they care
less about latency, so the idle injection period is proportional to
whatever HZ they set.
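For reference, the 5-tick default works out as follows for a few common
HZ values; this is just the jiffies-to-ms arithmetic discussed above,
nothing more:

#include <stdio.h>

int main(void)
{
	const unsigned int hz_values[] = { 100, 250, 300, 1000 };
	const unsigned int duration_ticks = 5;	/* sysctl default */
	unsigned int i;

	for (i = 0; i < sizeof(hz_values) / sizeof(hz_values[0]); i++)
		printf("HZ=%-4u -> %u ms of forced idle per injection period\n",
		       hz_values[i], duration_ticks * 1000 / hz_values[i]);
	return 0;
}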
[ CCing Steve Rostedt for the tracing bits ]
On Fri, Nov 13, 2015 at 11:53:07AM -0800, Jacob Pan wrote:
> Trace events for idle injection can be used to determine
> timer activities for checking synchronization. In addition they
> also helps to determine when the runqueue is throttled.
>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> include/linux/sched.h | 5 +++++
> include/trace/events/sched.h | 25 +++++++++++++++++++++++++
> kernel/sched/fair.c | 3 +++
> 3 files changed, 33 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ff551a3..99c79bf 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -3189,6 +3189,11 @@ extern int proc_sched_cfs_idle_inject_duration_handler(struct ctl_table *table,
> int write,
> void __user *buffer,
> size_t *length, loff_t *ppos);
> +enum cfs_idle_inject_action {
> + CFS_IDLE_INJECT_TIMER, /* timer sync point */
> + CFS_IDLE_INJECT_FORCED, /* idle forced in rq */
> + CFS_IDLE_INJECT_YIELD_SOFTIRQ /* yield to pending softirq */
> +};
> #endif
>
> #endif
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 539d6bc..52c11c1 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -566,6 +566,31 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
>
> TP_printk("cpu=%d", __entry->cpu)
> );
> +
> +#ifdef CONFIG_CFS_IDLE_INJECT
> +/*
> + * Tracepoint for idle injection
> + */
> +TRACE_EVENT(sched_cfs_idle_inject,
> +
> + TP_PROTO(enum cfs_idle_inject_action action, int throttled),
> +
> + TP_ARGS(action, throttled),
> +
> + TP_STRUCT__entry(
> + __field(enum cfs_idle_inject_action, action)
> + __field(int, throttled)
> + ),
> +
> + TP_fast_assign(
> + __entry->action = action;
> + __entry->throttled = throttled;
> + ),
> +
> + TP_printk("action:%d throttled:%d", __entry->action, __entry->throttled)
> +);
> +#endif
> +
One minor nit: can you use key=value (i.e. "throttled=%d") instead for
consistency with the rest of this file?
Other than that, I know that Peter suggested an enum for the action,
but wouldn't it be better to create an EVENT_CLASS and subclass the
three actions from it? Something like:
DECLARE_EVENT_CLASS(sched_cfs_idle_inject_template,
	TP_PROTO(int throttled),
	TP_ARGS(throttled),
	TP_STRUCT__entry(
		__field(int, throttled)
	),
	TP_fast_assign(
		__entry->throttled = throttled;
	),
	TP_printk("throttled=%d", __entry->throttled)
);

DEFINE_EVENT(sched_cfs_idle_inject_template, sched_cfs_idle_inject_timer,
	TP_PROTO(int throttled),
	TP_ARGS(throttled));

DEFINE_EVENT(sched_cfs_idle_inject_template, sched_cfs_idle_inject_forced,
	TP_PROTO(int throttled),
	TP_ARGS(throttled));

DEFINE_EVENT(sched_cfs_idle_inject_template,
	     sched_cfs_idle_inject_yield_softirq,
	TP_PROTO(int throttled),
	TP_ARGS(throttled));
> #endif /* _TRACE_SCHED_H */
>
> /* This part must be outside protection */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a0cd777..20027eb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5248,8 +5248,10 @@ idle:
> if (in_forced_idle(cfs_rq)) {
> if (unlikely(local_softirq_pending())) {
> __unthrottle_cfs_rq(cfs_rq);
> + trace_sched_cfs_idle_inject(CFS_IDLE_INJECT_YIELD_SOFTIRQ, 1);
This becomes:
trace_sched_cfs_idle_inject_yield_softirq(1);
> goto again;
> }
> + trace_sched_cfs_idle_inject(CFS_IDLE_INJECT_FORCED, 1);
This becomes:
trace_sched_cfs_idle_inject_forced(1);
> return NULL;
> }
> /*
> @@ -8432,6 +8434,7 @@ static enum hrtimer_restart idle_inject_timer_fn(struct hrtimer *hrtimer)
> throttle_rq(cpu);
> }
> raw_cpu_write(idle_injected, !status);
> + trace_sched_cfs_idle_inject(CFS_IDLE_INJECT_TIMER, !status);
and this would be:
trace_sched_cfs_idle_inject_timer(!status);
>
> return HRTIMER_RESTART;
> }
Cheers,
Javi
On Thu, 19 Nov 2015 14:39:35 +0000
Javi Merino <[email protected]> wrote:
> > +
>
> One minor nit: can you use key=value (i.e. "throttled=%d") instead for
> consistency with the rest of this file?
>
will do.
> Other than that, I know that Peter suggested an enum for the action,
> but wouldn't it be better to create an EVENT_CLASS and subclass the
> three actions from it? Something like:
I agree, it will be more readable in the trace.
thank you
On Wed, Nov 18, 2015 at 07:51:47AM -0800, Arjan van de Ven wrote:
> On 11/18/2015 7:44 AM, Morten Rasmussen wrote:
> >I would not necessarily want to punish all cpus
> >system-wide if we have local overheating in one corner. If would rather
> >have it apply to only the overheating socket in a multi-socket machine
> >and only the big cores in a big.LITTLE system.
>
> most of the time thermal issues aren't inside the SOC, but on a system level
> due to cheap heat spreaders or outright lack of space due to thinness. But
> even if you have one part of the die too hot:
>
> For core level idle injection, no need to synchronize that; the reason to synchronize
> is generally that when ALL cores are idle, additional power savings kick in
> (like memory going to self refresh, fabrics power gating etc); those additional
> power savings are what makes this more efficient than just voltage/frequency
> scaling at the bottom of that range... not so much the fact that things are just idle.
I could see this technique being useful within the SoC as well.
Synchronized idle injection on all cpus in a cluster would allow us to
reach cluster idle states where resources shared within the cluster can
be gated or powered off as well. But yes, shutting down everything is
more efficient if you are in serious trouble.
I was hoping this could be a good alternative to hotplugging cpus out
for thermal management in non-critical situations, but it isn't that
appealing if you have to throttle all the cpus. I would consider it an
emergency-only mechanism (as in emergency brake) that isn't really
suitable for normal thermal management. In which case: Does this sort of
mechanism belong in the scheduler code?
On Mon, 16 Nov 2015 16:09:10 -0800
Jacob Pan <[email protected]> wrote:
> > > For the second case, which is much more rare, I think we do have
> > > next timer exactly one tick away. Just don't know why tick will
> > > continue into idle loop.
> >
> > Well, it should not be hard to figure that out. There are not so
> > many checks involved when tick_nohz_irq_exit() is called.
> Thanks for the tip, I found the cause is in
> int idle_cpu(int cpu)
> {
> if (rq->nr_running)
> return 0;
>
> Since we only take into account of cfs_rq runnable taking over
> cfs_rq->nr_running when forced_idle is set.
I am not sure what is the best solution. It seems I can add additional
checks like this.
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3520,9 +3520,14 @@ int idle_cpu(int cpu)
 	if (rq->curr != rq->idle)
 		return 0;
 
-	if (rq->nr_running)
-		return 0;
-
+	if (rq->nr_running) {
+		/* if cfs_rq is in forced idle, nr_running could be
+		 * nonzero but still in idle */
+		if ((rq->nr_running != rq->cfs.h_nr_running) ||
+		    cfs_rq_runnable(&rq->cfs))
+			return 0;
+	}
To recap the problem statement:
1. When entering the idle loop, tick_nohz_stop_sched_tick() checks if the
next timer interrupt is exactly one tick away. If so, it will not stop
the tick, to avoid thrashing timer disable and enable.
2. So it relies on the next round of tick_nohz_irq_exit() to have another
chance to stop the tick.
3. With idle injection, rq->nr_running could be nonzero while in idle.
4. tick_nohz_irq_exit() will not be called if !idle_cpu().
However, idle_cpu() is used by many other callers, e.g. load balance.
Do we want to consider forced idle in those cases? Or can we forgo this
case and ignore it?
Jacob
On Mon, Nov 16, 2015 at 04:09:10PM -0800, Jacob Pan wrote:
> On Mon, 16 Nov 2015 23:01:12 +0100 (CET)
> Thomas Gleixner <[email protected]> wrote:
>
> > > For the second case, which is much more rare, I think we do have
> > > next timer exactly one tick away. Just don't know why tick will
> > > continue into idle loop.
> >
> > Well, it should not be hard to figure that out. There are not so many
> > checks involved when tick_nohz_irq_exit() is called.
> Thanks for the tip, I found the cause is in
> int idle_cpu(int cpu)
Which one? That is, I cannot find a idle_cpu() call from
tick_nohz_irq_exit().
On Thu, 19 Nov 2015 20:06:30 +0100
Peter Zijlstra <[email protected]> wrote:
> Which one? That is, I cannot find a idle_cpu() call from
> tick_nohz_irq_exit().
it is tick_irq_exit(void); it will not call tick_nohz_irq_exit() if
!idle_cpu().
Please see my other email for the complete story, pasted below.
I am not sure what is the best solution. It seems I can add additional
checks like this.
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3520,9 +3520,14 @@ int idle_cpu(int cpu)
 	if (rq->curr != rq->idle)
 		return 0;
 
-	if (rq->nr_running)
-		return 0;
-
+	if (rq->nr_running) {
+		/* if cfs_rq is in forced idle, nr_running could be
+		 * nonzero but still in idle */
+		if ((rq->nr_running != rq->cfs.h_nr_running) ||
+		    cfs_rq_runnable(&rq->cfs))
+			return 0;
+	}
To recap the problem statement:
1. When entering the idle loop, tick_nohz_stop_sched_tick() checks if the
next timer interrupt is exactly one tick away. If so, it will not stop
the tick, to avoid thrashing timer disable and enable.
2. So it relies on the next round of tick_nohz_irq_exit() to have another
chance to stop the tick.
3. With idle injection, rq->nr_running could be nonzero while in idle.
4. tick_nohz_irq_exit() will not be called if !idle_cpu().
However, idle_cpu() is used by many other callers, e.g. load balance.
Do we want to consider forced idle in those cases? Or can we forgo this
case and ignore it?
On Thu, Nov 19, 2015 at 11:21:25AM -0800, Jacob Pan wrote:
> On Thu, 19 Nov 2015 20:06:30 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > Which one? That is, I cannot find a idle_cpu() call from
> > tick_nohz_irq_exit().
> it is tick_irq_exit(void), i will not call tick_nohz_irq_exit() if
> !idle_cpu()
Ah, but I think we really only need to test for the idle task there, the
&& need_resched() part of that function takes care of the rest.
Once we have need_resched() set, we'll be on our way to
tick_nohz_idle_exit() to restart the tick again.
---
kernel/softirq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..3bd30404f2ee 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -367,7 +367,7 @@ static inline void tick_irq_exit(void)
int cpu = smp_processor_id();
/* Make sure that timer wheel updates are propagated */
- if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu)) {
+ if ((is_idle_task(current) && !need_resched()) || tick_nohz_full_cpu(cpu)) {
if (!in_interrupt())
tick_nohz_irq_exit();
}
On Thu, Nov 19, 2015 at 05:24:07PM +0000, Morten Rasmussen wrote:
> I would consider it an
> emergency-only mechanism (as in emergency brake) that isn't really
> suitable for normal thermal management. In which case: Does this sort of
> mechanism belong in the scheduler code?
I would prefer it not to be, but Thomas is very much opposed to teaching
the nohz code to 'work' on !idle threads.
On Thu, 19 Nov 2015 20:59:05 +0100
Peter Zijlstra <[email protected]> wrote:
> On Thu, Nov 19, 2015 at 11:21:25AM -0800, Jacob Pan wrote:
> > On Thu, 19 Nov 2015 20:06:30 +0100
> > Peter Zijlstra <[email protected]> wrote:
> >
> > > Which one? That is, I cannot find a idle_cpu() call from
> > > tick_nohz_irq_exit().
> > it is tick_irq_exit(void), i will not call tick_nohz_irq_exit() if
> > !idle_cpu()
>
> Ah, but I think we really only need to test for the idle task there,
> the && need_resched() part of that function takes care of the rest.
>
> Once we have need_resched() set, we'll be on our way to
> tick_nohz_idle_exit() to restart the tick again.
Tested, it did the trick as well and has less impact than changing idle_cpu().
My other point is that callers of idle_cpu() will see it return false
while in forced idle. I can't see anything bad from that since we skip
load balance.
Thanks,
Jacob
On Thu, 19 Nov 2015, Peter Zijlstra wrote:
> On Thu, Nov 19, 2015 at 05:24:07PM +0000, Morten Rasmussen wrote:
> > I would consider it an
> > emergency-only mechanism (as in emergency brake) that isn't really
> > suitable for normal thermal management. In which case: Does this sort of
> > mechanism belong in the scheduler code?
>
> I would prefer it not to be, but Thomas is very much opposed to teaching
> the nohz code to 'work' on !idle threads.
The whole concept of faking idle is simply crap.
If you want to avoid that stuff in the scheduler, then create a
mechanism which just defers the next timer interrupt for X
milliseconds and does not do any fiddling with NOHZ state and such.
That might hurt RT tasks, but if someone really cares about real-time
and deterministic behaviour, then running the machine on its thermal
limits is simply stupid. In fact any sensible RT system will bring
itself into a safe state way before the machine runs into that
condition.
Thanks,
tglx
On Fri, Nov 20, 2015 at 10:45:14AM +0100, Thomas Gleixner wrote:
> The whole concept of faking idle is simply crap.
:-)
> If you want to avoid that stuff in the scheduler, then create a
> mechanism which just defers the next timer interrupt for X
> milliseconds and does not any fiddling with NOHZ state and such.
Not sure that will actually result in the machine going idle. Device
interrupts will still wake tasks and get them run.
On Fri, 20 Nov 2015, Peter Zijlstra wrote:
> On Fri, Nov 20, 2015 at 10:45:14AM +0100, Thomas Gleixner wrote:
>
> > The whole concept of faking idle is simply crap.
>
> :-)
>
> > If you want to avoid that stuff in the scheduler, then create a
> > mechanism which just defers the next timer interrupt for X
> > milliseconds and does not any fiddling with NOHZ state and such.
>
> Not sure that will actually result in the machine going idle. Device
> interrupts will still wake tasks and get them ran.
That's not what I meant. If you don't want to control all that from
the scheduler than you are back to that thread which "runs" at RT
priority and does
if (machine_on_fire) {
	defer_timer_interrupt(5ms);
	end = now + 5ms;
	while (now < end)
		mwait();
}
That's what the existing code does, but the above no longer claims
it's idle and confuses the hell out of nohz and whatever. It's just a
"runaway" RT task which "hogs" the CPU for 5ms and makes the next
timer interrupt fire late.
Thanks,
tglx
On Fri, Nov 20, 2015 at 11:58:12AM +0100, Thomas Gleixner wrote:
> That's not what I meant. If you don't want to control all that from
> the scheduler than you are back to that thread which "runs" at RT
> priority and does
>
> if (machine_on_fire) {
> defer_timer_interrupt(5ms);
> end = now + 5ms:
> while (now < end)
> mwait();
> }
>
> That's what the existing code does, but the above does not longer
> claim it's idle and confuses the hell out of nohz and whatever. It's
> just a "runaway" RT task which "hogs" the CPU for 5ms and makes the
> next timer interrupt firing late.
Right; so the naive way of implementing that is by simply programming the
timer hardware 5ms in the future and leaving it at that.
The problem with that would be a device interrupt happening and mucking
with timers; this would result in the timer hardware being reprogrammed
to a 'sane' value. I see two solutions for that:
- add another check in tick_program_event(); or,
- muck about with the evtdev pointer, such that we (temporarily) neuter
its clock_event_device::set_next_*() methods.
The latter is fugly, but avoids any runtime overhead.
This all makes the idle-injection muck hard depend on hres_active, but I
think that's a sane constraint anyway.
On Fri, 20 Nov 2015, Peter Zijlstra wrote:
> On Fri, Nov 20, 2015 at 11:58:12AM +0100, Thomas Gleixner wrote:
> > That's not what I meant. If you don't want to control all that from
> > the scheduler than you are back to that thread which "runs" at RT
> > priority and does
> >
> > if (machine_on_fire) {
> > defer_timer_interrupt(5ms);
> > end = now + 5ms:
> > while (now < end)
> > mwait();
> > }
> >
> > That's what the existing code does, but the above does not longer
> > claim it's idle and confuses the hell out of nohz and whatever. It's
> > just a "runaway" RT task which "hogs" the CPU for 5ms and makes the
> > next timer interrupt firing late.
>
> Right; so the naive way of implementing that is by simply programing the
> timer hardware 5ms in the future and leaving it at that.
>
> The problem with that would be a device interrupt happening and mucking
> with timers, this would result in the timer hardware being reprogrammed
> to a 'sane' value. I see two solutions for that:
>
> - add another check in tick_program_event(); or,
>
> - muck about with the evtdev pointer, such that we (temporarily) neuter
> its clock_event_device::set_next_*() methods.
>
> The later is fugly, but avoids any runtime overhead.
Yes, it's fugly, but it does not touch any of the states.
> This all makes the idle-injection muck hard depend on hres_active, but I
> think that's a sane constraint anyway.
It makes it actually depend on NO_HZ || HRES. It comes with a few
other restrictions as well: I'm not going to support that for TSCs
which stop in C3. It's ugly enough already and I really don't want to
muck with the broadcast device.
One other thing is that the caller has to ensure that the unblock()
happens in a timely manner. I really would recommend to just disable
preemption around that machine on fire wait loop and screw the RT
tasks. We screw them anyway when they depend on a timer.
Untested patch below.
Thanks,
tglx
8<----------------
Subject: tick/nohz: Allow blocking the tick to prevent fire
From: Thomas Gleixner <[email protected]>
Date: Fri, 20 Nov 2015 14:28:17 +0100
Signed-off-by: Thomas Gleixner <[email protected]>
---
include/linux/tick.h | 5 ++
kernel/time/Kconfig | 5 ++
kernel/time/tick-sched.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/time/tick-sched.h | 7 ++-
4 files changed, 114 insertions(+), 2 deletions(-)
Index: tip/include/linux/tick.h
===================================================================
--- tip.orig/include/linux/tick.h
+++ tip/include/linux/tick.h
@@ -203,4 +203,9 @@ static inline void tick_nohz_task_switch
__tick_nohz_task_switch();
}
+#ifdef CONFIG_TICK_THROTTLING
+int tick_nohz_block_tick(u64 delay);
+void tick_nohz_unblock_tick(void);
+#endif
+
#endif
Index: tip/kernel/time/Kconfig
===================================================================
--- tip.orig/kernel/time/Kconfig
+++ tip/kernel/time/Kconfig
@@ -60,6 +60,11 @@ menu "Timers subsystem"
config TICK_ONESHOT
bool
+# Special functions for tick throttling to avoid fire extinguishers
+config TICK_THROTTLING
+ bool
+ depends on TICK_ONESHOT
+
config NO_HZ_COMMON
bool
depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
Index: tip/kernel/time/tick-sched.c
===================================================================
--- tip.orig/kernel/time/tick-sched.c
+++ tip/kernel/time/tick-sched.c
@@ -1119,6 +1119,105 @@ void tick_setup_sched_timer(void)
}
#endif /* HIGH_RES_TIMERS */
+#ifdef CONFIG_TICK_THROTTLING
+/*
+ * An interrupt might have been pending already, when we programmed
+ * the throttler time. Nothing to do here. The device is armed and the
+ * interrupt will fire again. If this is the real wakeup event then
+ * the unblock call will retrigger it.
+ */
+static void tick_throttling_handler(struct clock_event_device *dev)
+{
+}
+
+static int tick_throttling_noop(unsigned long evt,
+ struct clock_event_device *d)
+{
+ return 0;
+}
+
+static struct clock_event_device tick_throttler = {
+ .name = "throttler",
+ .features = CLOCK_EVT_FEAT_ONESHOT,
+ .event_handler = tick_throttling_handler,
+ .set_next_event = tick_throttling_noop,
+ .mult = 1,
+ .shift = 0,
+ .irq = -1,
+};
+
+/**
+ * tick_nohz_block_tick - Force block the tick to prevent fire
+ * @delay: Defer the tick for X nano seconds
+ *
+ * This is a special interface for thermal emergencies. Do not use
+ * otherwise! Make sure to call tick_nohz_unblock_tick() right after
+ * the delay ends to undo the damage.
+ */
+int tick_nohz_block_tick(u64 delay)
+{
+ struct tick_device *td;
+ unsigned long flags;
+ int ret = -EBUSY;
+ ktime_t until;
+
+ if (!tick_nohz_active)
+ return -EBUSY;
+
+ local_irq_save(flags);
+ td = this_cpu_ptr(&tick_cpu_device);
+
+ /* No way to do that with broadcast */
+ if (td->evtdev->features & CLOCK_EVT_FEAT_C3STOP)
+ goto out;
+
+ /* Yes, I do not trust people! */
+ if (WARN_ON_ONCE(td->evtdev == &tick_throttler))
+ goto out;
+
+ if (delay > td->evtdev->max_delta_ticks) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ until = ktime_add_ns(ktime_get(), delay);
+ if (!tick_program_event(until, 0)) {
+ td->real_evtdev = td->evtdev;
+ td->evtdev = &tick_throttler;
+ ret = 0;
+ }
+out:
+ local_irq_restore(flags);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tick_nohz_block_tick);
+
+/**
+ * tick_nohz_unblock_tick - Undo the force blocking of the tick
+ *
+ * Pairs with tick_nohz_block_tick(). Can be called unconditionally
+ * even if the tick was not blocked by tick_nohz_block_tick().
+ */
+void tick_nohz_unblock_tick(void)
+{
+ struct tick_device *td;
+ unsigned long flags;
+
+ if (!tick_nohz_active)
+ return;
+
+ local_irq_save(flags);
+ td = this_cpu_ptr(&tick_cpu_device);
+ if (td->evtdev == &tick_throttler) {
+ td->evtdev = td->real_evtdev;
+ /* Force a timer interrupt now */
+ tick_program_event(ktime_get(), 1);
+ }
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(tick_nohz_unblock_tick);
+#endif /* TICK_THROTTLING */
+
#if defined CONFIG_NO_HZ_COMMON || defined CONFIG_HIGH_RES_TIMERS
void tick_cancel_sched_timer(int cpu)
{
Index: tip/kernel/time/tick-sched.h
===================================================================
--- tip.orig/kernel/time/tick-sched.h
+++ tip/kernel/time/tick-sched.h
@@ -9,8 +9,11 @@ enum tick_device_mode {
};
struct tick_device {
- struct clock_event_device *evtdev;
- enum tick_device_mode mode;
+ struct clock_event_device *evtdev;
+ enum tick_device_mode mode;
+#ifdef CONFIG_TICK_THROTTLING
+ struct clock_event_device *real_evtdev;
+#endif
};
enum tick_nohz_mode {
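For completeness, a sketch of what a caller might look like under the
assumptions of the patch above, following the earlier machine_on_fire
example. The function name forced_idle_window() and its details are
illustrative only and not part of the patch; a real implementation would
use the proper idle instruction rather than cpu_relax():

#include <linux/ktime.h>
#include <linux/preempt.h>
#include <linux/tick.h>

static void forced_idle_window(u64 delay_ns)
{
	ktime_t end;

	preempt_disable();
	if (!tick_nohz_block_tick(delay_ns)) {
		end = ktime_add_ns(ktime_get(), delay_ns);
		while (ktime_before(ktime_get(), end))
			cpu_relax();	/* placeholder for mwait() */
		/* undo the damage right after the delay ends */
		tick_nohz_unblock_tick();
	}
	preempt_enable();
}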
On Fri, Nov 13, 2015 at 11:53:06AM -0800, Jacob Pan wrote:
> With increasingly constrained power and thermal budget, it's often
> necessary to cap power via throttling. Throttling individual CPUs
> or devices at random times can help power capping but may not be
> optimal in terms of energy efficiency. Frequency scaling is also
> limited by certain range before losing energy efficiency.
>
> In general, the optimal solution in terms of energy efficiency is
> to align idle periods such that more shared circuits can be power
> gated to enter lower power states. Combined with energy efficient
> frequency point, idle injection provides a way to scale power and
> performance efficiently.
>
> This patch introduces a scheduler based idle injection method, it
> works by blocking CFS runqueue synchronously and periodically. The
> actions on all online CPUs are orchestrated by per CPU hrtimers.
>
> Two sysctl knobs are given to the userspace for selecting the
> percentage of idle time as well as the forced idle duration for each
> idle period injected.
>
> Since only CFS class is targeted, other high priority tasks are not
> affected, such as EDF and RT tasks as well as softirq and interrupts.
>
> Hotpath in CFS pick_next_task is optimized by Peter Zijlstra, where
> a new runnable flag is introduced to combine forced idle and
> nr_running.
>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> include/linux/sched.h | 11 ++
> include/linux/sched/sysctl.h | 5 +
> init/Kconfig | 10 ++
> kernel/sched/fair.c | 353 ++++++++++++++++++++++++++++++++++++++++++-
> kernel/sched/sched.h | 54 ++++++-
> kernel/sysctl.c | 21 +++
> 6 files changed, 449 insertions(+), 5 deletions(-)
I've tested this series on Juno (2xCortex-A57 4xCortex-A53). If you
idle inject for 50% of the time, when I run 6 busy loops the scheduler
sometimes keeps two of them on the same cpu while another cpu is
completely idle. Without idle injection the scheduler does the
sensible thing: put one busy loop on each CPU. I'm running systemd
and this only happens with CONFIG_SCHED_AUTOGROUP=y. If I unset
CONFIG_SCHED_AUTOGROUP, the tasks are spread across all cpus as usual.
See below the part of the trace that shows this problem. CPU3 has two
100% tasks, 1554 and 1549, but the scheduler never moves one of the
tasks to CPU4, which has an empty runqueue. Both cpus are in the same
domain. Juri helped me add two additional trace points to track the
load of a task and cpu. These tracepoints are added at the end of
update_load_avg().
<idle>-0 [002] 164.739796: sched_cfs_idle_inject_timer: throttled=0
<idle>-0 [000] 164.739797: sched_cfs_idle_inject_timer: throttled=0
<idle>-0 [005] 164.739797: sched_cfs_idle_inject_timer: throttled=0
<idle>-0 [001] 164.739797: sched_cfs_idle_inject_timer: throttled=0
<idle>-0 [003] 164.739797: sched_cfs_idle_inject_timer: throttled=0
<idle>-0 [004] 164.739798: sched_cfs_idle_inject_timer: throttled=0
<idle>-0 [002] 164.739802: sched_load_avg_cpu: cpu=2 load_avg=171 util_avg=406
<idle>-0 [002] 164.739803: sched_load_avg_task: comm=busy_loop pid=1552 cpu=2 load_avg=1006 util_avg=400 load_sum=48043453 util_sum=19130537 period_contrib=173
<idle>-0 [001] 164.739803: sched_load_avg_cpu: cpu=1 load_avg=170 util_avg=405
<idle>-0 [002] 164.739804: sched_load_avg_cpu: cpu=2 load_avg=1014 util_avg=403
<idle>-0 [001] 164.739804: sched_load_avg_task: comm=busy_loop pid=1551 cpu=1 load_avg=1008 util_avg=401 load_sum=48161276 util_sum=19177731 period_contrib=288
<idle>-0 [005] 164.739804: sched_load_avg_cpu: cpu=5 load_avg=169 util_avg=404
<idle>-0 [002] 164.739805: sched_switch: swapper/2:0 [120] R ==> busy_loop:1552 [120]
<idle>-0 [001] 164.739805: sched_load_avg_cpu: cpu=1 load_avg=1024 util_avg=407
<idle>-0 [003] 164.739805: sched_load_avg_cpu: cpu=3 load_avg=340 util_avg=405
<idle>-0 [000] 164.739805: sched_load_avg_cpu: cpu=0 load_avg=168 util_avg=400
<idle>-0 [001] 164.739806: sched_switch: swapper/1:0 [120] R ==> busy_loop:1551 [120]
<idle>-0 [005] 164.739806: sched_load_avg_task: comm=busy_loop pid=1550 cpu=5 load_avg=1010 util_avg=402 load_sum=48229881 util_sum=19205027 period_contrib=355
<idle>-0 [003] 164.739807: sched_load_avg_task: comm=busy_loop pid=1549 cpu=3 load_avg=1012 util_avg=193 load_sum=48316673 util_sum=9247244 period_contrib=441
<idle>-0 [000] 164.739807: sched_load_avg_task: comm=busy_loop pid=1553 cpu=0 load_avg=1005 util_avg=400 load_sum=48003551 util_sum=19119112 period_contrib=134
<idle>-0 [005] 164.739808: sched_load_avg_cpu: cpu=5 load_avg=1002 util_avg=399
<idle>-0 [003] 164.739808: sched_load_avg_cpu: cpu=3 load_avg=2045 util_avg=407
<idle>-0 [000] 164.739809: sched_load_avg_cpu: cpu=0 load_avg=1008 util_avg=401
<idle>-0 [005] 164.739810: sched_switch: swapper/5:0 [120] R ==> busy_loop:1550 [120]
<idle>-0 [003] 164.739810: sched_switch: swapper/3:0 [120] R ==> busy_loop:1549 [120]
<idle>-0 [000] 164.739811: sched_switch: swapper/0:0 [120] R ==> busy_loop:1553 [120]
busy_loop-1552 [002] 164.743793: sched_stat_runtime: comm=busy_loop pid=1552 runtime=3991560 [ns] vruntime=605432548 [ns]
busy_loop-1549 [003] 164.743794: sched_stat_runtime: comm=busy_loop pid=1549 runtime=3990040 [ns] vruntime=382380848 [ns]
busy_loop-1552 [002] 164.743794: sched_load_avg_task: comm=busy_loop pid=1552 cpu=2 load_avg=1024 util_avg=456 load_sum=48889883 util_sum=21796057 period_contrib=999
busy_loop-1553 [000] 164.743794: sched_stat_runtime: comm=busy_loop pid=1553 runtime=3990180 [ns] vruntime=590391894 [ns]
busy_loop-1551 [001] 164.743794: sched_stat_runtime: comm=busy_loop pid=1551 runtime=3992100 [ns] vruntime=272056341 [ns]
busy_loop-1550 [005] 164.743794: sched_stat_runtime: comm=busy_loop pid=1550 runtime=3990920 [ns] vruntime=198320034 [ns]
busy_loop-1552 [002] 164.743795: sched_load_avg_cpu: cpu=2 load_avg=1010 util_avg=450
busy_loop-1551 [001] 164.743796: sched_load_avg_task: comm=busy_loop pid=1551 cpu=1 load_avg=1004 util_avg=447 load_sum=47958941 util_sum=21380913 period_contrib=90
busy_loop-1549 [003] 164.743796: sched_load_avg_task: comm=busy_loop pid=1549 cpu=3 load_avg=1007 util_avg=257 load_sum=48112396 util_sum=12285572 period_contrib=241
busy_loop-1552 [002] 164.743796: sched_load_avg_cpu: cpu=2 load_avg=170 util_avg=453
busy_loop-1553 [000] 164.743796: sched_load_avg_task: comm=busy_loop pid=1553 cpu=0 load_avg=1023 util_avg=456 load_sum=48847931 util_sum=21780791 period_contrib=958
busy_loop-1551 [001] 164.743796: sched_load_avg_cpu: cpu=1 load_avg=1020 util_avg=454
busy_loop-1550 [005] 164.743797: sched_load_avg_task: comm=busy_loop pid=1550 cpu=5 load_avg=1005 util_avg=448 load_sum=48026522 util_sum=21410614 period_contrib=156
busy_loop-1549 [003] 164.743797: sched_load_avg_cpu: cpu=3 load_avg=2036 util_avg=454
busy_loop-1553 [000] 164.743798: sched_load_avg_cpu: cpu=0 load_avg=1004 util_avg=447
busy_loop-1551 [001] 164.743798: sched_load_avg_cpu: cpu=1 load_avg=169 util_avg=452
busy_loop-1550 [005] 164.743798: sched_load_avg_cpu: cpu=5 load_avg=1020 util_avg=455
busy_loop-1553 [000] 164.743800: sched_load_avg_cpu: cpu=0 load_avg=171 util_avg=456
busy_loop-1549 [003] 164.743800: sched_load_avg_cpu: cpu=3 load_avg=339 util_avg=452
busy_loop-1550 [005] 164.743800: sched_load_avg_cpu: cpu=5 load_avg=168 util_avg=450
busy_loop-1552 [002] 164.747792: sched_stat_runtime: comm=busy_loop pid=1552 runtime=3999320 [ns] vruntime=609431868 [ns]
busy_loop-1553 [000] 164.747793: sched_stat_runtime: comm=busy_loop pid=1553 runtime=3999380 [ns] vruntime=594391274 [ns]
busy_loop-1549 [003] 164.747793: sched_stat_runtime: comm=busy_loop pid=1549 runtime=3999540 [ns] vruntime=386380388 [ns]
busy_loop-1552 [002] 164.747794: sched_load_avg_task: comm=busy_loop pid=1552 cpu=2 load_avg=1019 util_avg=499 load_sum=48694671 util_sum=23849523 period_contrib=808
busy_loop-1551 [001] 164.747794: sched_stat_runtime: comm=busy_loop pid=1551 runtime=3999880 [ns] vruntime=276056221 [ns]
busy_loop-1550 [005] 164.747795: sched_stat_runtime: comm=busy_loop pid=1550 runtime=3999280 [ns] vruntime=202319314 [ns]
busy_loop-1552 [002] 164.747795: sched_load_avg_cpu: cpu=2 load_avg=1006 util_avg=492
busy_loop-1551 [001] 164.747795: sched_load_avg_task: comm=busy_loop pid=1551 cpu=1 load_avg=1022 util_avg=500 load_sum=48813533 util_sum=23907693 period_contrib=924
busy_loop-1553 [000] 164.747795: sched_load_avg_task: comm=busy_loop pid=1553 cpu=0 load_avg=1019 util_avg=499 load_sum=48652717 util_sum=23832040 period_contrib=767
busy_loop-1549 [003] 164.747796: sched_load_avg_task: comm=busy_loop pid=1549 cpu=3 load_avg=1003 util_avg=315 load_sum=47917292 util_sum=15063949 period_contrib=50
busy_loop-1551 [001] 164.747796: sched_load_avg_cpu: cpu=1 load_avg=1016 util_avg=497
busy_loop-1552 [002] 164.747796: sched_load_avg_cpu: cpu=2 load_avg=169 util_avg=496
busy_loop-1550 [005] 164.747797: sched_load_avg_task: comm=busy_loop pid=1550 cpu=5 load_avg=1023 util_avg=501 load_sum=48880090 util_sum=23938753 period_contrib=989
busy_loop-1553 [000] 164.747797: sched_load_avg_cpu: cpu=0 load_avg=1022 util_avg=500
busy_loop-1549 [003] 164.747797: sched_load_avg_cpu: cpu=3 load_avg=2028 util_avg=496
busy_loop-1551 [001] 164.747797: sched_load_avg_cpu: cpu=1 load_avg=169 util_avg=495
busy_loop-1550 [005] 164.747798: sched_load_avg_cpu: cpu=5 load_avg=1016 util_avg=497
busy_loop-1553 [000] 164.747799: sched_load_avg_cpu: cpu=0 load_avg=170 util_avg=499
busy_loop-1549 [003] 164.747800: sched_load_avg_cpu: cpu=3 load_avg=337 util_avg=494
busy_loop-1550 [005] 164.747800: sched_load_avg_cpu: cpu=5 load_avg=168 util_avg=492
busy_loop-1552 [002] 164.751792: sched_stat_runtime: comm=busy_loop pid=1552 runtime=4000260 [ns] vruntime=613432128 [ns]
busy_loop-1549 [003] 164.751793: sched_stat_runtime: comm=busy_loop pid=1549 runtime=3999760 [ns] vruntime=390380148 [ns]
busy_loop-1553 [000] 164.751793: sched_stat_runtime: comm=busy_loop pid=1553 runtime=3999920 [ns] vruntime=598391194 [ns]
busy_loop-1552 [002] 164.751793: sched_load_avg_task: comm=busy_loop pid=1552 cpu=2 load_avg=1015 util_avg=538 load_sum=48500452 util_sum=25717351 period_contrib=618
busy_loop-1550 [005] 164.751793: sched_stat_runtime: comm=busy_loop pid=1550 runtime=3999920 [ns] vruntime=206319234 [ns]
busy_loop-1552 [002] 164.751794: sched_load_avg_cpu: cpu=2 load_avg=1024 util_avg=542
busy_loop-1551 [001] 164.751794: sched_stat_runtime: comm=busy_loop pid=1551 runtime=4000120 [ns] vruntime=280056341 [ns]
busy_loop-1549 [003] 164.751795: sched_load_avg_task: comm=busy_loop pid=1549 cpu=3 load_avg=1021 util_avg=376 load_sum=48771927 util_sum=17985591 period_contrib=884
busy_loop-1553 [000] 164.751795: sched_load_avg_task: comm=busy_loop pid=1553 cpu=0 load_avg=1015 util_avg=538 load_sum=48458496 util_sum=25697835 period_contrib=577
busy_loop-1551 [001] 164.751795: sched_load_avg_task: comm=busy_loop pid=1551 cpu=1 load_avg=1018 util_avg=539 load_sum=48619308 util_sum=25780552 period_contrib=734
busy_loop-1550 [005] 164.751795: sched_load_avg_task: comm=busy_loop pid=1550 cpu=5 load_avg=1019 util_avg=540 load_sum=48685865 util_sum=25814558 period_contrib=799
busy_loop-1552 [002] 164.751796: sched_load_avg_cpu: cpu=2 load_avg=169 util_avg=535
busy_loop-1551 [001] 164.751796: sched_load_avg_cpu: cpu=1 load_avg=1011 util_avg=536
busy_loop-1553 [000] 164.751797: sched_load_avg_cpu: cpu=0 load_avg=1018 util_avg=539
busy_loop-1549 [003] 164.751797: sched_load_avg_cpu: cpu=3 load_avg=2020 util_avg=535
busy_loop-1550 [005] 164.751797: sched_load_avg_cpu: cpu=5 load_avg=1012 util_avg=536
busy_loop-1551 [001] 164.751797: sched_load_avg_cpu: cpu=1 load_avg=168 util_avg=533
busy_loop-1553 [000] 164.751799: sched_load_avg_cpu: cpu=0 load_avg=169 util_avg=538
busy_loop-1549 [003] 164.751799: sched_load_avg_cpu: cpu=3 load_avg=336 util_avg=533
busy_loop-1550 [005] 164.751800: sched_load_avg_cpu: cpu=5 load_avg=171 util_avg=543
busy_loop-1549 [003] 164.751807: sched_stat_runtime: comm=busy_loop pid=1549 runtime=13700 [ns] vruntime=390393848 [ns]
busy_loop-1549 [003] 164.751809: sched_load_avg_task: comm=busy_loop pid=1549 cpu=3 load_avg=1021 util_avg=376 load_sum=48785239 util_sum=17998903 period_contrib=897
busy_loop-1549 [003] 164.751811: sched_load_avg_cpu: cpu=3 load_avg=2020 util_avg=535
busy_loop-1549 [003] 164.751812: sched_load_avg_task: comm=busy_loop pid=1554 cpu=3 load_avg=1015 util_avg=163 load_sum=48472554 util_sum=7827475 period_contrib=593
busy_loop-1549 [003] 164.751814: sched_load_avg_cpu: cpu=3 load_avg=2020 util_avg=535
busy_loop-1549 [003] 164.751816: sched_switch: busy_loop:1549 [120] R ==> busy_loop:1554 [120]
busy_loop-1552 [002] 164.755792: sched_stat_runtime: comm=busy_loop pid=1552 runtime=3999800 [ns] vruntime=617431928 [ns]
busy_loop-1553 [000] 164.755793: sched_stat_runtime: comm=busy_loop pid=1553 runtime=3999880 [ns] vruntime=602391074 [ns]
busy_loop-1552 [002] 164.755793: sched_load_avg_task: comm=busy_loop pid=1552 cpu=2 load_avg=1011 util_avg=574 load_sum=48306205 util_sum=27414009 period_contrib=428
busy_loop-1550 [005] 164.755793: sched_stat_runtime: comm=busy_loop pid=1550 runtime=3999780 [ns] vruntime=210319014 [ns]
busy_loop-1554 [003] 164.755793: sched_stat_runtime: comm=busy_loop pid=1554 runtime=3986540 [ns] vruntime=382907621 [ns]
busy_loop-1552 [002] 164.755794: sched_load_avg_cpu: cpu=2 load_avg=1019 util_avg=578
busy_loop-1551 [001] 164.755794: sched_stat_runtime: comm=busy_loop pid=1551 runtime=3999860 [ns] vruntime=284056201 [ns]
busy_loop-1553 [000] 164.755795: sched_load_avg_task: comm=busy_loop pid=1553 cpu=0 load_avg=1010 util_avg=573 load_sum=48264247 util_sum=27392629 period_contrib=387
busy_loop-1551 [001] 164.755795: sched_load_avg_task: comm=busy_loop pid=1551 cpu=1 load_avg=1014 util_avg=575 load_sum=48425055 util_sum=27481823 period_contrib=544
busy_loop-1552 [002] 164.755795: sched_load_avg_cpu: cpu=2 load_avg=168 util_avg=570
busy_loop-1550 [005] 164.755795: sched_load_avg_task: comm=busy_loop pid=1550 cpu=5 load_avg=1015 util_avg=576 load_sum=48491612 util_sum=27518531 period_contrib=609
busy_loop-1554 [003] 164.755796: sched_load_avg_task: comm=busy_loop pid=1554 cpu=3 load_avg=1010 util_avg=230 load_sum=48265186 util_sum=10993484 period_contrib=390
busy_loop-1551 [001] 164.755796: sched_load_avg_cpu: cpu=1 load_avg=1007 util_avg=571
busy_loop-1553 [000] 164.755796: sched_load_avg_cpu: cpu=0 load_avg=1014 util_avg=575
busy_loop-1550 [005] 164.755797: sched_load_avg_cpu: cpu=5 load_avg=1008 util_avg=572
busy_loop-1554 [003] 164.755797: sched_load_avg_cpu: cpu=3 load_avg=2012 util_avg=571
busy_loop-1551 [001] 164.755797: sched_load_avg_cpu: cpu=1 load_avg=171 util_avg=581
busy_loop-1553 [000] 164.755799: sched_load_avg_cpu: cpu=0 load_avg=168 util_avg=574
busy_loop-1550 [005] 164.755799: sched_load_avg_cpu: cpu=5 load_avg=170 util_avg=579
busy_loop-1554 [003] 164.755799: sched_load_avg_cpu: cpu=3 load_avg=342 util_avg=581
busy_loop-1552 [002] 164.759791: sched_cfs_idle_inject_timer: throttled=1
busy_loop-1551 [001] 164.759791: sched_cfs_idle_inject_timer: throttled=1
busy_loop-1550 [005] 164.759792: sched_cfs_idle_inject_timer: throttled=1
busy_loop-1554 [003] 164.759792: sched_cfs_idle_inject_timer: throttled=1
busy_loop-1553 [000] 164.759792: sched_cfs_idle_inject_timer: throttled=1
Cheers,
Javi
On Thu, 19 Nov 2015 17:24:07 +0000
Morten Rasmussen <[email protected]> wrote:
> I could see this technique being useful within the SoC as well.
> Synchronized idle injection on all cpus in a cluster would allow us to
> reach cluster idle states where resources shared within the cluster
> can be gated or powered off as well. But yes, shutting down
> everything is more efficient if you are in serious trouble.
>
> I was hoping this could be a good alternative to hotplugging cpus out
> for thermal management in non-critical situations, but it isn't that
> appealing if you have to throttle all the cpus. I would consider it an
> emergency-only mechanism (as in emergency brake) that isn't really
> suitable for normal thermal management. In which case: Does this sort
> of mechanism belong in the scheduler code?
I would think this is not an emergency-only mechanism. I know shutting
down all CPUs at the same time sounds bad, but the advantage of having
this in the scheduler is that we can still provide a certain QoS,
e.g. we yield to softirq and RT. There is also other workload info
readily available in the scheduler to make better use of this knob,
e.g. syncing with other timers such as media playback.
Under normal thermal conditions, on Skylake for example, the most
efficient frequency is ~900MHz. If a thermal restriction kicks in
(e.g. you want to go fanless), would you rather lose energy efficiency
by lowering the frequency further to 500MHz, or do idle injection and
accept 5ms of latency for normal tasks? Idle injection gives you
near-linear performance-power scaling.
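To put a number on "near-linear", here is a minimal, illustrative
sketch of the duty-cycle power model (the 9.5W running / 0.2W idle
figures are made-up example values, not measurements from this
series), assuming deep idle power is nearly negligible compared to
running power:

#include <stdio.h>

/*
 * Illustrative duty-cycle model: when deep-idle power is ~0, both
 * average power and throughput track the running fraction.
 */
static double avg_power(double run_fraction, double p_run, double p_idle)
{
        return run_fraction * p_run + (1.0 - run_fraction) * p_idle;
}

int main(void)
{
        double p_run = 9.5, p_idle = 0.2;       /* watts, hypothetical */
        int pct;

        for (pct = 0; pct <= 50; pct += 10)
                printf("idle %2d%% -> ~%.2f W, ~%d%% of peak throughput\n",
                       pct, avg_power(1.0 - pct / 100.0, p_run, p_idle),
                       100 - pct);
        return 0;
}

With idle power close to zero, average power and throughput both fall
together with the running fraction, which is the near-linear scaling
claimed above.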
Thanks,
Jacob
On Mon, Nov 23, 2015 at 05:56:44PM +0000, Javi Merino wrote:
> I've tested this series on Juno (2xCortex-A57 4xCortex-A53). If you
> idle inject for 50% of the time, when I run 6 busy loops the scheduler
> sometimes keeps two of them on the same cpu while another cpu is
> completely idle. Without idle injection the scheduler does the
> sensible thing: put one busy loop in each CPU. I'm running systemd
> and this only happens with CONFIG_SCHED_AUTOGROUP=y. If I unset
> CONFIG_SCHED_AUTOGROUP, the tasks are spread across all cpus as usual.
That's not a plus for this patch though; but a bug report against
AUTOGROUP/cgroup muck, right?
On Mon, Nov 23, 2015 at 07:07:06PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 23, 2015 at 05:56:44PM +0000, Javi Merino wrote:
> > I've tested this series on Juno (2xCortex-A57 4xCortex-A53). If you
> > idle inject for 50% of the time, when I run 6 busy loops the scheduler
> > sometimes keeps two of them on the same cpu while another cpu is
> > completely idle. Without idle injection the scheduler does the
> > sensible thing: put one busy loop in each CPU. I'm running systemd
> > and this only happens with CONFIG_SCHED_AUTOGROUP=y. If I unset
> > CONFIG_SCHED_AUTOGROUP, the tasks are spread across all cpus as usual.
>
> That's not a plus for this patch though; but a bug report against
> AUTOGROUP/cgroup muck, right?
The bug only happens when you apply this series and you set the system
to idle inject 50% of the time. SCHED_AUTOGROUP is already part of
the kernel and behaves properly with the kernel as is. I think that
this patch should not introduce new bugs.
Cheers,
Javi
On Tue, Nov 24, 2015 at 09:12:40AM +0000, Javi Merino wrote:
> On Mon, Nov 23, 2015 at 07:07:06PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 23, 2015 at 05:56:44PM +0000, Javi Merino wrote:
> > > I've tested this series on Juno (2xCortex-A57 4xCortex-A53). If you
> > > idle inject for 50% of the time, when I run 6 busy loops the scheduler
> > > sometimes keeps two of them on the same cpu while another cpu is
> > > completely idle. Without idle injection the scheduler does the
> > > sensible thing: put one busy loop in each CPU. I'm running systemd
> > > and this only happens with CONFIG_SCHED_AUTOGROUP=y. If I unset
> > > CONFIG_SCHED_AUTOGROUP, the tasks are spread across all cpus as usual.
> >
> > That's not a plus for this patch though; but a bug report against
> > AUTOGROUP/cgroup muck, right?
>
> The bug only happens when you apply this series and you set the system
> to idle inject 50% of the time. SCHED_AUTOGROUP is already part of
> the kernel and behaves properly with the kernel as is. I think that
> this patch should not introduce new bugs.
Ah, then I misunderstood your email, agreed.
On Tue, 24 Nov 2015 11:09:56 +0100
Peter Zijlstra <[email protected]> wrote:
> On Tue, Nov 24, 2015 at 09:12:40AM +0000, Javi Merino wrote:
> > On Mon, Nov 23, 2015 at 07:07:06PM +0100, Peter Zijlstra wrote:
> > > On Mon, Nov 23, 2015 at 05:56:44PM +0000, Javi Merino wrote:
> > > > I've tested this series on Juno (2xCortex-A57 4xCortex-A53).
> > > > If you idle inject for 50% of the time, when I run 6 busy loops
> > > > the scheduler sometimes keeps two of them on the same cpu while
> > > > another cpu is completely idle. Without idle injection the
> > > > scheduler does the sensible thing: put one busy loop in each
> > > > CPU. I'm running systemd and this only happens with
> > > > CONFIG_SCHED_AUTOGROUP=y. If I unset CONFIG_SCHED_AUTOGROUP,
> > > > the tasks are spread across all cpus as usual.
> > >
> > > That's not a plus for this patch though; but a bug report against
> > > AUTOGROUP/cgroup muck, right?
> >
> > The bug only happens when you apply this series and you set the
> > system to idle inject 50% of the time. SCHED_AUTOGROUP is already
> > part of the kernel and behaves properly with the kernel as is. I
> > think that this patch should not introduce new bugs.
>
> Ah, then I misunderstood your email, agreed.
First of all, thanks for testing.
Just trying to reproduce this. So let me understand your setup:
- 8 cores in total?
- you first set 50% idle
- then launch 6 busy loops
How often do you see this happen?
On Fri, 20 Nov 2015 11:58:12 +0100 (CET)
Thomas Gleixner <[email protected]> wrote:
> if (machine_on_fire) {
Just to add data to show this is not just for emergencies. I have a
Broadwell laptop; here is the test I did:
- clamp the cpu frequency to 900MHz (close to max efficiency)
- run a gpu and cpu load (cut-the-rope, glxgears)
- turn the fans off
I ran tmon (tools/thermal/tmon) to show temperatures and thermal trip
points (sorry about the long lines, quite a few thermal sensors), and
also ran turbostat to show CPU idle residency etc.
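For reference, a minimal sketch of how the injection percentage used
below can be set from userspace. kernel.sched_cfs_idle_inject_pct is
the sysctl mentioned later in this thread; the /proc/sys path is just
the usual mapping of that name, so treat the exact path as an
assumption about this series (and it needs root):

#include <stdio.h>
#include <stdlib.h>

/* Write the CFS idle injection percentage to the series' sysctl knob.
 * Path assumed from the sysctl name kernel.sched_cfs_idle_inject_pct. */
static int set_idle_inject_pct(int pct)
{
        FILE *f = fopen("/proc/sys/kernel/sched_cfs_idle_inject_pct", "w");

        if (!f)
                return -1;
        fprintf(f, "%d\n", pct);
        return fclose(f);
}

int main(int argc, char **argv)
{
        int pct = argc > 1 ? atoi(argv[1]) : 20;        /* 20% as in this test */

        if (set_idle_inject_pct(pct)) {
                perror("sched_cfs_idle_inject_pct");
                return 1;
        }
        return 0;
}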
Observations:
without idle injection:
------------------------------------------------------------------
- package temp rises to 66C, crossing the second active trip point
  (the 2nd fan speed would kick in if the fans were not disabled)
- package power 9.47W
- cpu c0 ~90%, 0% core c7
with 20% idle injection:
------------------------------------------------------------------
- cpu c0 ~67%, ~13% core c7, ~7% package c-states
- package power 4.53W
- package temp 52C
glxgears maintains the same frame rate; I didn't measure cut-the-rope
but it still plays fine.
This is a large 14 inch laptop with much better cooling than a fanless
tablet/convertible, so hitting thermal constraints can happen under
_NORMAL_ conditions (even at 900MHz), not just in the on-fire case.
Synchronized idle can improve energy efficiency, and I think the
scheduler is a better place to solve that problem; perhaps combined
with a PM QoS constraint it will be less intrusive in practice.
Thanks,
Jacob
tmon snapshot (fans forced off):

Cooling devices (ID, device, cur/max state):
00  TFN1          4 / 11
01  Fan           0 / 1
02  Fan           0 / 1
03  Fan           0 / 1
04  Fan           0 / 1
05  CHRG          4 / 4
06  intel_powerc -1 / 50

Thermal zones (current temperature in C, trip points as shown by tmon):
acpitz  0:  66   A55 A65 A75 A85 C111
pch_wi  1:  59   C115
INT340  2:  20
SEN1    3:  47   A58 A63 P71 H100103
SEN2    4:  51   A63 P71 H100103
SEN3    5:  46   P71 H100103
SEN4    6:  45   A58 A63 P71 H100103
SEN6    7:  46   A48 A53 A58 A63 P71 H100103
SEN7    8:  27   A63 P71 H100103
B0D4    9:  66   A55 A65 A75 A85 P102045
x86_pk 10:  66
turbostat without idle injection
Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 Pkg%pc8 Pkg%pc9 Pk%pc10 PkgWatt CorWatt GFXWatt
- - 801 89.16 898 2295 0 9.77 0.55 0.07 0.44 66 70 0.00 0.00 0.00 0.00 0.00 0.00 0.00 9.47 4.31 1.17
0 0 801 89.26 898 2294 0 9.76 0.57 0.07 0.34 64 70 0.00 0.00 0.00 0.00 0.00 0.00 0.00 9.47 4.31 1.17
0 2 803 89.43 898 2294 0 9.58
1 1 810 90.27 898 2295 0 8.59 0.52 0.07 0.55 66
1 3 787 87.69 898 2295 0 11.17
turbostat with 20% idle injection
Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 Pkg%pc8 Pkg%pc9 Pk%pc10 PkgWatt CorWatt GFXWatt
- - 601 66.92 898 2295 0 13.48 4.65 1.09 13.86 52 52 3.50 0.60 0.46 2.30 0.00 0.00 0.00 4.53 1.99 0.17
0 0 605 67.36 898 2294 0 12.75 5.95 1.51 12.42 51 52 3.50 0.60 0.46 2.30 0.00 0.00 0.00 4.53 1.99 0.17
0 2 583 64.94 898 2295 0 15.18
1 1 605 67.37 898 2295 0 13.31 3.34 0.67 15.30 52
1 3 611 68.02 898 2295 0 12.66
> defer_timer_interrupt(5ms);
> end = now + 5ms;
> while (now < end)
> mwait();
> }
[Jacob Pan]
On Tue, Nov 24, 2015 at 03:10:59AM -0800, Jacob Pan wrote:
> On Tue, 24 Nov 2015 11:09:56 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Tue, Nov 24, 2015 at 09:12:40AM +0000, Javi Merino wrote:
> > > On Mon, Nov 23, 2015 at 07:07:06PM +0100, Peter Zijlstra wrote:
> > > > On Mon, Nov 23, 2015 at 05:56:44PM +0000, Javi Merino wrote:
> > > > > I've tested this series on Juno (2xCortex-A57 4xCortex-A53).
> > > > > If you idle inject for 50% of the time, when I run 6 busy loops
> > > > > the scheduler sometimes keeps two of them on the same cpu while
> > > > > another cpu is completely idle. Without idle injection the
> > > > > scheduler does the sensible thing: put one busy loop in each
> > > > > CPU. I'm running systemd and this only happens with
> > > > > CONFIG_SCHED_AUTOGROUP=y. If I unset CONFIG_SCHED_AUTOGROUP,
> > > > > the tasks are spread across all cpus as usual.
> > > >
> > > > That's not a plus for this patch though; but a bug report against
> > > > AUTOGROUP/cgroup muck, right?
> > >
> > > The bug only happens when you apply this series and you set the
> > > system to idle inject 50% of the time. SCHED_AUTOGROUP is already
> > > part of the kernel and behaves properly with the kernel as is. I
> > > think that this patch should not introduce new bugs.
> >
> > Ah, then I misunderstood your email, agreed.
> First of all, thanks for testing.
> Just trying to reproduce this. So let me understand your setup:
> - 8 cores in total?
- 6 cores in total
> - you first set 50% idle
> - then launch 6 busy loops
Correct
> How often do you see this happen?
About 1 in 10. Sometimes, even though it ends up doing the right
thing, it takes some time (>500ms) to move things around. This never
happens with idle injection disabled (kernel.sched_cfs_idle_inject_pct
= 0).
Admittedly, the situation is a lot better than with 4.3. When I first
tried this series on top of v4.3-rc6, the scheduler consistently
misbehaved with 50% idle injection, leaving up to four 100% busy tasks
on one cpu while three cpus remained completely idle.
Cheers,
Javi
On Tue, 24 Nov 2015 12:00:07 +0000
Javi Merino <[email protected]> wrote:
> > - you first set 50% idle
> > - then launch 6 busy loops
>
> Correct
Do you mind sharing the workload or test scripts? Just to reproduce
here.
Thanks,
Jacob
On Tue, Nov 24, 2015 at 10:22:07AM -0800, Jacob Pan wrote:
> On Tue, 24 Nov 2015 12:00:07 +0000
> Javi Merino <[email protected]> wrote:
>
> > > - you first set 50% idle
> > > - then launch 6 busy loops
> >
> > Correct
> Do you mind sharing the workload or test scripts? Just to reproduce
> here.
I'm using rt-app[0] with the attached config. To run the test I use
workload-automation[1], but I don't think you need workload automation.
[0] https://git.linaro.org/power/rt-app.git
[1] https://github.com/ARM-software/workload-automation/
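If you just want a quick stand-in without rt-app, a trivial spin loop
per cpu is enough to approximate the busy_loop tasks seen in the trace
above. A minimal sketch (illustrative only, not the rt-app config used
for these results; build with -pthread):

#include <pthread.h>
#include <stdlib.h>

/* Spin forever; launching N of these (e.g. 6) mimics the busy_loop
 * tasks in the trace. */
static void *busy_loop(void *arg)
{
        volatile unsigned long x = 0;

        for (;;)
                x++;
        return NULL;
}

int main(int argc, char **argv)
{
        int i, n = argc > 1 ? atoi(argv[1]) : 6;
        pthread_t tid;

        for (i = 0; i < n - 1; i++)
                pthread_create(&tid, NULL, busy_loop, NULL);
        busy_loop(NULL);        /* main thread spins as the Nth loop */
        return 0;
}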
* Jacob Pan <[email protected]> wrote:
> On Wed, 18 Nov 2015 09:36:22 +0100
> Ingo Molnar <[email protected]> wrote:
>
> > >
> > > This patch introduces a scheduler-based idle injection method; it
> > > works by blocking the CFS runqueue synchronously and periodically. The
> > > actions on all online CPUs are orchestrated by per CPU hrtimers.
> > >
> > > Two sysctl knobs are given to the userspace for selecting the
> > > percentage of idle time as well as the forced idle duration for each
> > > idle period injected.
> >
> > What's the purpose of these knobs? Just testing, or will some
> > user-space daemon set them dynamically?
> >
> yes, it is to be used by a userspace daemon such as the thermal daemon.
> Though there is interest from an in-kernel thermal governor, that is
> another story.
Yeah, so let me make this very clear: for a kernel scheduling feature to be
self-sufficient is not 'another story', but a must-have aspect for this feature
to become upstream acceptable.
We don't add scheduler features that rely on pushing 'policy' to user-space.
That's poor design with many disadvantages. This feature should offer a reasonable
and automatic in-kernel default behavior with numbers that prove that it works.
Keeping an essential part of the feature in user-space earns a NAK from me.
Thanks,
Ingo
* Peter Zijlstra <[email protected]> wrote:
> On Wed, Nov 18, 2015 at 09:36:22AM +0100, Ingo Molnar wrote:
> > What will such throttling do to latencies, as observed by user-space tasks? What's
> > the typical expected frequency of the throttling that you are targeting?
>
> The default has 5ms (iirc) of forced idle, so depending on what you do,
> noticeable to outright painful.
That's absolutely painful!
Thanks,
Ingo
On Fri, 27 Nov 2015 10:17:02 +0100
Ingo Molnar <[email protected]> wrote:
> > > > Two sysctl knobs are given to the userspace for selecting the
> > > > percentage of idle time as well as the forced idle duration for
> > > > each idle period injected.
> > >
> > > What's the purpose of these knobs? Just testing, or will some
> > > user-space daemon set them dynamically?
> > >
> > yes, it is to be used by a userspace daemon such as the thermal daemon.
> > Though there is interest from an in-kernel thermal governor, that
> > is another story.
>
> Yeah, so let me make this very clear: for a kernel scheduling feature
> to be self-sufficient is not 'another story', but a must-have aspect
> for this feature to become upstream acceptable.
>
> We don't add scheduler features that rely on pushing 'policy' to
> user-space. That's poor design with many disadvantages. This feature
> should offer a reasonable and automatic in-kernel default behavior
> with numbers that prove that it works.
>
Sorry about the late response, I have been sick all this time.
My intention here is two-fold, in two steps:
1. for systems under a thermal/power limit that still want to operate
at the optimal energy efficiency point
2. synchronize idle time for better energy efficiency (identify
suitable workloads at runtime)
This patchset is intended for #1 and lays the groundwork for #2.
The knobs in #1 can be used by in-kernel thermal-aware scheduling, as
some early discussion pointed out, thus not pushing policy to userspace.
For #2, my plan is to have a PM QoS-like knob with a sensible default
to specify the scheduler latency tolerance w.r.t. idle synchronization.
Similar to timer slack (where a default 5ms slack is used), we can give
a default tolerance in the form of two parameters:
- forced_idle_duration (default 5ms)
- forced_to_natural_idle_ratio (default 50%)
e.g. when playing an online game with 40% natural idle time, the idle
injection code will aim for 20% synchronized idle time.
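A minimal sketch of how these proposed defaults would translate into
an injection target (the parameter names are only the proposals above,
not an existing interface):

#include <stdio.h>

/* Proposed (not yet implemented) policy sketch: scale the injected,
 * synchronized idle with the workload's natural idle time. */
#define FORCED_IDLE_DURATION_MS         5       /* default 5ms per injection period */
#define FORCED_TO_NATURAL_IDLE_RATIO    50      /* default 50% */

static int injected_idle_pct(int natural_idle_pct)
{
        return natural_idle_pct * FORCED_TO_NATURAL_IDLE_RATIO / 100;
}

int main(void)
{
        /* e.g. an online game idling 40% of the time -> 20% injected */
        printf("natural idle 40%% -> inject %d%% in %dms chunks\n",
               injected_idle_pct(40), FORCED_IDLE_DURATION_MS);
        return 0;
}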
> Keeping an essential part of the feature in user-space earns a NAK
> from me.
I agree this should be self-contained. My intention is to have the
essential part used in the kernel by thermal-aware scheduling or PM QoS,
with a clear, intuitive mode of operation.