LinuxLists.cc - [RFC PATCH 0/3] CFS idle injection

2015-11-03 00:12:16

Subject: [RFC PATCH 0/3] CFS idle injection

Hi Peter and all,

A while ago, we had discussion about how powerclamp is broken in the
sense of turning off idle ticks in the forced idle period.
https://lkml.org/lkml/2014/12/18/369

It was suggested to replace the current kthread play idle loop with a
timer based runqueue throttling scheme. I finally got around to implement
this and code is much simpler. I also have good test results in terms of
efficiency, scalability, etc.
http://events.linuxfoundation.org/sites/events/files/slides/LinuxCon_Japan_2015_idle_injection1_0.pdf
slide #18+ shows the data on client and server.

I have two choices for this code:
1) be part of existing powerclamp driver but require exporting some
sched APIs.
2) be part of sched since the genernal rule applies when it comes down
to sycnhronized idle time for best power savings.

The patches below are for #2. There is a known problem with LOW RES timer
mode that I am working on. But I am hoping to get review earlier.

We are entering a very power limited environment on client side, frequency
scaling can only be efficient at certain range. e.g. on SKL, upto ~900MHz,
anything below, it is increasingly more efficient to do C-states insertion
if coordinated.

Looking forward, there are use case beyond thermal/power capping. I think
we can consolidate ballanced partial busy workload that are evenly
distributed among CPUs.

Please let me know what you think.

Thanks,

Jacob Pan (3):
ktime: add a roundup function
timer: relax tick stop in idle entry
sched: introduce synchronized idle injection

include/linux/ktime.h | 10 ++
include/linux/sched.h | 12 ++
include/linux/sched/sysctl.h | 5 +
include/trace/events/sched.h | 23 +++
init/Kconfig | 8 +
kernel/sched/fair.c | 345 +++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 3 +
kernel/sysctl.c | 20 +++
kernel/time/tick-sched.c | 2 +-
9 files changed, 427 insertions(+), 1 deletion(-)

--
1.9.1

2015-11-03 00:11:27

by Jacob Pan

[permalink] [raw]

Subject: [RFC PATCH 1/3] ktime: add a roundup function

ktime roundup function can be used to keep timer aligned and
prevent drift for recurring timeouts.

Signed-off-by: Jacob Pan <[email protected]>
---
include/linux/ktime.h | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/include/linux/ktime.h b/include/linux/ktime.h
index 2b6a204..2e293fa 100644
--- a/include/linux/ktime.h
+++ b/include/linux/ktime.h
@@ -233,6 +233,16 @@ static inline ktime_t ktime_sub_us(const ktime_t kt, const u64 usec)

extern ktime_t ktime_add_safe(const ktime_t lhs, const ktime_t rhs);

+static inline ktime_t ktime_roundup(ktime_t x, ktime_t y)
+{
+ u64 temp_tv64;
+
+ temp_tv64 = x.tv64 + y.tv64 - 1;
+ temp_tv64 = div64_u64(temp_tv64, y.tv64);
+ x.tv64 = temp_tv64 * y.tv64;
+
+ return x;
+}
/**
* ktime_to_timespec_cond - convert a ktime_t variable to timespec
* format only if the variable contains data
--
1.9.1

2015-11-03 00:11:41

by Jacob Pan

[permalink] [raw]

Subject: [RFC PATCH 2/3] timer: relax tick stop in idle entry

Upon entering idle, we can turn off tick if the next timeout
is exactly one tick away. Otherwise, we could enter inner idle loop
with tick still enabled, without resched set, the tick will continue
during idle therefore less optimal in terms of energy savings.

Signed-off-by: Jacob Pan <[email protected]>
---
kernel/time/tick-sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 7c7ec45..bcadaab 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -606,7 +606,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
* restart it proper.
*/
delta = next_tick - basemono;
- if (delta <= (u64)TICK_NSEC) {
+ if (delta < (u64)TICK_NSEC) {
tick.tv64 = 0;
if (!ts->tick_stopped)
goto out;
--
1.9.1

2015-11-03 00:11:38

by Jacob Pan

[permalink] [raw]

Subject: [RFC PATCH 3/3] sched: introduce synchronized idle injection

With increasingly constrained power and thermal budget, it's often
necessary to cap power via throttling. Throttling individual CPUs
or devices at random times can help power capping but may not be
optimal in terms of energy efficiency.

In general, the optimal solution in terms of energy efficiency is
to align idle periods such that more shared circuits can be power
gated to enter lower power states.

This patch introduces a scheduler based idle injection method, it
works by blocking CFS runqueue synchronously and periodically. The
actions on all online CPUs are orchestrated by per CPU hrtimers.

Two sysctl knobs are given to the userspce for selecting the
percentage of idle time as well as the forced idle duration for each
idle period injected.

Since only CFS class is targeted, other high priority tasks are not
affected, such as RT and interrupts.

Signed-off-by: Jacob Pan <[email protected]>
---
include/linux/sched.h | 12 ++
include/linux/sched/sysctl.h | 5 +
include/trace/events/sched.h | 23 +++
init/Kconfig | 8 +
kernel/sched/fair.c | 345 +++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 3 +
kernel/sysctl.c | 20 +++
7 files changed, 416 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b7b9501..ae8be25 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3181,3 +3181,15 @@ static inline unsigned long rlimit_max(unsigned int limit)
}

#endif
+
+#ifdef CONFIG_CFS_IDLE_INJECT
+extern int proc_sched_cfs_idle_inject_pct_handler(struct ctl_table *table,
+ int write,
+ void __user *buffer,
+ size_t *length, loff_t *ppos);
+extern int proc_sched_cfs_idle_inject_duration_handler(struct ctl_table *table,
+ int write,
+ void __user *buffer,
+ size_t *length, loff_t *ppos);
+
+#endif
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index c9e4731..d32da45 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -81,6 +81,11 @@ extern unsigned int sysctl_sched_cfs_bandwidth_slice;
extern unsigned int sysctl_sched_autogroup_enabled;
#endif

+#ifdef CONFIG_CFS_IDLE_INJECT
+extern unsigned int sysctl_sched_cfs_idle_inject_pct;
+extern unsigned int sysctl_sched_cfs_idle_inject_duration;
+#endif
+
extern int sched_rr_timeslice;

extern int sched_rr_handler(struct ctl_table *table, int write,
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 539d6bc..0490e8b 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -566,6 +566,29 @@ TRACE_EVENT(sched_wake_idle_without_ipi,

TP_printk("cpu=%d", __entry->cpu)
);
+
+/*
+ * Tracepoint for idle injection
+ */
+TRACE_EVENT(sched_cfs_idle_inject,
+
+ TP_PROTO(int cpu, int throttled),
+
+ TP_ARGS(cpu, throttled),
+
+ TP_STRUCT__entry(
+ __field(int, cpu)
+ __field(int, throttled)
+ ),
+
+ TP_fast_assign(
+ __entry->cpu = cpu;
+ __entry->throttled = throttled;
+ ),
+
+ TP_printk("cpu=%d throttled=%d", __entry->cpu, __entry->throttled)
+);
+
#endif /* _TRACE_SCHED_H */

/* This part must be outside protection */
diff --git a/init/Kconfig b/init/Kconfig
index c24b6f7..1f2960a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1086,6 +1086,14 @@ menuconfig CGROUP_SCHED
bandwidth allocation to such task groups. It uses cgroups to group
tasks.

+config CFS_IDLE_INJECT
+ bool "Synchronized CFS idle injection"
+ default n
+ help
+ This option allows users to inject synchronized idle time across all CPUs.
+ The feature will align idle time such that the entire CPU package can be duty
+ cycled by going into the deepest/lowest power states.
+
if CGROUP_SCHED
config FAIR_GROUP_SCHED
bool "Group scheduling for SCHED_OTHER"
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e2e348..18b27c6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -30,6 +30,7 @@
#include <linux/mempolicy.h>
#include <linux/migrate.h>
#include <linux/task_work.h>
+#include <linux/suspend.h>

#include <trace/events/sched.h>

@@ -114,6 +115,17 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
#endif

+/*
+ * Knobs for controlling percentage of time when idle is forced across all
+ * CPUs. This is a power management feature intended for achieving deepest
+ * and broadest idle without lower CPU frequencies to less optimal level.
+ * No action is taken if CPUs are natually idle.
+ */
+#ifdef CONFIG_CFS_IDLE_INJECT
+unsigned int sysctl_sched_cfs_idle_inject_pct;
+unsigned int sysctl_sched_cfs_idle_inject_duration = 10UL;
+#endif
+
static inline void update_load_add(struct load_weight *lw, unsigned long inc)
{
lw->weight += inc;
@@ -5136,6 +5148,16 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
struct task_struct *p;
int new_tasks;

+#ifdef CONFIG_CFS_IDLE_INJECT
+ if (cfs_rq->force_throttled &&
+ !idle_cpu(cpu_of(rq)) &&
+ !unlikely(local_softirq_pending())) {
+ /* forced idle, pick no task */
+ trace_sched_cfs_idle_inject(cpu_of(rq), 1);
+ update_curr(cfs_rq);
+ return NULL;
+ }
+#endif
again:
#ifdef CONFIG_FAIR_GROUP_SCHED
if (!cfs_rq->nr_running)
@@ -8317,3 +8339,326 @@ __init void init_sched_fair_class(void)
#endif /* SMP */

}
+
+#ifdef CONFIG_CFS_IDLE_INJECT
+static atomic_t idle_inject_active;
+static DEFINE_PER_CPU(struct hrtimer, idle_inject_timer);
+static DEFINE_PER_CPU(bool, idle_injected);
+/* protect injection parameters from runtime changes */
+static DEFINE_SPINLOCK(idle_inject_lock);
+
+/* Track which CPUs are being injected with idle period */
+static unsigned long *idle_inject_cpumask;
+
+/* Default idle injection duration for each period. */
+#define DEFAULT_DURATION_MSECS (10)
+
+static unsigned int duration; /* idle inject duration in msec. */
+static unsigned int inject_interval; /* idle inject duration in msec. */
+static unsigned int idle_pct; /* percentage of time idle is forced */
+/* starting reference time for all CPUs to align idle period */
+static ktime_t inject_start_time;
+static int prepare_idle_inject(void);
+
+static void throttle_rq(int cpu, int throttle)
+{
+ struct rq *rq = cpu_rq(cpu);
+
+ rq->cfs.force_throttled = throttle;
+}
+
+static enum hrtimer_restart idle_inject_timer_fn(struct hrtimer *hrtimer)
+{
+ int cpu = smp_processor_id();
+ struct hrtimer *hrt = this_cpu_ptr(&idle_inject_timer);
+ ktime_t now, delta, period;
+ bool status;
+
+ now = hrtimer_cb_get_time(hrt);
+
+ status = raw_cpu_read(idle_injected);
+ if (status) {
+ /*
+ * We were injecting idle in the last phase, let's forward the
+ * timer to the next period
+ *
+ * status: 1 0 1 0
+ * ____ ____________________ _______
+ * |________| |_________|
+ *
+ * |duration| interval |
+ *
+ * ^ we are here
+ * forward to here: ^
+ */
+ delta = ktime_sub(now, inject_start_time);
+ period = ktime_add(ms_to_ktime(duration),
+ ms_to_ktime(inject_interval));
+ delta = ktime_roundup(delta, period);
+ hrtimer_set_expires(hrt, ktime_add(delta, inject_start_time));
+ } else {
+ /*
+ * We were not injecting idle in the last phase, let's forward
+ * timer after forced idle duration
+ * ____ ____________________ _______
+ * |________| |_________|
+ *
+ * |duration| interval |
+ *
+ * ^ we are here
+ * ^ forward timer to here
+ */
+ hrtimer_set_expires(hrt, ktime_add(ms_to_ktime(duration), now));
+ }
+ raw_cpu_write(idle_injected, !status);
+ throttle_rq(cpu, !status);
+ trace_sched_cfs_idle_inject(cpu, status);
+ /* call scheduler to check rq */
+ resched_cpu(cpu);
+
+ return HRTIMER_RESTART;
+}
+
+static void idle_inject_timer_start(void *info)
+{
+ int cpu = smp_processor_id();
+ struct hrtimer *hrt = this_cpu_ptr(&idle_inject_timer);
+
+ this_cpu_write(idle_injected, true);
+ set_bit(cpu, idle_inject_cpumask);
+ hrtimer_start(hrt, ms_to_ktime(duration), HRTIMER_MODE_ABS_PINNED);
+ hrtimer_set_expires(hrt, *(ktime_t *)info);
+}
+
+static int start_idle_inject(void)
+{
+ int ret;
+ ktime_t now = ktime_get();
+
+ if (!atomic_read(&idle_inject_active)) {
+ /* called once per activation of idle injection */
+ ret = prepare_idle_inject();
+ if (ret)
+ return ret;
+ }
+ /* prevent cpu hotplug */
+ get_online_cpus();
+
+ /* set a future time to let all per cpu timers expires the same time */
+ now = ktime_roundup(now, ms_to_ktime(duration));
+
+ /* start one timer per online cpu */
+ inject_start_time = now;
+ on_each_cpu(idle_inject_timer_start, &now, 1);
+ atomic_set(&idle_inject_active, 1);
+
+ put_online_cpus();
+
+ return 0;
+}
+
+static void stop_idle_inject(void)
+{
+ int i;
+ struct hrtimer *hrt;
+
+ if (bitmap_weight(idle_inject_cpumask, num_possible_cpus())) {
+ for_each_set_bit(i, idle_inject_cpumask, num_possible_cpus()) {
+ hrt = &per_cpu(idle_inject_timer, i);
+ hrtimer_cancel(hrt);
+ throttle_rq(i, 0);
+ }
+ }
+}
+
+static int idle_inject_cpu_callback(struct notifier_block *nfb,
+ unsigned long action, void *hcpu)
+{
+ unsigned long cpu = (unsigned long)hcpu;
+ struct hrtimer *hrt = &per_cpu(idle_inject_timer, cpu);
+ ktime_t now, delta, period;
+
+ if (!atomic_read(&idle_inject_active))
+ goto exit_ok;
+
+ switch (action) {
+ case CPU_STARTING:
+ raw_cpu_write(idle_injected, true);
+
+ hrtimer_init(hrt, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+ hrt->function = idle_inject_timer_fn;
+ set_bit(cpu, idle_inject_cpumask);
+
+ now = hrtimer_cb_get_time(hrt);
+ hrtimer_start(hrt, ms_to_ktime(duration),
+ HRTIMER_MODE_ABS_PINNED);
+ /*
+ * When a new CPU comes online, we need to make sure it aligns
+ * its phase with the rest of the CPUs. So we set the
+ * timer to the next period based on the common starting time,
+ * then start injecting idle time.
+ */
+ spin_lock_irq(&idle_inject_lock);
+ delta = ktime_sub(now, inject_start_time);
+ period = ktime_add(ms_to_ktime(duration),
+ ms_to_ktime(inject_interval));
+ delta = ktime_roundup(delta, period);
+ spin_unlock_irq(&idle_inject_lock);
+ hrtimer_set_expires(hrt, ktime_add(delta, inject_start_time));
+ break;
+ case CPU_DYING:
+ clear_bit(cpu, idle_inject_cpumask);
+ hrtimer_cancel(hrt);
+ raw_cpu_write(idle_injected, false);
+ throttle_rq(cpu, 0);
+ break;
+ default:
+ return NOTIFY_DONE;
+ }
+exit_ok:
+ return NOTIFY_OK;
+}
+
+static int idle_inject_pm_callback(struct notifier_block *self,
+ unsigned long action, void *hcpu)
+{
+ switch (action) {
+ case PM_HIBERNATION_PREPARE:
+ case PM_SUSPEND_PREPARE:
+ if (atomic_read(&idle_inject_active))
+ stop_idle_inject();
+ break;
+ case PM_POST_HIBERNATION:
+ case PM_POST_SUSPEND:
+ printk("POST SUSPEND restart idle injection\n");
+ start_idle_inject();
+ break;
+ default:
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block idle_inject_pm_notifier = {
+ .notifier_call = idle_inject_pm_callback,
+};
+
+static struct notifier_block idle_inject_cpu_notifier = {
+ .notifier_call = idle_inject_cpu_callback,
+};
+
+static void end_idle_inject(void) {
+ unregister_hotcpu_notifier(&idle_inject_cpu_notifier);
+ unregister_pm_notifier(&idle_inject_pm_notifier);
+ atomic_set(&idle_inject_active, 0);
+ kfree(idle_inject_cpumask);
+}
+
+static int prepare_idle_inject(void)
+{
+ int retval = 0;
+ int bitmap_size;
+ int cpu;
+ struct hrtimer *hrt;
+
+ bitmap_size = BITS_TO_LONGS(num_possible_cpus()) * sizeof(long);
+ idle_inject_cpumask = kzalloc(bitmap_size, GFP_KERNEL);
+ if (!idle_inject_cpumask)
+ return -ENOMEM;
+
+ retval = register_pm_notifier(&idle_inject_pm_notifier);
+ if (retval)
+ goto exit_free;
+ retval = register_hotcpu_notifier(&idle_inject_cpu_notifier);
+ if (retval)
+ goto exit_unregister_pm;
+ get_online_cpus();
+ for_each_online_cpu(cpu) {
+ hrt = &per_cpu(idle_inject_timer, cpu);
+ hrtimer_init(hrt, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+ hrt->function = idle_inject_timer_fn;
+ }
+ put_online_cpus();
+
+ if (!duration)
+ duration = DEFAULT_DURATION_MSECS;
+
+ return 0;
+exit_unregister_pm:
+ unregister_pm_notifier(&idle_inject_pm_notifier);
+exit_free:
+ kfree(idle_inject_cpumask);
+ return retval;
+}
+
+int proc_sched_cfs_idle_inject_pct_handler(struct ctl_table *table,
+ int write,
+ void __user *buffer,
+ size_t *length, loff_t *ppos)
+{
+ int ret;
+
+ ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ if (ret)
+ goto out;
+
+ if (idle_pct != sysctl_sched_cfs_idle_inject_pct) {
+ if (!idle_pct)
+ start_idle_inject();
+ else if (!sysctl_sched_cfs_idle_inject_pct) {
+ stop_idle_inject();
+ end_idle_inject();
+ }
+
+ /* recompute injection parameters */
+ spin_lock_irq(&idle_inject_lock);
+ idle_pct = sysctl_sched_cfs_idle_inject_pct;
+ /*
+ * duration is fixed for each injection period, we adjust
+ * non idle interval to satisfy the idle percentage set
+ * by the user. e.g. if duration is 10 and we want 33% idle
+ * then interval is 20.
+ * 33% idle
+ * ____ ___________________ _________
+ * |________| |________| 33% idle
+ * ____ ________ _______
+ * |________| |________| 50% idle
+ *
+ * |duration|interval|
+ */
+ if (idle_pct)
+ inject_interval = (duration * (100 - idle_pct))
+ / idle_pct;
+
+ spin_unlock_irq(&idle_inject_lock);
+
+ }
+out:
+ return ret;
+}
+
+int proc_sched_cfs_idle_inject_duration_handler(struct ctl_table *table,
+ int write,
+ void __user *buffer,
+ size_t *length, loff_t *ppos)
+{
+ int ret;
+
+ ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ if (ret)
+ goto out;
+
+ if (duration == sysctl_sched_cfs_idle_inject_duration)
+ goto out;
+ /* recompute injection parameters */
+ spin_lock_irq(&idle_inject_lock);
+ duration = jiffies_to_msecs(sysctl_sched_cfs_idle_inject_duration);
+ if (idle_pct)
+ inject_interval = (duration * (100 - idle_pct)) / idle_pct;
+
+ spin_unlock_irq(&idle_inject_lock);
+out:
+ return ret;
+}
+
+#endif /* CONFIG_CFS_IDLE_INJECT */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6d2a119..589d30a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -417,6 +417,9 @@ struct cfs_rq {
struct list_head throttled_list;
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
+#ifdef CONFIG_CFS_IDLE_INJECT
+ int force_throttled;
+#endif
};

static inline int rt_bandwidth_enabled(void)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e69201d..78c304b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -433,6 +433,26 @@ static struct ctl_table kern_table[] = {
.extra1 = &one,
},
#endif
+#ifdef CONFIG_CFS_IDLE_INJECT
+ {
+ .procname = "sched_cfs_idle_inject_pct",
+ .data = &sysctl_sched_cfs_idle_inject_pct,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_sched_cfs_idle_inject_pct_handler,
+ .extra1 = &zero,
+ .extra2 = &one_hundred,
+ },
+ {
+ .procname = "sched_cfs_idle_inject_duration",
+ .data = &sysctl_sched_cfs_idle_inject_duration,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_sched_cfs_idle_inject_duration_handler,
+ .extra1 = &four,
+ .extra2 = &one_hundred,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",
--
1.9.1

2015-11-03 13:31:31

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

> @@ -5136,6 +5148,16 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
> struct task_struct *p;
> int new_tasks;
>
> +#ifdef CONFIG_CFS_IDLE_INJECT
> + if (cfs_rq->force_throttled &&
> + !idle_cpu(cpu_of(rq)) &&
> + !unlikely(local_softirq_pending())) {
> + /* forced idle, pick no task */
> + trace_sched_cfs_idle_inject(cpu_of(rq), 1);
> + update_curr(cfs_rq);
> + return NULL;
> + }
> +#endif
> again:
> #ifdef CONFIG_FAIR_GROUP_SCHED
> if (!cfs_rq->nr_running)

So this is horrible...

This is a fast path, and you just put at least one cachemiss in it, a
branch (without hint) and some goofy code (wth are we checking
softirqs?).

How about you frob things such that cfs_rq->nr_running == 0 and we'll
hit the idle: path, at that point you can test if we're forced idle and
skip the load-balancing attempt.

There's probably a fair number of icky cases to deal with if you frob
cfs_rq->nr_running, like the enqueue path which will add to it. We'll
have to come up with something to not slow that down either.

The thing is, both schedule and enqueue are very hot and this is code
that will 'never' run.

2015-11-03 14:16:13

by Jacob Pan

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

On Tue, 3 Nov 2015 14:31:20 +0100
Peter Zijlstra <[email protected]> wrote:

> > @@ -5136,6 +5148,16 @@ pick_next_task_fair(struct rq *rq, struct
> > task_struct *prev) struct task_struct *p;
> > int new_tasks;
> >
> > +#ifdef CONFIG_CFS_IDLE_INJECT
> > + if (cfs_rq->force_throttled &&
> > + !idle_cpu(cpu_of(rq)) &&
> > + !unlikely(local_softirq_pending())) {
> > + /* forced idle, pick no task */
> > + trace_sched_cfs_idle_inject(cpu_of(rq), 1);
> > + update_curr(cfs_rq);
> > + return NULL;
> > + }
> > +#endif
> > again:
> > #ifdef CONFIG_FAIR_GROUP_SCHED
> > if (!cfs_rq->nr_running)
>
> So this is horrible...
>
> This is a fast path, and you just put at least one cachemiss in it, a
> branch (without hint) and some goofy code (wth are we checking
> softirqs?).
>
softirq is checked here since it is one of the conditions to stop
sched tick. can_stop_idle_tick(). but we don't have to check here, you
are right.
> How about you frob things such that cfs_rq->nr_running == 0 and we'll
> hit the idle: path, at that point you can test if we're forced idle
> and skip the load-balancing attempt.
>
> There's probably a fair number of icky cases to deal with if you frob
> cfs_rq->nr_running, like the enqueue path which will add to it. We'll
> have to come up with something to not slow that down either.
>
> The thing is, both schedule and enqueue are very hot and this is code
> that will 'never' run.
Fair enough, I will give that a try. I understand we don't want to
sacrifice the hot patch for some code almost 'never' run. But I also
have follow up plan to use this code for consolidating/synchronizing
idle during balanced semi-active workload. In that case, it may run
more often. e.g.

Before:
CPU0 ______||| || |___________| || || |_____
CPU1 _________||| || |_______| || |_______

After:

CPU0 ______||| || |___________| || || |_____
CPU1 ______||| || |___________| || |_______

The goal is to have overlapping idle time if the load is already
balanced. The energy saving can be significant.

Jacob

2015-11-03 16:45:11

by Jacob Pan

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

On Tue, 3 Nov 2015 14:31:20 +0100
Peter Zijlstra <[email protected]> wrote:

> > @@ -5136,6 +5148,16 @@ pick_next_task_fair(struct rq *rq, struct
> > task_struct *prev) struct task_struct *p;
> > int new_tasks;
> >
> > +#ifdef CONFIG_CFS_IDLE_INJECT
> > + if (cfs_rq->force_throttled &&
> > + !idle_cpu(cpu_of(rq)) &&
> > + !unlikely(local_softirq_pending())) {
> > + /* forced idle, pick no task */
> > + trace_sched_cfs_idle_inject(cpu_of(rq), 1);
> > + update_curr(cfs_rq);
> > + return NULL;
> > + }
> > +#endif
> > again:
> > #ifdef CONFIG_FAIR_GROUP_SCHED
> > if (!cfs_rq->nr_running)
>
> So this is horrible...
>
> This is a fast path, and you just put at least one cachemiss in it, a
> branch (without hint) and some goofy code (wth are we checking
> softirqs?).
>
softirq is checked here since it is one of the conditions to stop
sched tick. can_stop_idle_tick(). but we don't have to check here, you
are right.
> How about you frob things such that cfs_rq->nr_running == 0 and we'll
> hit the idle: path, at that point you can test if we're forced idle
> and skip the load-balancing attempt.
>
> There's probably a fair number of icky cases to deal with if you frob
> cfs_rq->nr_running, like the enqueue path which will add to it. We'll
> have to come up with something to not slow that down either.
>
> The thing is, both schedule and enqueue are very hot and this is code
> that will 'never' run.
Fair enough, I will give that a try, I guess it would be costly
and hard to scale if we were to dequeue/enqueue every se for every
period of injection plus locking. Let me get some data first.

I understand we don't want to sacrifice the hot patch for some code
almost 'never' run. But I also have follow up plan to use this code for
consolidating/synchronizing idle during balanced semi-active workload.
In that case, it may run more often. e.g.

Before:
CPU0 ______||| || |___________| || || |_____
CPU1 _________||| || |_______| || |_______

After:

CPU0 ______||| || |___________| || || |_____
CPU1 ______||| || |___________| || |_______

The goal is to have overlapping idle time if the load is already
balanced. The energy saving can be significant.

Jacob

2015-11-04 06:07:01

by Eduardo Valentin

[permalink] [raw]

Subject: Re: [RFC PATCH 0/3] CFS idle injection

Hello Jacob,

On Mon, Nov 02, 2015 at 04:10:25PM -0800, Jacob Pan wrote:
> Hi Peter and all,
>
> A while ago, we had discussion about how powerclamp is broken in the
> sense of turning off idle ticks in the forced idle period.
> https://lkml.org/lkml/2014/12/18/369
>
> It was suggested to replace the current kthread play idle loop with a
> timer based runqueue throttling scheme. I finally got around to implement
> this and code is much simpler. I also have good test results in terms of
> efficiency, scalability, etc.
> http://events.linuxfoundation.org/sites/events/files/slides/LinuxCon_Japan_2015_idle_injection1_0.pdf
> slide #18+ shows the data on client and server.
>
> I have two choices for this code:
> 1) be part of existing powerclamp driver but require exporting some
> sched APIs.
> 2) be part of sched since the genernal rule applies when it comes down
> to sycnhronized idle time for best power savings.
>
> The patches below are for #2. There is a known problem with LOW RES timer
> mode that I am working on. But I am hoping to get review earlier.
>

I also like #2 too. Specially now that it is not limited to a specific
platform. One question though, could you still keep the cooling device
support of it? In some systems, it might make sense to enable / disable
idle injections based on temperature.

Was there any particular reason you dropped the cooling device support?

BR,

Eduardo Valentin

> We are entering a very power limited environment on client side, frequency
> scaling can only be efficient at certain range. e.g. on SKL, upto ~900MHz,
> anything below, it is increasingly more efficient to do C-states insertion
> if coordinated.
>
> Looking forward, there are use case beyond thermal/power capping. I think
> we can consolidate ballanced partial busy workload that are evenly
> distributed among CPUs.
>
> Please let me know what you think.
>
> Thanks,
>
>
> Jacob Pan (3):
> ktime: add a roundup function
> timer: relax tick stop in idle entry
> sched: introduce synchronized idle injection
>
> include/linux/ktime.h | 10 ++
> include/linux/sched.h | 12 ++
> include/linux/sched/sysctl.h | 5 +
> include/trace/events/sched.h | 23 +++
> init/Kconfig | 8 +
> kernel/sched/fair.c | 345 +++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 3 +
> kernel/sysctl.c | 20 +++
> kernel/time/tick-sched.c | 2 +-
> 9 files changed, 427 insertions(+), 1 deletion(-)
>
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2015-11-04 16:58:33

by Jacob Pan

[permalink] [raw]

Subject: Re: [RFC PATCH 0/3] CFS idle injection

On Tue, 3 Nov 2015 22:06:55 -0800
Eduardo Valentin <[email protected]> wrote:

> Hello Jacob,
>
> On Mon, Nov 02, 2015 at 04:10:25PM -0800, Jacob Pan wrote:
> > Hi Peter and all,
> >
> > A while ago, we had discussion about how powerclamp is broken in the
> > sense of turning off idle ticks in the forced idle period.
> > https://lkml.org/lkml/2014/12/18/369
> >
> > It was suggested to replace the current kthread play idle loop with
> > a timer based runqueue throttling scheme. I finally got around to
> > implement this and code is much simpler. I also have good test
> > results in terms of efficiency, scalability, etc.
> > http://events.linuxfoundation.org/sites/events/files/slides/LinuxCon_Japan_2015_idle_injection1_0.pdf
> > slide #18+ shows the data on client and server.
> >
> > I have two choices for this code:
> > 1) be part of existing powerclamp driver but require exporting some
> > sched APIs.
> > 2) be part of sched since the genernal rule applies when it comes
> > down to sycnhronized idle time for best power savings.
> >
> > The patches below are for #2. There is a known problem with LOW RES
> > timer mode that I am working on. But I am hoping to get review
> > earlier.
> >
>
> I also like #2 too. Specially now that it is not limited to a specific
> platform. One question though, could you still keep the cooling device
> support of it? In some systems, it might make sense to enable /
> disable idle injections based on temperature.
>
One of the key difference between 1 and 2 is that #2 is open loop
control, since we don't have CPU c-states info baked into scheduler. To
close the loop, perhaps we can export some internal APIs to the thermal
subsystem then the thermal governors can pick the condition to inject
idle.
> Was there any particular reason you dropped the cooling device
> support?
>
I did sysctl instead of thermal sysfs to conform the rest of the sched
tuning knobs. We could also have a proxy cooling device to call
internal APIs mentioned above.

Another reason is that, I intend to extend beyond thermal. Where we can
consolidate/sync idle work in semi-active and balanced workload.

Thanks for the suggestions,

Jacob
> BR,
>
> Eduardo Valentin
>
>
> > We are entering a very power limited environment on client side,
> > frequency scaling can only be efficient at certain range. e.g. on
> > SKL, upto ~900MHz, anything below, it is increasingly more
> > efficient to do C-states insertion if coordinated.
> >
> > Looking forward, there are use case beyond thermal/power capping. I
> > think we can consolidate ballanced partial busy workload that are
> > evenly distributed among CPUs.
> >
> > Please let me know what you think.
> >
> > Thanks,
> >
> >
> > Jacob Pan (3):
> > ktime: add a roundup function
> > timer: relax tick stop in idle entry
> > sched: introduce synchronized idle injection
> >
> > include/linux/ktime.h | 10 ++
> > include/linux/sched.h | 12 ++
> > include/linux/sched/sysctl.h | 5 +
> > include/trace/events/sched.h | 23 +++
> > init/Kconfig | 8 +
> > kernel/sched/fair.c | 345
> > +++++++++++++++++++++++++++++++++++++++++++
> > kernel/sched/sched.h | 3 + kernel/sysctl.c
> > | 20 +++ kernel/time/tick-sched.c | 2 +-
> > 9 files changed, 427 insertions(+), 1 deletion(-)
> >
> > --
> > 1.9.1
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe
> > linux-kernel" in the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/

[Jacob Pan]

2015-11-04 17:06:52

by srinivas pandruvada

[permalink] [raw]

Subject: Re: [RFC PATCH 0/3] CFS idle injection

On Wed, 2015-11-04 at 08:58 -0800, Jacob Pan wrote:
> On Tue, 3 Nov 2015 22:06:55 -0800
> Eduardo Valentin <[email protected]> wrote:
>
> > Hello Jacob,
> >
> > On Mon, Nov 02, 2015 at 04:10:25PM -0800, Jacob Pan wrote:
> > > Hi Peter and all,
> > >
> > > A while ago, we had discussion about how powerclamp is broken in the
> > > sense of turning off idle ticks in the forced idle period.
> > > https://lkml.org/lkml/2014/12/18/369
> > >
> > > It was suggested to replace the current kthread play idle loop with
> > > a timer based runqueue throttling scheme. I finally got around to
> > > implement this and code is much simpler. I also have good test
> > > results in terms of efficiency, scalability, etc.
> > > http://events.linuxfoundation.org/sites/events/files/slides/LinuxCon_Japan_2015_idle_injection1_0.pdf
> > > slide #18+ shows the data on client and server.
> > >
> > > I have two choices for this code:
> > > 1) be part of existing powerclamp driver but require exporting some
> > > sched APIs.
> > > 2) be part of sched since the genernal rule applies when it comes
> > > down to sycnhronized idle time for best power savings.
> > >
> > > The patches below are for #2. There is a known problem with LOW RES
> > > timer mode that I am working on. But I am hoping to get review
> > > earlier.
> > >
> >
> > I also like #2 too. Specially now that it is not limited to a specific
> > platform. One question though, could you still keep the cooling device
> > support of it? In some systems, it might make sense to enable /
> > disable idle injections based on temperature.
> >
> One of the key difference between 1 and 2 is that #2 is open loop
> control, since we don't have CPU c-states info baked into scheduler. To
> close the loop, perhaps we can export some internal APIs to the thermal
> subsystem then the thermal governors can pick the condition to inject
> idle.
> > Was there any particular reason you dropped the cooling device
> > support?
> >
> I did sysctl instead of thermal sysfs to conform the rest of the sched
> tuning knobs. We could also have a proxy cooling device to call
> internal APIs mentioned above.
I think we should have cooling device as we are already using this
cooling device. Once it pass RFC stage,I think we should consider add
this.
Thanks,
Srinivas
>
> Another reason is that, I intend to extend beyond thermal. Where we can
> consolidate/sync idle work in semi-active and balanced workload.
>
> Thanks for the suggestions,
>
> Jacob
> > BR,
> >
> > Eduardo Valentin
> >
> >
> > > We are entering a very power limited environment on client side,
> > > frequency scaling can only be efficient at certain range. e.g. on
> > > SKL, upto ~900MHz, anything below, it is increasingly more
> > > efficient to do C-states insertion if coordinated.
> > >
> > > Looking forward, there are use case beyond thermal/power capping. I
> > > think we can consolidate ballanced partial busy workload that are
> > > evenly distributed among CPUs.
> > >
> > > Please let me know what you think.
> > >
> > > Thanks,
> > >
> > >
> > > Jacob Pan (3):
> > > ktime: add a roundup function
> > > timer: relax tick stop in idle entry
> > > sched: introduce synchronized idle injection
> > >
> > > include/linux/ktime.h | 10 ++
> > > include/linux/sched.h | 12 ++
> > > include/linux/sched/sysctl.h | 5 +
> > > include/trace/events/sched.h | 23 +++
> > > init/Kconfig | 8 +
> > > kernel/sched/fair.c | 345
> > > +++++++++++++++++++++++++++++++++++++++++++
> > > kernel/sched/sched.h | 3 + kernel/sysctl.c
> > > | 20 +++ kernel/time/tick-sched.c | 2 +-
> > > 9 files changed, 427 insertions(+), 1 deletion(-)
> > >
> > > --
> > > 1.9.1
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe
> > > linux-kernel" in the body of a message to [email protected]
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > Please read the FAQ at http://www.tux.org/lkml/
>
> [Jacob Pan]

2015-11-04 18:43:59

by Eduardo Valentin

[permalink] [raw]

Subject: Re: [RFC PATCH 0/3] CFS idle injection

Hello Jacob, Srinivas,

On Wed, Nov 04, 2015 at 09:05:52AM -0800, Srinivas Pandruvada wrote:
> On Wed, 2015-11-04 at 08:58 -0800, Jacob Pan wrote:

<cut>
> > > > I have two choices for this code:
> > > > 1) be part of existing powerclamp driver but require exporting some
> > > > sched APIs.
> > > > 2) be part of sched since the genernal rule applies when it comes
> > > > down to sycnhronized idle time for best power savings.
> > > >
> > > > The patches below are for #2. There is a known problem with LOW RES
> > > > timer mode that I am working on. But I am hoping to get review
> > > > earlier.
> > > >
> > >
> > > I also like #2 too. Specially now that it is not limited to a specific
> > > platform. One question though, could you still keep the cooling device
> > > support of it? In some systems, it might make sense to enable /
> > > disable idle injections based on temperature.
> > >
> > One of the key difference between 1 and 2 is that #2 is open loop
> > control, since we don't have CPU c-states info baked into scheduler. To
> > close the loop, perhaps we can export some internal APIs to the thermal
> > subsystem then the thermal governors can pick the condition to inject
> > idle.

Jacob,

I also like this direction. Having the proper APIs exported, creating a
cooling device that use them would be natural path. Then, one could
create a thermal zone plugging a governor and the idle injection cooling
device that uses the exported APIs.

> > > Was there any particular reason you dropped the cooling device
> > > support?
> > >
> > I did sysctl instead of thermal sysfs to conform the rest of the sched
> > tuning knobs. We could also have a proxy cooling device to call
> > internal APIs mentioned above.

Agreed here then.

> I think we should have cooling device as we are already using this
> cooling device. Once it pass RFC stage,I think we should consider add
> this.

Srinivas,
Yes, that seens to be a good path to follow. Thanks.

> Thanks,
> Srinivas
> >
> > Another reason is that, I intend to extend beyond thermal. Where we can
> > consolidate/sync idle work in semi-active and balanced workload.

I see.

BR,

Eduardo Valentin

2015-11-05 10:09:29

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

On Tue, Nov 03, 2015 at 08:45:01AM -0800, Jacob Pan wrote:
> Fair enough, I will give that a try, I guess it would be costly
> and hard to scale if we were to dequeue/enqueue every se for every
> period of injection plus locking. Let me get some data first.

Yeah, don't go dequeue/enqueue everything, the regular throttle code
doesn't do that either.

Throttling the root is more interesting though, becuase you can simply
dequeue the child group representative, whereas you cannot dequeue the
root (for being the root means there is no parent to dequeue from).

> I understand we don't want to sacrifice the hot patch for some code
> almost 'never' run. But I also have follow up plan to use this code for
> consolidating/synchronizing idle during balanced semi-active workload.
> In that case, it may run more often. e.g.
>
> Before:
> CPU0 ______||| || |___________| || || |_____
> CPU1 _________||| || |_______| || |_______
>
> After:
>
> CPU0 ______||| || |___________| || || |_____
> CPU1 ______||| || |___________| || |_______
>
> The goal is to have overlapping idle time if the load is already
> balanced. The energy saving can be significant.

I can see such a scheme having a fairly big impact on latency, esp. with
forced idleness such as this. That's not going to be popular for many
workloads.

2015-11-05 10:12:31

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC PATCH 0/3] CFS idle injection

People, trim your emails!

On Wed, Nov 04, 2015 at 08:58:30AM -0800, Jacob Pan wrote:

> > I also like #2 too. Specially now that it is not limited to a specific
> > platform. One question though, could you still keep the cooling device
> > support of it? In some systems, it might make sense to enable /
> > disable idle injections based on temperature.

> One of the key difference between 1 and 2 is that #2 is open loop
> control, since we don't have CPU c-states info baked into scheduler.

_yet_, there's people working on that. The whole power aware scheduling
stuff needs that.

> To close the loop, perhaps we can export some internal APIs to the
> thermal subsystem then the thermal governors can pick the condition to
> inject idle.

I would much rather that all be part of the power aware stuff, such that
the scheduler itself is aware of thermal limits and can migrate load
away if needed.

2015-11-05 13:59:58

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

On Tue, Nov 03, 2015 at 02:31:20PM +0100, Peter Zijlstra wrote:
> > @@ -5136,6 +5148,16 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
> > struct task_struct *p;
> > int new_tasks;
> >
> > +#ifdef CONFIG_CFS_IDLE_INJECT
> > + if (cfs_rq->force_throttled &&
> > + !idle_cpu(cpu_of(rq)) &&
> > + !unlikely(local_softirq_pending())) {
> > + /* forced idle, pick no task */
> > + trace_sched_cfs_idle_inject(cpu_of(rq), 1);
> > + update_curr(cfs_rq);
> > + return NULL;
> > + }
> > +#endif
> > again:
> > #ifdef CONFIG_FAIR_GROUP_SCHED
> > if (!cfs_rq->nr_running)
>
> So this is horrible...

So this isn't ideal either (I rather liked the previous approach of a
random task assuming idle, but tglx hated that). This should at least
not touch extra cachelines in the hot paths, although it does add a few
extra instructions :/

Very limited testing didn't show anything horrible.

Your throttle would:

raw_spin_lock_irqsave(&rq->lock, flags);
rq->cfs.forced_idle = true;
resched = rq->cfs.runnable;
rq->cfs.runnable = false;
raw_spin_unlock_irqrestore(&rq->lock, flags);
if (resched)
resched_cpu(cpu_of(rq));

And your unthrottle:

raw_spin_lock_irqsave(&rq->lock, flags);
rq->cfs.forced_idle = false;
resched = rq->cfs.runnable = !!rq->cfs.nr_running;
raw_spin_unlock_irqrestore(&rq->lock, flags);
if (resched)
resched_cpu(cpu_of(rq));

---
kernel/sched/fair.c | 13 +++++++++----
kernel/sched/sched.h | 1 +
2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 824aa9f..1f0c809 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2341,7 +2341,8 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
list_add(&se->group_node, &rq->cfs_tasks);
}
#endif
- cfs_rq->nr_running++;
+ if (!cfs_rq->nr_running++ && !cfs_rq->forced_idle)
+ cfs_rq->runnable = true;
}

static void
@@ -2354,7 +2355,8 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
account_numa_dequeue(rq_of(cfs_rq), task_of(se));
list_del_init(&se->group_node);
}
- cfs_rq->nr_running--;
+ if (!--cfs_rq->nr_running && !cfs_rq->forced_idle)
+ cfs_rq->runnable = false;
}

#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -5204,7 +5206,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)

again:
#ifdef CONFIG_FAIR_GROUP_SCHED
- if (!cfs_rq->nr_running)
+ if (!cfs_rq->runnable)
goto idle;

if (prev->sched_class != &fair_sched_class)
@@ -5283,7 +5285,7 @@ simple:
cfs_rq = &rq->cfs;
#endif

- if (!cfs_rq->nr_running)
+ if (!cfs_rq->runnable)
goto idle;

put_prev_task(rq, prev);
@@ -5302,6 +5304,9 @@ simple:
return p;

idle:
+ if (cfs_rq->forced_idle)
+ return NULL;
+
/*
* This is OK, because current is on_cpu, which avoids it being picked
* for load-balance and preemption/IRQs are still disabled avoiding
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efd3bfc..33d355d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -347,6 +347,7 @@ struct cfs_bandwidth { };
struct cfs_rq {
struct load_weight load;
unsigned int nr_running, h_nr_running;
+ unsigned int runnable, forced_idle;

u64 exec_clock;
u64 min_vruntime;

2015-11-05 14:23:00

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

On 11/5/2015 2:09 AM, Peter Zijlstra wrote:

> I can see such a scheme having a fairly big impact on latency, esp. with
> forced idleness such as this. That's not going to be popular for many
> workloads.

idle injection is a last ditch effort in thermal management, before
this gets used the hardware already has clamped you to a low frequency,
reduced memory speeds, probably dimmed your screen etc etc.

at this point there are 3 choices
1) Shut off the device
2) do uncoordinated idle injection for 40% of the time
3) do coordinated idle injection for 5% of the time

as much as force injecting idle in a synchronized way sucks, the alternatives are worse.

>

2015-11-05 14:33:38

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

On Thu, Nov 05, 2015 at 06:22:58AM -0800, Arjan van de Ven wrote:
> On 11/5/2015 2:09 AM, Peter Zijlstra wrote:
>
> >I can see such a scheme having a fairly big impact on latency, esp. with
> >forced idleness such as this. That's not going to be popular for many
> >workloads.
>
> idle injection is a last ditch effort in thermal management, before
> this gets used the hardware already has clamped you to a low frequency,
> reduced memory speeds, probably dimmed your screen etc etc.
>
> at this point there are 3 choices
> 1) Shut off the device
> 2) do uncoordinated idle injection for 40% of the time
> 3) do coordinated idle injection for 5% of the time
>
> as much as force injecting idle in a synchronized way sucks, the alternatives are worse.

OK, it wasn't put that way. I figured it was a way to use less power on
any workload with idle time on.

That said; what kind of devices are we talking about here; mobile with
pittyful heat dissipation? Surely a well designed server or desktop
class system should never get into this situation in the first place.

2015-11-05 14:48:45

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

On Thu, Nov 05, 2015 at 03:33:32PM +0100, Peter Zijlstra wrote:

> > idle injection is a last ditch effort in thermal management

It just grates at me a bit that we have to touch hot paths for such
scenarios :/

2015-11-05 15:28:54

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

On 11/5/2015 6:33 AM, Peter Zijlstra wrote:
> On Thu, Nov 05, 2015 at 06:22:58AM -0800, Arjan van de Ven wrote:
>> On 11/5/2015 2:09 AM, Peter Zijlstra wrote:
>>
>>> I can see such a scheme having a fairly big impact on latency, esp. with
>>> forced idleness such as this. That's not going to be popular for many
>>> workloads.
>>
>> idle injection is a last ditch effort in thermal management, before
>> this gets used the hardware already has clamped you to a low frequency,
>> reduced memory speeds, probably dimmed your screen etc etc.
>>
>> at this point there are 3 choices
>> 1) Shut off the device
>> 2) do uncoordinated idle injection for 40% of the time
>> 3) do coordinated idle injection for 5% of the time
>>
>> as much as force injecting idle in a synchronized way sucks, the alternatives are worse.
>
> OK, it wasn't put that way. I figured it was a way to use less power on
> any workload with idle time on.

so idle injection (as with pretty much every thermal management feature) is NOT a way to save
on battery life. Every known method pretty much ends up sacrificing more in terms of performance
than you gain in instant power that over time you end up using more (drain battery basically).

idle injection, if synchronized, is one of the more effective ones, e.g. give up the least efficiency
compared to, say, unsynchronized or even inserting idle cycles in the CPU (T-states)...
not even speaking of just turning the system off.

>
> That said; what kind of devices are we talking about here; mobile with
> pittyful heat dissipation? Surely a well designed server or desktop
> class system should never get into this situation in the first place.

a well designed server may not, but the datacenter it is in may.
for example if the AC goes out, but also, sometimes the datacenter's peak heat dissapation
can exceed the AC capacity (which is outside temperature dependent.. yay global warming),
which may require an urgent reduction over a series of machines for the duration of the peak load/peak temperature
(usually just inserting a little bit, say 1%, over all servers will do)

>It just grates at me a bit that we have to touch hot paths for such
scenarios :/

well we have this as a driver right now that does not touch hot paths,
but it seems you and tglx also hate that approach with a passion....

2015-11-05 15:33:58

by Jacob Pan

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

On Thu, 5 Nov 2015 15:33:32 +0100
Peter Zijlstra <[email protected]> wrote:

> On Thu, Nov 05, 2015 at 06:22:58AM -0800, Arjan van de Ven wrote:
> > On 11/5/2015 2:09 AM, Peter Zijlstra wrote:
> >
> > >I can see such a scheme having a fairly big impact on latency,
> > >esp. with forced idleness such as this. That's not going to be
> > >popular for many workloads.
> >
> > idle injection is a last ditch effort in thermal management, before
> > this gets used the hardware already has clamped you to a low
> > frequency, reduced memory speeds, probably dimmed your screen etc
> > etc.
> >
Just to clarify, the low frequency here is not necessarily the minimum
frequency. It is usually the Pe (max efficiency).
> > at this point there are 3 choices
> > 1) Shut off the device
> > 2) do uncoordinated idle injection for 40% of the time
> > 3) do coordinated idle injection for 5% of the time
> >
> > as much as force injecting idle in a synchronized way sucks, the
> > alternatives are worse.
>
> OK, it wasn't put that way. I figured it was a way to use less power
> on any workload with idle time on.
>
> That said; what kind of devices are we talking about here; mobile with
> pittyful heat dissipation? Surely a well designed server or desktop
> class system should never get into this situation in the first place.
Yes, Mobile devices, especially fan-less, are the targets. On one side
we all desire high performance, but it does not come free. The
performance tax might limit the ability to scale at the low end.
e.g. on skylake-y P1 is 1.5GHz, Pe(max efficiency, dynamic) is ~900MHz,
Pmin is 400Mhz.
When thermal limit occurs, there are two options
1. limit freq to Pmin 400Mhz and run 100%
2. let CPU run at ~800Mhz but inject idle at 50%

#2 option provides better performance per watt since it can scale
nearly linearly, i.e. 50% performance at 50% power. For my own limited
testing and this can vary greatly by parts, running at Pmin vs Pe can
lose 30% perf per watt.

2015-11-05 16:06:24

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

On 11/5/2015 7:32 AM, Jacob Pan wrote:
> On Thu, 5 Nov 2015 15:33:32 +0100
> Peter Zijlstra <[email protected]> wrote:
>
>> On Thu, Nov 05, 2015 at 06:22:58AM -0800, Arjan van de Ven wrote:
>>> On 11/5/2015 2:09 AM, Peter Zijlstra wrote:
>>>
>>>> I can see such a scheme having a fairly big impact on latency,
>>>> esp. with forced idleness such as this. That's not going to be
>>>> popular for many workloads.
>>>
>>> idle injection is a last ditch effort in thermal management, before
>>> this gets used the hardware already has clamped you to a low
>>> frequency, reduced memory speeds, probably dimmed your screen etc
>>> etc.
>>>
> Just to clarify, the low frequency here is not necessarily the minimum
> frequency. It is usually the Pe (max efficiency).

to translate that from Intelese to English:
The system already is at the lowest frequency that's relatively efficient. To go even lower in instant power
consumption (e.g. heat) by even a little bit, a LOT of frequency needs to be sacrificed.

Idle injection sucks. But it's more efficient (at the point that it would get used) than any other methods,
so it also sucks less than those other methods for the same amount of reduction in heat generation.
It only gets used if the system HAS to reduce the heat generation, either because it's a mobile device with
little cooling capacity, or because the airconditioning in your big datacenter is currently
not able to keep up.

2015-11-05 16:52:09

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

On Thu, Nov 05, 2015 at 07:28:50AM -0800, Arjan van de Ven wrote:
> well we have this as a driver right now that does not touch hot paths,
> but it seems you and tglx also hate that approach with a passion....

The current code is/was broken, but when I tried fixing it, tglx
objected to the entire approach yes...

Thomas, no arm twisting can convince you to reconsider the fake idle
task approach?

2015-11-05 18:48:34

by Thomas Gleixner

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

On Thu, 5 Nov 2015, Arjan van de Ven wrote:
> On 11/5/2015 6:33 AM, Peter Zijlstra wrote:
>
> > It just grates at me a bit that we have to touch hot paths for such
> scenarios :/
>
> well we have this as a driver right now that does not touch hot paths,
> but it seems you and tglx also hate that approach with a passion....

Right. It's a horror to deal with tasks which try to impersonate idle
while actually running as highest priority task in the system. We
really want the scheduler to know about it. Yes, we probably have to
pay the price for some extra check in the hot path, but that's way
more sensible than figuring out how 'want to be idle' RT tasks
wreckage the world and some more.

Thanks,

tglx

2015-11-05 18:55:55

by Thomas Gleixner

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

On Thu, 5 Nov 2015, Peter Zijlstra wrote:
> On Thu, Nov 05, 2015 at 07:28:50AM -0800, Arjan van de Ven wrote:
> > well we have this as a driver right now that does not touch hot paths,
> > but it seems you and tglx also hate that approach with a passion....
>
> The current code is/was broken, but when I tried fixing it, tglx
> objected to the entire approach yes...
>
> Thomas, no arm twisting can convince you to reconsider the fake idle
> task approach?

As long as the thing just calls mwait or whatever twisting might be
successful, but I'm not going to change my mind on something calling
into the idle related routines (timers, rcu, etc..) and violates all
sensible assumptions we make in that code. It's complex enough already
and we really do not need subtle wreckage by half baken and unreviewed
drivers to add more complexity.

Thanks,

tglx

2015-11-05 19:28:29

by Jacob Pan

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

On Thu, 5 Nov 2015 11:09:22 +0100
Peter Zijlstra <[email protected]> wrote:

> > Before:
> > CPU0 ______||| || |___________| || || |_____
> > CPU1 _________||| || |_______| || |_______
> >
> > After:
> >
> > CPU0 ______||| || |___________| || || |_____
> > CPU1 ______||| || |___________| || |_______
> >
> > The goal is to have overlapping idle time if the load is already
> > balanced. The energy saving can be significant.
>
> I can see such a scheme having a fairly big impact on latency, esp.
> with forced idleness such as this. That's not going to be popular for
> many workloads.
agreed, it would be for limited workload. the key is to identify such
workloads at runtime. I am thinking to use the load average of
the busiest CPU as reference for consolidation, will not go beyond
that.
For the patch I have today and if you play a game like this one
http://www.agame.com/game/cut-the-rope

2015-11-05 19:33:37

by Jacob Pan

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

On Thu, 5 Nov 2015 11:27:32 -0800
Jacob Pan <[email protected]> wrote:

> On Thu, 5 Nov 2015 11:09:22 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > > Before:
> > > CPU0 ______||| || |___________| || || |_____
> > > CPU1 _________||| || |_______| || |_______
> > >
> > > After:
> > >
> > > CPU0 ______||| || |___________| || || |_____
> > > CPU1 ______||| || |___________| || |_______
> > >
> > > The goal is to have overlapping idle time if the load is already
> > > balanced. The energy saving can be significant.
> >
> > I can see such a scheme having a fairly big impact on latency, esp.
> > with forced idleness such as this. That's not going to be popular
> > for many workloads.
> agreed, it would be for limited workload. the key is to identify such
> workloads at runtime. I am thinking to use the load average of
> the busiest CPU as reference for consolidation, will not go beyond
> that.
> For the patch I have today and if you play a game like this one
> http://www.agame.com/game/cut-the-rope
sorry, hit the wrong button before finishing the email.
and set duration to 5, and 20% idle, it does not affect user experience
much. It saves ~15% power on my BDW ultrabook. Other unbalanced
workload such as video playback don't benefit, should be avoided.

2015-11-05 23:37:24

by Jacob Pan

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

On Thu, 5 Nov 2015 14:59:52 +0100
Peter Zijlstra <[email protected]> wrote:

> On Tue, Nov 03, 2015 at 02:31:20PM +0100, Peter Zijlstra wrote:
> > > @@ -5136,6 +5148,16 @@ pick_next_task_fair(struct rq *rq, struct
> > > task_struct *prev) struct task_struct *p;
> > > int new_tasks;
> > >
> > > +#ifdef CONFIG_CFS_IDLE_INJECT
> > > + if (cfs_rq->force_throttled &&
> > > + !idle_cpu(cpu_of(rq)) &&
> > > + !unlikely(local_softirq_pending())) {
> > > + /* forced idle, pick no task */
> > > + trace_sched_cfs_idle_inject(cpu_of(rq), 1);
> > > + update_curr(cfs_rq);
> > > + return NULL;
> > > + }
> > > +#endif
> > > again:
> > > #ifdef CONFIG_FAIR_GROUP_SCHED
> > > if (!cfs_rq->nr_running)
> >
> > So this is horrible...
>
> So this isn't ideal either (I rather liked the previous approach of a
> random task assuming idle, but tglx hated that). This should at least
> not touch extra cachelines in the hot paths, although it does add a
> few extra instructions :/
>
> Very limited testing didn't show anything horrible.
>
I did some testing with the code below, it shows random
[ 150.442597] NOHZ: local_softirq_pending 02
[ 153.032673] NOHZ: local_softirq_pending 202
[ 153.203785] NOHZ: local_softirq_pending 202
[ 153.206486] NOHZ: local_softirq_pending 282
I recalled that was why i checked for local_softirq_pending in the
initial patch, still trying to find out how we can avoid that. These
also causes non stop sched ticks in the inner idle loop.

> Your throttle would:
>
> raw_spin_lock_irqsave(&rq->lock, flags);
> rq->cfs.forced_idle = true;
> resched = rq->cfs.runnable;
> rq->cfs.runnable = false;
> raw_spin_unlock_irqrestore(&rq->lock, flags);
> if (resched)
> resched_cpu(cpu_of(rq));
>
> And your unthrottle:
>
> raw_spin_lock_irqsave(&rq->lock, flags);
> rq->cfs.forced_idle = false;
> resched = rq->cfs.runnable = !!rq->cfs.nr_running;
> raw_spin_unlock_irqrestore(&rq->lock, flags);
> if (resched)
> resched_cpu(cpu_of(rq));
>
> ---
> kernel/sched/fair.c | 13 +++++++++----
> kernel/sched/sched.h | 1 +
> 2 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 824aa9f..1f0c809 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2341,7 +2341,8 @@ account_entity_enqueue(struct cfs_rq *cfs_rq,
> struct sched_entity *se) list_add(&se->group_node, &rq->cfs_tasks);
> }
> #endif
> - cfs_rq->nr_running++;
> + if (!cfs_rq->nr_running++ && !cfs_rq->forced_idle)
> + cfs_rq->runnable = true;
> }
>
> static void
> @@ -2354,7 +2355,8 @@ account_entity_dequeue(struct cfs_rq *cfs_rq,
> struct sched_entity *se) account_numa_dequeue(rq_of(cfs_rq),
> task_of(se)); list_del_init(&se->group_node);
> }
> - cfs_rq->nr_running--;
> + if (!--cfs_rq->nr_running && !cfs_rq->forced_idle)
> + cfs_rq->runnable = false;
> }
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> @@ -5204,7 +5206,7 @@ pick_next_task_fair(struct rq *rq, struct
> task_struct *prev)
> again:
> #ifdef CONFIG_FAIR_GROUP_SCHED
> - if (!cfs_rq->nr_running)
> + if (!cfs_rq->runnable)
> goto idle;
>
> if (prev->sched_class != &fair_sched_class)
> @@ -5283,7 +5285,7 @@ simple:
> cfs_rq = &rq->cfs;
> #endif
>
> - if (!cfs_rq->nr_running)
> + if (!cfs_rq->runnable)
> goto idle;
>
> put_prev_task(rq, prev);
> @@ -5302,6 +5304,9 @@ simple:
> return p;
>
> idle:
> + if (cfs_rq->forced_idle)
> + return NULL;
> +
> /*
> * This is OK, because current is on_cpu, which avoids it
> being picked
> * for load-balance and preemption/IRQs are still disabled
> avoiding diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index efd3bfc..33d355d 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -347,6 +347,7 @@ struct cfs_bandwidth { };
> struct cfs_rq {
> struct load_weight load;
> unsigned int nr_running, h_nr_running;
> + unsigned int runnable, forced_idle;
>
> u64 exec_clock;
> u64 min_vruntime;

[Jacob Pan]

2015-11-06 07:45:18

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

On Thu, Nov 05, 2015 at 03:36:25PM -0800, Jacob Pan wrote:

> I did some testing with the code below, it shows random
> [ 150.442597] NOHZ: local_softirq_pending 02
> [ 153.032673] NOHZ: local_softirq_pending 202
> [ 153.203785] NOHZ: local_softirq_pending 202
> [ 153.206486] NOHZ: local_softirq_pending 282
> I recalled that was why i checked for local_softirq_pending in the
> initial patch, still trying to find out how we can avoid that. These
> also causes non stop sched ticks in the inner idle loop.

Check the softirq stuff before calling throttle ?

2015-11-06 16:51:16

by Punit Agrawal

[permalink] [raw]

Subject: Re: [RFC PATCH 0/3] CFS idle injection

Peter Zijlstra <[email protected]> writes:

> People, trim your emails!
>
> On Wed, Nov 04, 2015 at 08:58:30AM -0800, Jacob Pan wrote:
>
>> > I also like #2 too. Specially now that it is not limited to a specific
>> > platform. One question though, could you still keep the cooling device
>> > support of it? In some systems, it might make sense to enable /
>> > disable idle injections based on temperature.
>
>> One of the key difference between 1 and 2 is that #2 is open loop
>> control, since we don't have CPU c-states info baked into scheduler.
>
> _yet_, there's people working on that. The whole power aware scheduling
> stuff needs that.
>
>> To close the loop, perhaps we can export some internal APIs to the
>> thermal subsystem then the thermal governors can pick the condition to
>> inject idle.
>
> I would much rather that all be part of the power aware stuff, such that
> the scheduler itself is aware of thermal limits and can migrate load
> away if needed.

I was wondering if we could use cpu capacity as the interface between
the thermal sub-system and the scheduler. This would be better than
dealing with frequency caps and idle injection percentages directly in
the scheduler.

We've been playing with making the scheduler respect capacity caps due
to thermal constraints and have tasks migrated away to less capped
cores.

It would be great if in addition to the frequency caps, we could add
idle injection to the arsenal. This would allow building policies on top
such as -

* pure idle injection where frequency capping is unsuitable (or
unavailable)
* a smooth continuum of capacities using a combination of frequency and
capacity capping
* idle injection once frequencies have been capped to the lowest
feasible values (as suggested in the cover letter)

One question about the implementation in these patches - should the
implementation hook into pick_next_task in core instead of CFS? Higher
priority tasks might get in the way of idle injection.

> -- To unsubscribe from this list: send the line "unsubscribe
> linux-kernel" in the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2015-11-06 18:30:08

by Dietmar Eggemann

[permalink] [raw]

Subject: Re: [RFC PATCH 0/3] CFS idle injection

On 05/11/15 10:12, Peter Zijlstra wrote:
>
> People, trim your emails!
>
> On Wed, Nov 04, 2015 at 08:58:30AM -0800, Jacob Pan wrote:
>
>>> I also like #2 too. Specially now that it is not limited to a specific
>>> platform. One question though, could you still keep the cooling device
>>> support of it? In some systems, it might make sense to enable /
>>> disable idle injections based on temperature.
>
>> One of the key difference between 1 and 2 is that #2 is open loop
>> control, since we don't have CPU c-states info baked into scheduler.
>
> _yet_, there's people working on that. The whole power aware scheduling
> stuff needs that.

Isn't the idle state information (rq->idle_state) already used in
find_idlest_cpu()?

What we use in energy aware scheduling is quite similar but since we're
interested in the index information of the c-state (to access the right
element of the idle_state vectors of the energy model, we added
rq->idle_state_idx.

[...]

2015-11-06 19:11:27

by Jacob Pan

[permalink] [raw]

Subject: Re: [RFC PATCH 0/3] CFS idle injection

On Fri, 6 Nov 2015 18:30:01 +0000
Dietmar Eggemann <[email protected]> wrote:

> On 05/11/15 10:12, Peter Zijlstra wrote:
> >
> > People, trim your emails!
> >
> > On Wed, Nov 04, 2015 at 08:58:30AM -0800, Jacob Pan wrote:
> >
> >>> I also like #2 too. Specially now that it is not limited to a
> >>> specific platform. One question though, could you still keep the
> >>> cooling device support of it? In some systems, it might make
> >>> sense to enable / disable idle injections based on temperature.
> >
> >> One of the key difference between 1 and 2 is that #2 is open loop
> >> control, since we don't have CPU c-states info baked into
> >> scheduler.
> >
> > _yet_, there's people working on that. The whole power aware
> > scheduling stuff needs that.
>
> Isn't the idle state information (rq->idle_state) already used in
> find_idlest_cpu()?
>
> What we use in energy aware scheduling is quite similar but since
> we're interested in the index information of the c-state (to access
> the right element of the idle_state vectors of the energy model, we
> added rq->idle_state_idx.
>
what i am interested is not per cpu idle state but rather at the package
level or domain. It must be an indication for the overlapped idle time.
Usually has to come from HW counters.

2015-11-06 20:53:29

by Jacob Pan

[permalink] [raw]

Subject: Re: [RFC PATCH 0/3] CFS idle injection

On Fri, 06 Nov 2015 16:50:15 +0000
Punit Agrawal <[email protected]> wrote:

> * idle injection once frequencies have been capped to the lowest
> feasible values (as suggested in the cover letter)
>
actually, I was suggesting to start considering idle injection once
frequency capped to the energy efficient point, which can be much
higher than the lowest frequency. The idea being, deep idle power is
negligible compared to running power which allows near linear
power-perf scaling for balanced workload.
Below energy efficient frequency, continuous lowering frequency may
lose disproportion performance vs. power. i.e. worse than linear.

> One question about the implementation in these patches - should the
> implementation hook into pick_next_task in core instead of CFS? Higher
> priority tasks might get in the way of idle injection.
My take is that RT and throttling will never go well together since they
are conflicting in principle.

2015-11-06 21:55:54

by Dietmar Eggemann

[permalink] [raw]

Subject: Re: [RFC PATCH 0/3] CFS idle injection

On 11/06/2015 07:10 PM, Jacob Pan wrote:
> On Fri, 6 Nov 2015 18:30:01 +0000
> Dietmar Eggemann <[email protected]> wrote:
>
>> On 05/11/15 10:12, Peter Zijlstra wrote:
>>>
>>> People, trim your emails!
>>>
>>> On Wed, Nov 04, 2015 at 08:58:30AM -0800, Jacob Pan wrote:
>>>
>>>>> I also like #2 too. Specially now that it is not limited to a
>>>>> specific platform. One question though, could you still keep the
>>>>> cooling device support of it? In some systems, it might make
>>>>> sense to enable / disable idle injections based on temperature.
>>>
>>>> One of the key difference between 1 and 2 is that #2 is open loop
>>>> control, since we don't have CPU c-states info baked into
>>>> scheduler.
>>>
>>> _yet_, there's people working on that. The whole power aware
>>> scheduling stuff needs that.
>>
>> Isn't the idle state information (rq->idle_state) already used in
>> find_idlest_cpu()?
>>
>> What we use in energy aware scheduling is quite similar but since
>> we're interested in the index information of the c-state (to access
>> the right element of the idle_state vectors of the energy model, we
>> added rq->idle_state_idx.
>>
> what i am interested is not per cpu idle state but rather at the package
> level or domain. It must be an indication for the overlapped idle time.
> Usually has to come from HW counters.

I see. We have a similar problem with the Energy Model (EM) on cluster
level (sched domain level DIE). We iterate over the cpus of a sched
group and declare the shallowest cpu idle state as the cluster idle
state to index our EM. On a typical ARM system we have (active, WFI,
cpu-off and cluster-off). But I guess for you the idle state index is
only for core idle states and you can't draw any conclusions from this
for the package idle states.

2015-11-06 23:50:25

by Jacob Pan

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

On Fri, 6 Nov 2015 08:45:10 +0100
Peter Zijlstra <[email protected]> wrote:

> On Thu, Nov 05, 2015 at 03:36:25PM -0800, Jacob Pan wrote:
>
> > I did some testing with the code below, it shows random
> > [ 150.442597] NOHZ: local_softirq_pending 02
> > [ 153.032673] NOHZ: local_softirq_pending 202
> > [ 153.203785] NOHZ: local_softirq_pending 202
> > [ 153.206486] NOHZ: local_softirq_pending 282
> > I recalled that was why i checked for local_softirq_pending in the
> > initial patch, still trying to find out how we can avoid that. These
> > also causes non stop sched ticks in the inner idle loop.
>
> Check the softirq stuff before calling throttle ?

yes, played with it but it seems there are other cases causing pending
softirq in idle in addition to throttle. I still haven't figure it out,
this problem only shows up in heavy irq, network load. e.g. compile
kernel over NFS. Debugging.

2015-11-09 11:57:57

by Punit Agrawal

[permalink] [raw]

Subject: Re: [RFC PATCH 0/3] CFS idle injection

Jacob Pan <[email protected]> writes:

> On Fri, 06 Nov 2015 16:50:15 +0000
> Punit Agrawal <[email protected]> wrote:
>
>> * idle injection once frequencies have been capped to the lowest
>> feasible values (as suggested in the cover letter)
>>
> actually, I was suggesting to start considering idle injection once
> frequency capped to the energy efficient point, which can be much
> higher than the lowest frequency. The idea being, deep idle power is
> negligible compared to running power which allows near linear
> power-perf scaling for balanced workload.
> Below energy efficient frequency, continuous lowering frequency may
> lose disproportion performance vs. power. i.e. worse than linear.
>

I agree. I was making that assumption that with the ability to inject
idle states, there wouldn't be a need to expose the inefficient
frequency states.

Do you still see a reason to do that?

>> One question about the implementation in these patches - should the
>> implementation hook into pick_next_task in core instead of CFS? Higher
>> priority tasks might get in the way of idle injection.
> My take is that RT and throttling will never go well together since they
> are conflicting in principle.

I am not sure I follow. If RT (or other higher priority classes) can't
be throttled then the CPUs are not able to contribute towards
constraining power consumption and hence temperature.

This is especially true in certain platforms where tasks belong to the
RT class to maintain user experience, e.g., audio and video.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2015-11-09 14:15:44

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC PATCH 0/3] CFS idle injection

On Mon, Nov 09, 2015 at 11:56:51AM +0000, Punit Agrawal wrote:
> Jacob Pan <[email protected]> writes:
> > My take is that RT and throttling will never go well together since they
> > are conflicting in principle.
>
> I am not sure I follow. If RT (or other higher priority classes) can't
> be throttled then the CPUs are not able to contribute towards
> constraining power consumption and hence temperature.
>
> This is especially true in certain platforms where tasks belong to the
> RT class to maintain user experience, e.g., audio and video.

Audio/Video playback generally doesn't take a _lot_ of time these days.
What is important though is _when_ it happens.

And media playback typically already has a very well defined and stable
cadence (24Hz or whatnot). What you want is for your idle injector to
sync up with that, not disrupt it.

For other workloads, missing a deadline is about as bad as destroying
the chip, complete system shutdown might be safer than getting delayed.
(The very tired scenario of a saw, a laser and your finger; you want to
shut down the entire machine rather than just cut off your finger.)

2015-11-09 14:36:59

by Jacob Pan

[permalink] [raw]

Subject: Re: [RFC PATCH 0/3] CFS idle injection

On Mon, 09 Nov 2015 11:56:51 +0000
Punit Agrawal <[email protected]> wrote:

> > actually, I was suggesting to start considering idle injection once
> > frequency capped to the energy efficient point, which can be much
> > higher than the lowest frequency. The idea being, deep idle power is
> > negligible compared to running power which allows near linear
> > power-perf scaling for balanced workload.
> > Below energy efficient frequency, continuous lowering frequency may
> > lose disproportion performance vs. power. i.e. worse than linear.
> >
>
> I agree. I was making that assumption that with the ability to inject
> idle states, there wouldn't be a need to expose the inefficient
> frequency states.
>
> Do you still see a reason to do that?
yes, but it is up to a governor or management sw to decide when to to
pick what mechanism. there may be certain workload scale better with
frequency change. e.g. unbalanced workload, we don't want to inject
idle to all cpus if just one is busy. but it is also unlikely to run
into thermal issue in this case.

2015-11-09 14:43:30

by Jacob Pan

[permalink] [raw]

Subject: Re: [RFC PATCH 0/3] CFS idle injection

On Mon, 9 Nov 2015 15:15:34 +0100
Peter Zijlstra <[email protected]> wrote:

> On Mon, Nov 09, 2015 at 11:56:51AM +0000, Punit Agrawal wrote:
> > Jacob Pan <[email protected]> writes:
> > > My take is that RT and throttling will never go well together
> > > since they are conflicting in principle.
> >
> > I am not sure I follow. If RT (or other higher priority classes)
> > can't be throttled then the CPUs are not able to contribute towards
> > constraining power consumption and hence temperature.
> >
> > This is especially true in certain platforms where tasks belong to
> > the RT class to maintain user experience, e.g., audio and video.
>
> Audio/Video playback generally doesn't take a _lot_ of time these
> days. What is important though is _when_ it happens.
>
> And media playback typically already has a very well defined and
> stable cadence (24Hz or whatnot). What you want is for your idle
> injector to sync up with that, not disrupt it.
>
Agreed, i have tested idle injection on video playback (mostly one cpu
busy, no sync with gpu), it does not do well to improve energy
efficiency. With the video playback being offloaded, there is no
thermal condition either. So outside the scope of this first patchst
trying to solve. The ability to sync with external pattern, could be the
next step. kind of like pll in hw :).

> For other workloads, missing a deadline is about as bad as destroying
> the chip, complete system shutdown might be safer than getting
> delayed. (The very tired scenario of a saw, a laser and your finger;
> you want to shut down the entire machine rather than just cut off
> your finger.)
>
>
>

2015-11-09 21:24:03

by Jacob Pan

[permalink] [raw]

Subject: Re: [RFC PATCH 0/3] CFS idle injection

On Fri, 6 Nov 2015 21:55:49 +0000
Dietmar Eggemann <[email protected]> wrote:

> > what i am interested is not per cpu idle state but rather at the
> > package level or domain. It must be an indication for the
> > overlapped idle time. Usually has to come from HW counters.
>
> I see. We have a similar problem with the Energy Model (EM) on
> cluster level (sched domain level DIE). We iterate over the cpus of a
> sched group and declare the shallowest cpu idle state as the cluster
> idle state to index our EM. On a typical ARM system we have (active,
> WFI, cpu-off and cluster-off). But I guess for you the idle state
> index is only for core idle states and you can't draw any conclusions
> from this for the package idle states.
what is WFI?
For Intel, idle states are hints to the HW. The FW decides how far the
idle can go based on many factors, device states included, some are
visible to the OS some are not. We just to help mature such deep idle
conditions.

2015-11-09 21:45:19

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC PATCH 0/3] CFS idle injection

On Mon, Nov 09, 2015 at 01:23:04PM -0800, Jacob Pan wrote:
> what is WFI?

Wait For Interrupt; very like the x86 HLT thing.

> For Intel, idle states are hints to the HW. The FW decides how far the
> idle can go based on many factors, device states included, some are
> visible to the OS some are not. We just to help mature such deep idle
> conditions.

On some ARM you have to manually orchestrate cluster idle, which is
clustered idle states in cpuidle. The up-side is that you explicitly
know about them, the down side is that its cross CPU bits and a freak
show (think doing cross CPU atomics while a CPU isn't in the coherency
domain yet).

2015-11-10 00:20:13

by Jacob Pan

[permalink] [raw]

Subject: Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

On Fri, 6 Nov 2015 15:49:29 -0800
Jacob Pan <[email protected]> wrote:

> > Check the softirq stuff before calling throttle ?
>
> yes, played with it but it seems there are other cases causing pending
> softirq in idle in addition to throttle. I still haven't figure it
> out, this problem only shows up in heavy irq, network load. e.g.
> compile kernel over NFS. Debugging.
ok, I added a check for softirq_pending and a retry. seems to work. Now
idle injection will allow softirq and softirqd to run. The caveat is
that during that injection period, if softirqd does not run for the
entire duration, other normal tasks would also run during forced idle.
But just for that period. I guess we have to strike for the right
balance for QoS and overhead. For most workload, pending softirq is
rare so the tasks slip under softirqd are also rare. Will send out V2
soon.

Thanks,

Jacob

2015-11-10 10:07:42

by Juri Lelli

[permalink] [raw]

Subject: Re: [RFC PATCH 0/3] CFS idle injection

Hi,

On 9 November 2015 at 14:15, Peter Zijlstra <[email protected]> wrote:
> On Mon, Nov 09, 2015 at 11:56:51AM +0000, Punit Agrawal wrote:
>> Jacob Pan <[email protected]> writes:
>> > My take is that RT and throttling will never go well together since they
>> > are conflicting in principle.
>>
>> I am not sure I follow. If RT (or other higher priority classes) can't
>> be throttled then the CPUs are not able to contribute towards
>> constraining power consumption and hence temperature.
>>
>> This is especially true in certain platforms where tasks belong to the
>> RT class to maintain user experience, e.g., audio and video.
>
> Audio/Video playback generally doesn't take a _lot_ of time these days.
> What is important though is _when_ it happens.
>
> And media playback typically already has a very well defined and stable
> cadence (24Hz or whatnot). What you want is for your idle injector to
> sync up with that, not disrupt it.
>

Do you think that using SCHED_DEADLINE here would be completely
foolish? I mean, we would have the duty_cycle/period thing for free, it
would be know to the scheduler (as to maybe address Thomas' concerns)
and we could think to make idle injection part of system analysis (for
the soft-RT use cases).

Thanks,

- Juri

> For other workloads, missing a deadline is about as bad as destroying
> the chip, complete system shutdown might be safer than getting delayed.
> (The very tired scenario of a saw, a laser and your finger; you want to
> shut down the entire machine rather than just cut off your finger.)
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2015-11-10 10:34:46

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC PATCH 0/3] CFS idle injection

On Tue, Nov 10, 2015 at 10:07:35AM +0000, Juri Lelli wrote:
> Do you think that using SCHED_DEADLINE here would be completely
> foolish? I mean, we would have the duty_cycle/period thing for free, it
> would be know to the scheduler (as to maybe address Thomas' concerns)
> and we could think to make idle injection part of system analysis (for
> the soft-RT use cases).

DEADLINE would be awesome, but I think we need work on two fronts before
we can really sell it as the awesome that it is ;-)

- greedy and or statistical bounds
- !priv

The first is such that we can better deal with the erratic nature of
media decode without going full worst case on it, and the second just
makes it so much more accessible.

2015-11-10 10:56:40

by Juri Lelli

[permalink] [raw]

Subject: Re: [RFC PATCH 0/3] CFS idle injection

On 11/10/15, Peter Zijlstra wrote:
> On Tue, Nov 10, 2015 at 10:07:35AM +0000, Juri Lelli wrote:
> > Do you think that using SCHED_DEADLINE here would be completely
> > foolish? I mean, we would have the duty_cycle/period thing for free, it
> > would be know to the scheduler (as to maybe address Thomas' concerns)
> > and we could think to make idle injection part of system analysis (for
> > the soft-RT use cases).
>
> DEADLINE would be awesome, but I think we need work on two fronts before
> we can really sell it as the awesome that it is ;-)
>
> - greedy and or statistical bounds
> - !priv
>
> The first is such that we can better deal with the erratic nature of
> media decode without going full worst case on it, and the second just
> makes it so much more accessible.
>

Right. For the first point I think we just need to make our off-line
calculation right, not that we need to modify implementation. On the
second point we need more work yes. Also, another thing that is missing
is frequency (uarch) scaling for reservations parameters, something alike
what we are doing for CFS; this last point might be solved sooner :-).

Thanks,

- Juri