LinuxLists.cc - [PATCH 0/3] [idled]: Idle Cycle Injector for power capping

[permalink] [raw]

Subject: [PATCH 1/3] [kidled]: introduce kidled.

From: Salman Qazi <[email protected]>

kidled is a kernel thread that implements idle cycle injection for
the purposes of power capping. It measures the naturally occuring
idle time as necessary to avoid injecting idle cycles when the
CPU is already sufficiently idle. The actual idle cycle injection
takes places in a realtime kernel thread, where as the measurements
take place in hrtimer callback functions.

Signed-off-by: Salman Qazi <[email protected]>
---
Documentation/kidled.txt | 40 +++
arch/x86/Kconfig | 1
arch/x86/include/asm/idle.h | 1
arch/x86/kernel/process_64.c | 2
drivers/misc/Gconfig.ici | 1
include/linux/kidled.h | 45 +++
kernel/Kconfig.ici | 6
kernel/Makefile | 1
kernel/kidled.c | 547 ++++++++++++++++++++++++++++++++++++++++++
kernel/softirq.c | 15 +
kernel/sysctl.c | 11 +
11 files changed, 664 insertions(+), 6 deletions(-)
create mode 100644 Documentation/kidled.txt
create mode 100644 drivers/misc/Gconfig.ici
create mode 100644 include/linux/kidled.h
create mode 100644 kernel/Kconfig.ici
create mode 100644 kernel/kidled.c

diff --git a/Documentation/kidled.txt b/Documentation/kidled.txt
new file mode 100644
index 0000000..1149e3f
--- /dev/null
+++ b/Documentation/kidled.txt
@@ -0,0 +1,40 @@
+Idle Cycle Injector:
+====================
+
+Overview:
+
+Provides a kernel interface for causing the CPUs to have some
+minimum percentage of the idle time.
+
+Interfaces:
+
+Under /proc/sys/kernel/kidled/, we can find the following files:
+
+cpu/*/interval
+cpu/*/min_idle_percent
+cpu/*/stats
+
+interval specifies the period of time over which we attempt to make the
+CPU min_idle_percent idle. stats provides three fields. The first is
+the naturally occuring idle time. The second is the busy time, and the last
+is the injected idle time. All three values are reported in the units of
+nanoseconds.
+
+** VERY IMPORTANT NOTE: ** In all kernel stats except for cpu/*/stats, the
+injected idle cycles are by convention reported as busy time, attributed to
+kidled.
+
+
+Operation:
+
+The injecting component of the idle cycle injector is the kernel thread
+kidled. The measurements to determine when to inject idle cycles is done
+in hrtimer callbacks. The idea is to avoid injecting idle cycles when
+the CPU is already sufficiently idle. This is accomplished by always setting
+the next timer expiry to the minimum of when we expect to run out of CPU time
+(running at full tilt) or the end of the interval. When the timer expires,
+we evaluate if we need to inject idle cycles right away to avoid blowing our
+quota. If that's the case, then we inject idle cycles until the end of the
+interval.
+
+
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..cd384e1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -754,6 +754,7 @@ config SCHED_MC
increased overhead in some places. If unsure say N here.

source "kernel/Kconfig.preempt"
+source "kernel/Kconfig.ici"

config X86_UP_APIC
bool "Local APIC support on uniprocessors"
diff --git a/arch/x86/include/asm/idle.h b/arch/x86/include/asm/idle.h
index 38d8737..e36c5b4 100644
--- a/arch/x86/include/asm/idle.h
+++ b/arch/x86/include/asm/idle.h
@@ -10,6 +10,7 @@ void idle_notifier_unregister(struct notifier_block *n);

#ifdef CONFIG_X86_64
void enter_idle(void);
+void __exit_idle(void);
void exit_idle(void);
#else /* !CONFIG_X86_64 */
static inline void enter_idle(void) { }
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 126f0b4..a7c8932 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -77,7 +77,7 @@ void enter_idle(void)
atomic_notifier_call_chain(&idle_notifier, IDLE_START, NULL);
}

-static void __exit_idle(void)
+void __exit_idle(void)
{
if (x86_test_and_clear_bit_percpu(0, is_idle) == 0)
return;
diff --git a/drivers/misc/Gconfig.ici b/drivers/misc/Gconfig.ici
new file mode 100644
index 0000000..ecad2be
--- /dev/null
+++ b/drivers/misc/Gconfig.ici
@@ -0,0 +1 @@
+CONFIG_IDLE_CYCLE_INJECTOR=y
diff --git a/include/linux/kidled.h b/include/linux/kidled.h
new file mode 100644
index 0000000..7940dfa
--- /dev/null
+++ b/include/linux/kidled.h
@@ -0,0 +1,45 @@
+/*
+ * Copyright 2008 Google Inc.
+ *
+ * Author: [email protected]
+ *
+ */
+
+#include <linux/tick.h>
+
+#ifndef _IDLED_H
+#define _IDLED_H
+
+DECLARE_PER_CPU(unsigned long, cpu_lazy_inject_count);
+
+static inline s64 current_cpu_lazy_inject_count(void)
+{
+ /* We'll update this value in the idle cycle injector */
+ return __get_cpu_var(cpu_lazy_inject_count);
+}
+
+static inline s64 current_cpu_inject_count(void)
+{
+ return current_cpu_lazy_inject_count();
+}
+
+
+static inline s64 current_cpu_idle_count(void)
+{
+ int cpu = smp_processor_id();
+ struct tick_sched *ts = tick_get_tick_sched(cpu);
+ return ktime_to_ns(ts->idle_sleeptime) + current_cpu_inject_count();
+}
+
+static inline s64 current_cpu_busy_count(void)
+{
+ int cpu = smp_processor_id();
+ struct tick_sched *ts = tick_get_tick_sched(cpu);
+ return ktime_to_ns(ktime_sub(ktime_get(), ts->idle_sleeptime)) -
+ current_cpu_inject_count();
+}
+
+void kidled_interrupt_enter(void);
+void set_cpu_idle_ratio(int cpu, long idle_time, long busy_time);
+void get_cpu_idle_ratio(int cpu, long *idle_time, long *busy_time);
+#endif
diff --git a/kernel/Kconfig.ici b/kernel/Kconfig.ici
new file mode 100644
index 0000000..db5db95
--- /dev/null
+++ b/kernel/Kconfig.ici
@@ -0,0 +1,6 @@
+config IDLE_CYCLE_INJECTOR
+ bool "Idle Cycle Injector"
+ default n
+ help
+ Reduces power consumption by making sure that each CPU is
+ idle the given percentage of time.
diff --git a/kernel/Makefile b/kernel/Makefile
index 864ff75..fc82197 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -24,6 +24,7 @@ CFLAGS_REMOVE_sched_clock.o = -pg
CFLAGS_REMOVE_perf_event.o = -pg
endif

+obj-$(CONFIG_IDLE_CYCLE_INJECTOR) += kidled.o
obj-$(CONFIG_FREEZER) += freezer.o
obj-$(CONFIG_PROFILING) += profile.o
obj-$(CONFIG_SYSCTL_SYSCALL_CHECK) += sysctl_check.o
diff --git a/kernel/kidled.c b/kernel/kidled.c
new file mode 100644
index 0000000..f590178
--- /dev/null
+++ b/kernel/kidled.c
@@ -0,0 +1,547 @@
+/*
+ * Copyright 2008 Google Inc.
+ *
+ * Idle Cycle Injector, also affectionately known as "kidled".
+ *
+ * Allows us to force each processor to have a specific amount of idle
+ * cycles for the purposes of controlling the power consumed by the machine.
+ *
+ * Authors:
+ *
+ * Salman Qazi <[email protected]>
+ * Ken Chen <[email protected]>
+ */
+
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/cpu.h>
+#include <linux/timer.h>
+#include <linux/uaccess.h>
+#include <linux/proc_fs.h>
+#include <linux/sched.h>
+#include <linux/kidled.h>
+#include <linux/poll.h>
+#include <linux/hrtimer.h>
+#include <linux/spinlock.h>
+#include <linux/sysctl.h>
+#include <linux/irqflags.h>
+#include <linux/timer.h>
+#include <asm/atomic.h>
+#include <asm/idle.h>
+
+#ifdef CONFIG_HIGH_RES_TIMERS
+#define SLEEP_GRANULARITY (20*NSEC_PER_USEC)
+#else
+#define SLEEP_GRANULARITY (NSEC_PER_MSEC)
+#endif
+
+#define KIDLED_PRIO (MAX_RT_PRIO - 2)
+#define KIDLED_DEFAULT_INTERVAL (100 * NSEC_PER_MSEC)
+
+struct kidled_inputs {
+ spinlock_t lock;
+ long idle_time;
+ long busy_time;
+};
+
+static int kidled_init_completed;
+static DEFINE_PER_CPU(struct task_struct *, kidled_thread);
+static DEFINE_PER_CPU(struct kidled_inputs, kidled_inputs);
+
+DEFINE_PER_CPU(unsigned long, cpu_lazy_inject_count);
+
+struct monitor_cpu_data {
+ int cpu;
+ long base_clock_count;
+ long base_cpu_count;
+ long max_clock_time;
+ long max_cpu_time;
+ long clock_time;
+ long cpu_time;
+};
+
+static DEFINE_PER_CPU(struct monitor_cpu_data, monitor_cpu_data);
+
+
+static DEFINE_PER_CPU(int, in_lazy_inject);
+static DEFINE_PER_CPU(unsigned long, inject_start);
+static void __enter_lazy_inject(void)
+{
+ if (!__get_cpu_var(in_lazy_inject)) {
+ __get_cpu_var(inject_start) = ktime_to_ns(ktime_get());
+ __get_cpu_var(in_lazy_inject) = 1;
+ }
+ enter_idle();
+}
+
+static void __exit_lazy_inject(void)
+{
+ if (__get_cpu_var(in_lazy_inject)) {
+ get_cpu_var(cpu_lazy_inject_count) +=
+ ktime_to_ns(ktime_get()) - __get_cpu_var(inject_start);
+ __get_cpu_var(in_lazy_inject) = 0;
+ }
+ __exit_idle();
+}
+
+static void enter_lazy_inject(void)
+{
+ local_irq_disable();
+ __enter_lazy_inject();
+ local_irq_enable();
+}
+
+static void exit_lazy_inject(void)
+{
+ local_irq_disable();
+ __exit_lazy_inject();
+ local_irq_enable();
+}
+
+/* Caller must have interrupts disabled */
+void kidled_interrupt_enter(void)
+{
+ if (!kidled_init_completed)
+ return;
+
+ __exit_lazy_inject();
+}
+
+static DEFINE_PER_CPU(int, still_lazy_injecting);
+static enum hrtimer_restart lazy_inject_timer_func(struct hrtimer *timer)
+{
+ __get_cpu_var(still_lazy_injecting) = 0;
+ return HRTIMER_NORESTART;
+}
+
+static void do_idle(void)
+{
+ void (*idle)(void) = NULL;
+
+ idle = pm_idle;
+ if (!idle)
+ idle = default_idle;
+
+ /* Put CPU to sleep until next interrupt */
+ idle();
+}
+
+/* Halts the CPU for the given number of nanoseconds.
+ *
+ * The cond_resched in there must be used responsibly, in the sense
+ * that we should have a minimal amount of work that the kernel
+ * wants done even when we are injecting idle cycles. This work
+ * should be accounted for by higher level users.
+ */
+static void lazy_inject(long nsecs, long interval)
+{
+ struct hrtimer halt_timer;
+
+ if (nsecs <= 0)
+ return;
+
+ __get_cpu_var(still_lazy_injecting) = 1;
+ hrtimer_init(&halt_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ hrtimer_set_expires(&halt_timer, ktime_set(0, nsecs));
+ halt_timer.function = lazy_inject_timer_func;
+ hrtimer_start(&halt_timer, ktime_set(0, nsecs), HRTIMER_MODE_REL);
+
+ while (__get_cpu_var(still_lazy_injecting)) {
+
+ enter_lazy_inject();
+
+ /* Put CPU to sleep until next interrupt */
+ do_idle();
+ exit_lazy_inject();
+
+ /* The supervising userland thread needs to run with
+ * minimal latency. We yield to higher priority threads
+ */
+ cond_resched();
+ }
+ __get_cpu_var(still_lazy_injecting) = 0;
+ hrtimer_cancel(&halt_timer);
+}
+
+static DEFINE_PER_CPU(int, still_monitoring);
+
+/*
+ * Tells us when we would need to wake up next.
+ */
+long get_next_timer(struct monitor_cpu_data *data)
+{
+ long lazy;
+
+ lazy = min(data->max_cpu_time - data->cpu_time,
+ data->max_clock_time - data->clock_time);
+
+ lazy -= SLEEP_GRANULARITY - 1;
+
+ return lazy;
+}
+
+/*
+ * Figures out if the idle cycle injector needs to be woken up at the moment.
+ * If yes, then we go ahead and wake it up. If no, then we figure out the
+ * next time when we should make the same decision. The idea is to always
+ * make the decision before the applications use up the available CPU or
+ * clock time.
+ *
+ */
+static enum hrtimer_restart monitor_cpu_timer_func(struct hrtimer *timer)
+{
+ long next_timer;
+ struct monitor_cpu_data *data = &__get_cpu_var(monitor_cpu_data);
+
+ BUG_ON(data->cpu != smp_processor_id());
+ data->clock_time = ktime_to_ns(ktime_get()) - data->base_clock_count;
+ data->cpu_time = current_cpu_busy_count() - data->base_cpu_count;
+
+ if ((data->max_clock_time - data->clock_time < SLEEP_GRANULARITY) ||
+ (data->max_cpu_time - data->cpu_time < SLEEP_GRANULARITY)) {
+ __get_cpu_var(still_monitoring) = 0;
+
+ wake_up_process(__get_cpu_var(kidled_thread));
+ return HRTIMER_NORESTART;
+ } else {
+ next_timer = get_next_timer(data);
+
+ hrtimer_forward_now(timer, ktime_set(0, next_timer));
+ return HRTIMER_RESTART;
+ }
+}
+
+/*
+ * Allow other processes to use CPU for up to max_clock_time
+ * clock time, and max_cpu_time CPU time.
+ *
+ * Accurate only up to resolution of hrtimers.
+ *
+ * @return: Clock time left
+ */
+static unsigned long monitor_cpu(long max_clock_time, long max_cpu_time,
+ long *left_cpu_time)
+{
+ long first_timer;
+ struct hrtimer sleep_timer;
+ struct monitor_cpu_data *data = &__get_cpu_var(monitor_cpu_data);
+ data->max_clock_time = max_clock_time;
+ data->max_cpu_time = max_cpu_time;
+ data->base_clock_count = ktime_to_ns(ktime_get());
+ data->base_cpu_count = current_cpu_busy_count();
+ data->clock_time = 0;
+ data->cpu_time = 0;
+ data->cpu = smp_processor_id();
+
+ first_timer = get_next_timer(data);
+ if (first_timer <= 0) {
+ if (left_cpu_time)
+ *left_cpu_time = max_cpu_time;
+
+ return max_clock_time;
+ }
+
+ __get_cpu_var(still_monitoring) = 1;
+ hrtimer_init(&sleep_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ hrtimer_set_expires(&sleep_timer, ktime_set(0, first_timer));
+ sleep_timer.function = monitor_cpu_timer_func;
+ hrtimer_start(&sleep_timer, ktime_set(0, first_timer),
+ HRTIMER_MODE_REL);
+ while (1) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (!__get_cpu_var(still_monitoring))
+ break;
+ schedule();
+ }
+
+ __get_cpu_var(still_monitoring) = 0;
+ hrtimer_cancel(&sleep_timer);
+
+ if (left_cpu_time)
+ *left_cpu_time = max(data->max_cpu_time - data->cpu_time, 0L);
+
+ return max(data->max_clock_time - data->clock_time, 0L);
+}
+
+static int kidled(void *p)
+{
+ struct kidled_inputs *inputs = (struct kidled_inputs *)p;
+ long idle_time = 0;
+ long busy_time = 0;
+ long old_idle_time;
+ long old_busy_time;
+ long interval = 0;
+ unsigned long nsecs_left = 0;
+ __get_cpu_var(still_lazy_injecting) = 0;
+ allow_signal(SIGHUP);
+
+ while (1) {
+ old_idle_time = idle_time;
+ old_busy_time = busy_time;
+ spin_lock(&inputs->lock);
+ busy_time = inputs->busy_time;
+ idle_time = inputs->idle_time;
+
+ /* Just incase we get spurious SIGHUPs */
+ if ((old_idle_time != idle_time) ||
+ (old_busy_time != busy_time)) {
+ interval = idle_time + busy_time;
+ }
+ flush_signals(current);
+ spin_unlock(&inputs->lock);
+
+ /* Keep overhead low when dormant */
+ if (idle_time == 0) {
+ while (!signal_pending(current)) {
+ schedule_timeout_interruptible(
+ MAX_SCHEDULE_TIMEOUT);
+ }
+ }
+
+ while (!signal_pending(current)) {
+ nsecs_left = monitor_cpu(interval, busy_time, NULL);
+ lazy_inject(nsecs_left, interval);
+ }
+ }
+}
+
+void set_cpu_idle_ratio(int cpu, long idle_time, long busy_time)
+{
+ spin_lock(&per_cpu(kidled_inputs, cpu).lock);
+ per_cpu(kidled_inputs, cpu).idle_time = idle_time;
+ per_cpu(kidled_inputs, cpu).busy_time = busy_time;
+ send_sig(SIGHUP, per_cpu(kidled_thread, cpu), 1);
+ spin_unlock(&per_cpu(kidled_inputs, cpu).lock);
+}
+
+void get_cpu_idle_ratio(int cpu, long *idle_time, long *busy_time)
+{
+ spin_lock(&per_cpu(kidled_inputs, cpu).lock);
+ *idle_time = per_cpu(kidled_inputs, cpu).idle_time;
+ *busy_time = per_cpu(kidled_inputs, cpu).busy_time;
+ spin_unlock(&per_cpu(kidled_inputs, cpu).lock);
+}
+
+static long get_kidled_interval(int cpu)
+{
+ long idle_time;
+ long busy_time;
+ get_cpu_idle_ratio(cpu, &idle_time, &busy_time);
+ return idle_time + busy_time;
+}
+
+static void set_kidled_interval(int cpu, long interval)
+{
+ int old_interval;
+ spin_lock(&per_cpu(kidled_inputs, cpu).lock);
+ old_interval = per_cpu(kidled_inputs, cpu).busy_time +
+ per_cpu(kidled_inputs, cpu).idle_time;
+ per_cpu(kidled_inputs, cpu).idle_time =
+ (per_cpu(kidled_inputs, cpu).idle_time
+ * interval) / old_interval;
+ per_cpu(kidled_inputs, cpu).busy_time = interval -
+ per_cpu(kidled_inputs, cpu).idle_time;
+ send_sig(SIGHUP, per_cpu(kidled_thread, cpu), 1);
+ spin_unlock(&per_cpu(kidled_inputs, cpu).lock);
+}
+
+static int proc_min_idle_percent(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ long idle_time;
+ long busy_time;
+ int ratio;
+ struct ctl_table fake = {};
+ int zero = 0;
+ int hundred = 100;
+ int ret;
+
+ int cpu = (int)((long)table->extra1);
+
+ fake.data = &ratio;
+ fake.maxlen = sizeof(int);
+ fake.extra1 = &zero;
+ fake.extra2 = &hundred;
+
+
+ if (!write) {
+ get_cpu_idle_ratio(cpu, &idle_time, &busy_time);
+ ratio = (int)((idle_time * 100) / (idle_time + busy_time));
+ }
+
+ ret = proc_dointvec_minmax(&fake, write, buffer, lenp, ppos);
+
+ if (!ret && write) {
+ int idle_interval;
+
+ idle_interval = get_kidled_interval(cpu);
+ idle_time = ((long)ratio * idle_interval) / 100;
+
+ /* round down new_idle to timer resolution */
+ idle_time = (idle_time / SLEEP_GRANULARITY) *
+ SLEEP_GRANULARITY;
+
+ set_cpu_idle_ratio(cpu, idle_time,
+ idle_interval - idle_time);
+ }
+
+ return ret;
+}
+
+static int proc_interval(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ long idle_time;
+ long busy_time;
+ int interval;
+ struct ctl_table fake = {};
+ int min = 1;
+ int max = 500;
+ int ret;
+
+ int cpu = (int)((long)table->extra1);
+
+ fake.data = &interval;
+ fake.maxlen = sizeof(int);
+ fake.extra1 = &min;
+ fake.extra2 = &max;
+
+
+ if (!write) {
+ get_cpu_idle_ratio(cpu, &idle_time, &busy_time);
+ interval = (int)((idle_time + busy_time) / NSEC_PER_MSEC);
+ }
+
+ ret = proc_dointvec_minmax(&fake, write, buffer, lenp, ppos);
+
+ if (!ret && write)
+ set_kidled_interval(cpu, (long)interval * NSEC_PER_MSEC);
+
+ return ret;
+}
+
+static void getstats(void *info)
+{
+ unsigned long *stats = (unsigned long *)info;
+ stats[0] = current_cpu_idle_count();
+ stats[1] = current_cpu_busy_count();
+ stats[2] = current_cpu_lazy_inject_count();
+}
+
+
+static int proc_stats(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int ret;
+ unsigned long stats[3];
+ int cpu = (int)((long)table->extra1);
+ struct ctl_table fake = {};
+
+ if (write)
+ return -EINVAL;
+
+ fake.data = stats;
+ fake.maxlen = 3*sizeof(unsigned long);
+
+ ret = smp_call_function_single(cpu, getstats, &stats, 1);
+ if (ret)
+ return ret;
+
+ return proc_doulongvec_minmax(&fake, write, buffer, lenp, ppos);
+
+}
+
+#define NUM_CPU_CTLS 3
+#define CPU_NUM_SIZE 5
+
+static struct ctl_table kidled_cpu_dir_prot[NUM_CPU_CTLS + 1] = {
+ {
+ .procname = "min_idle_percent",
+ .proc_handler = proc_min_idle_percent,
+ .mode = 0644,
+ },
+ {
+ .procname = "interval",
+ .proc_handler = proc_interval,
+ .mode = 0644,
+ },
+ {
+ .procname = "stats",
+ .proc_handler = proc_stats,
+ .mode = 0444,
+ },
+
+ { }
+
+};
+static DEFINE_PER_CPU(char[CPU_NUM_SIZE], cpu_num);
+
+static DEFINE_PER_CPU(struct ctl_table[NUM_CPU_CTLS + 1],
+ kidled_cpu_dir_table);
+
+/* This is the kidled/cpu/ directory */
+static struct ctl_table kidled_cpu_table[NR_CPUS + 1];
+
+static int zero;
+
+struct ctl_table kidled_table[] = {
+ {
+ .procname = "cpu",
+ .mode = 0555,
+ .child = kidled_cpu_table,
+ },
+ { }
+};
+
+static int __init kidled_init(void)
+{
+ int cpu;
+ int i;
+
+ /*
+ * One priority level below maximum. The next higher priority level
+ * will be used by a userland thread supervising us.
+ */
+ struct sched_param param = { .sched_priority = KIDLED_PRIO };
+
+ if (!proc_mkdir("driver/kidled", NULL))
+ return 1;
+
+ for_each_online_cpu(cpu) {
+ spin_lock_init(&per_cpu(kidled_inputs, cpu).lock);
+ per_cpu(kidled_inputs, cpu).idle_time = 0;
+ per_cpu(kidled_inputs, cpu).busy_time =
+ KIDLED_DEFAULT_INTERVAL;
+ per_cpu(kidled_thread, cpu) = kthread_create(kidled,
+ &per_cpu(kidled_inputs, cpu), "kidled/%d", cpu);
+ if (IS_ERR(per_cpu(kidled_thread, cpu))) {
+ printk(KERN_ERR "Failed to start kidled on CPU %d\n",
+ cpu);
+ BUG();
+ }
+
+ kthread_bind(per_cpu(kidled_thread, cpu), cpu);
+ sched_setscheduler(per_cpu(kidled_thread, cpu),
+ SCHED_FIFO, &param);
+ wake_up_process(per_cpu(kidled_thread, cpu));
+
+ snprintf(per_cpu(cpu_num, cpu), CPU_NUM_SIZE, "%d", cpu);
+ kidled_cpu_table[cpu].procname = per_cpu(cpu_num, cpu);
+ kidled_cpu_table[cpu].mode = 0555;
+ kidled_cpu_table[cpu].child = per_cpu(kidled_cpu_dir_table,
+ cpu);
+
+ memcpy(per_cpu(kidled_cpu_dir_table, cpu), kidled_cpu_dir_prot,
+ sizeof(kidled_cpu_dir_prot));
+
+ for (i = 0; i < NUM_CPU_CTLS; i++) {
+ per_cpu(kidled_cpu_dir_table[i], cpu).extra1 =
+ (void *)((long)cpu);
+ }
+
+ }
+ kidled_init_completed = 1;
+ return 0;
+}
+module_init(kidled_init);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 7c1a67e..97d6193 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,6 +24,7 @@
#include <linux/ftrace.h>
#include <linux/smp.h>
#include <linux/tick.h>
+#include <linux/kidled.h>

#define CREATE_TRACE_POINTS
#include <trace/events/irq.h>
@@ -278,11 +279,15 @@ void irq_enter(void)
int cpu = smp_processor_id();

rcu_irq_enter();
- if (idle_cpu(cpu) && !in_interrupt()) {
- __irq_enter();
- tick_check_idle(cpu);
- } else
- __irq_enter();
+ __irq_enter();
+ if (!in_interrupt()) {
+ if (idle_cpu(cpu))
+ tick_check_idle(cpu);
+
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ kidled_interrupt_enter();
+#endif
+ }
}

#ifdef __ARCH_IRQ_EXIT_IRQS_DISABLED
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a68b24..eaec177 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -190,6 +190,9 @@ static struct ctl_table fs_table[];
static struct ctl_table debug_table[];
static struct ctl_table dev_table[];
extern struct ctl_table random_table[];
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+extern struct ctl_table kidled_table[];
+#endif
#ifdef CONFIG_INOTIFY_USER
extern struct ctl_table inotify_table[];
#endif
@@ -601,6 +604,14 @@ static struct ctl_table kern_table[] = {
.mode = 0555,
.child = random_table,
},
+
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ {
+ .procname = "kidled",
+ .mode = 0555,
+ .child = kidled_table,
+ },
+#endif
{
.procname = "overflowuid",
.data = &overflowuid,

2010-04-14 00:10:39

[permalink] [raw]

Subject: [PATCH 2/3] [kidled]: Add eager injection.

From: Salman Qazi <[email protected]>

We add the concept of a "power interactive" task group. This is a task
group that, for the purposes of power capping, will recieve special treatment.

When there are no power interactive tasks on the runqueue, we inject idle
cycles unless we have already met the quota. However, when there are
power interactive tasks on the runqueue, we only inject idle cycles if we
would otherwise fail to meet the quota. As a result, we try our very best
to not hit the interactive tasks with the idle cycles. The power
interactivity status of a task group is determined by the boolean value
in cpu.power_interactive.

Signed-off-by: Salman Qazi <[email protected]>
---
Documentation/kidled.txt | 15 ++++
include/linux/kidled.h | 34 +++++++++
include/linux/sched.h | 3 +
kernel/kidled.c | 166 +++++++++++++++++++++++++++++++++++++++++++---
kernel/sched.c | 80 ++++++++++++++++++++++
5 files changed, 285 insertions(+), 13 deletions(-)

diff --git a/Documentation/kidled.txt b/Documentation/kidled.txt
index 1149e3f..564aa00 100644
--- a/Documentation/kidled.txt
+++ b/Documentation/kidled.txt
@@ -25,7 +25,7 @@ injected idle cycles are by convention reported as busy time, attributed to
kidled.

-Operation:
+Basic Operation:

The injecting component of the idle cycle injector is the kernel thread
kidled. The measurements to determine when to inject idle cycles is done
@@ -38,3 +38,16 @@ quota. If that's the case, then we inject idle cycles until the end of the
interval.

+Eager Injection:
+
+Above is true, when there is at least one tasks marked "interactive" on
+the CPU runqueue for the duration of the interval. Marking a task
+interactive involves setting power_interactive to 1 in its parent CPU
+cgroup. When such no such task is runnable and when we have not achieved
+the minimum idle percentage for the interval, we eagerly inject idle cycles.
+The purpose for doing so is to inject as many of the idle cycles as possible
+while the interactive tasks are not running. Thus, when the interactive
+tasks become runnable, they are more likely to fall in an interval when we
+aren't forcing the CPU idle.
+
+
diff --git a/include/linux/kidled.h b/include/linux/kidled.h
index 7940dfa..05c4ae5 100644
--- a/include/linux/kidled.h
+++ b/include/linux/kidled.h
@@ -11,6 +11,7 @@
#define _IDLED_H

DECLARE_PER_CPU(unsigned long, cpu_lazy_inject_count);
+DECLARE_PER_CPU(unsigned long, cpu_eager_inject_count);

static inline s64 current_cpu_lazy_inject_count(void)
{
@@ -18,9 +19,16 @@ static inline s64 current_cpu_lazy_inject_count(void)
return __get_cpu_var(cpu_lazy_inject_count);
}

+static inline s64 current_cpu_eager_inject_count(void)
+{
+ /* We update this value in the idle cycle injector */
+ return __get_cpu_var(cpu_eager_inject_count);
+}
+
static inline s64 current_cpu_inject_count(void)
{
- return current_cpu_lazy_inject_count();
+ return current_cpu_lazy_inject_count() +
+ current_cpu_eager_inject_count();
}

@@ -42,4 +50,28 @@ static inline s64 current_cpu_busy_count(void)
void kidled_interrupt_enter(void);
void set_cpu_idle_ratio(int cpu, long idle_time, long busy_time);
void get_cpu_idle_ratio(int cpu, long *idle_time, long *busy_time);
+
+enum ici_enum {
+ ICI_LAZY,
+ ICI_EAGER,
+};
+
+DECLARE_PER_CPU(enum ici_enum, ici_state);
+
+static inline int ici_in_eager_mode(void)
+{
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ return (__get_cpu_var(ici_state) == ICI_EAGER);
+#else
+ return 0;
+#endif
+}
+
+int kidled_running(void);
+struct task_struct *get_kidled_task(int cpu);
+int is_ici_thread(struct task_struct *p);
+void kidled_interrupt_enter(void);
+void set_cpu_idle_ratio(int cpu, long idle_time, long busy_time);
+void get_cpu_idle_ratio(int cpu, long *idle_time, long *busy_time);
+extern int should_eager_inject(void);
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78efe7c..1f94f21 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1566,6 +1566,9 @@ struct task_struct {
unsigned long memsw_bytes; /* uncharged mem+swap usage */
} memcg_batch;
#endif
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ int power_interactive;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/kernel/kidled.c b/kernel/kidled.c
index f590178..4e7aff3 100644
--- a/kernel/kidled.c
+++ b/kernel/kidled.c
@@ -45,10 +45,16 @@ struct kidled_inputs {
};

static int kidled_init_completed;
+
+DEFINE_PER_CPU(enum ici_enum, ici_state);
static DEFINE_PER_CPU(struct task_struct *, kidled_thread);
static DEFINE_PER_CPU(struct kidled_inputs, kidled_inputs);

DEFINE_PER_CPU(unsigned long, cpu_lazy_inject_count);
+DEFINE_PER_CPU(unsigned long, cpu_eager_inject_count);
+
+static int sysctl_ici_lb_prio;
+static int ici_lb_prio_max = MAX_PRIO - MAX_RT_PRIO - 1;

struct monitor_cpu_data {
int cpu;
@@ -58,10 +64,26 @@ struct monitor_cpu_data {
long max_cpu_time;
long clock_time;
long cpu_time;
+ long eager_inject_goal;
};

static DEFINE_PER_CPU(struct monitor_cpu_data, monitor_cpu_data);

+int get_ici_lb_prio(void)
+{
+ return sysctl_ici_lb_prio;
+}
+
+int is_ici_thread(struct task_struct *p)
+{
+ return per_cpu(kidled_thread, task_cpu(p)) == p;
+}
+
+int kidled_running(void)
+{
+ return __get_cpu_var(kidled_thread)->se.on_rq;
+}
+

static DEFINE_PER_CPU(int, in_lazy_inject);
static DEFINE_PER_CPU(unsigned long, inject_start);
@@ -98,6 +120,40 @@ static void exit_lazy_inject(void)
local_irq_enable();
}

+static DEFINE_PER_CPU(int, in_eager_inject);
+static void __enter_eager_inject(void)
+{
+ if (!__get_cpu_var(in_eager_inject)) {
+ __get_cpu_var(inject_start) = ktime_to_ns(ktime_get());
+ __get_cpu_var(in_eager_inject) = 1;
+ }
+ enter_idle();
+}
+
+static void __exit_eager_inject(void)
+{
+ if (__get_cpu_var(in_eager_inject)) {
+ __get_cpu_var(cpu_eager_inject_count) +=
+ ktime_to_ns(ktime_get()) - __get_cpu_var(inject_start);
+ __get_cpu_var(in_eager_inject) = 0;
+ }
+ __exit_idle();
+}
+
+static void enter_eager_inject(void)
+{
+ local_irq_disable();
+ __enter_eager_inject();
+ local_irq_enable();
+}
+
+static void exit_eager_inject(void)
+{
+ local_irq_disable();
+ __exit_eager_inject();
+ local_irq_enable();
+}
+
/* Caller must have interrupts disabled */
void kidled_interrupt_enter(void)
{
@@ -105,6 +161,7 @@ void kidled_interrupt_enter(void)
return;

__exit_lazy_inject();
+ __exit_eager_inject();
}

static DEFINE_PER_CPU(int, still_lazy_injecting);
@@ -168,8 +225,25 @@ static DEFINE_PER_CPU(int, still_monitoring);
/*
* Tells us when we would need to wake up next.
*/
-long get_next_timer(struct monitor_cpu_data *data)
+static void eager_inject(void)
+{
+ while (should_eager_inject() && __get_cpu_var(still_monitoring)
+ && ici_in_eager_mode()) {
+ enter_eager_inject();
+ do_idle();
+ exit_eager_inject();
+ cond_resched();
+ }
+}
+
+/*
+ * Tells us when we would need to wake up next
+ */
+long get_next_timer(struct monitor_cpu_data *data,
+ enum ici_enum *state)
{
+ long next_timer;
+ long rounded_eager;
long lazy;

lazy = min(data->max_cpu_time - data->cpu_time,
@@ -177,7 +251,19 @@ long get_next_timer(struct monitor_cpu_data *data)

lazy -= SLEEP_GRANULARITY - 1;

- return lazy;
+ if (data->eager_inject_goal > 0) {
+ *state = ICI_EAGER;
+ if (!should_eager_inject())
+ rounded_eager = NSEC_PER_MSEC;
+ else
+ rounded_eager = roundup(data->eager_inject_goal,
+ SLEEP_GRANULARITY);
+ next_timer = min(lazy, rounded_eager);
+ } else {
+ *state = ICI_LAZY;
+ next_timer = lazy;
+ }
+ return next_timer;
}

/*
@@ -191,32 +277,51 @@ long get_next_timer(struct monitor_cpu_data *data)
static enum hrtimer_restart monitor_cpu_timer_func(struct hrtimer *timer)
{
long next_timer;
+ enum ici_enum old_state;
struct monitor_cpu_data *data = &__get_cpu_var(monitor_cpu_data);

BUG_ON(data->cpu != smp_processor_id());
data->clock_time = ktime_to_ns(ktime_get()) - data->base_clock_count;
data->cpu_time = current_cpu_busy_count() - data->base_cpu_count;
+ data->eager_inject_goal = (data->max_clock_time - data->max_cpu_time) -
+ (data->clock_time - data->cpu_time);

if ((data->max_clock_time - data->clock_time < SLEEP_GRANULARITY) ||
(data->max_cpu_time - data->cpu_time < SLEEP_GRANULARITY)) {
__get_cpu_var(still_monitoring) = 0;
+ __get_cpu_var(ici_state) = ICI_LAZY;

wake_up_process(__get_cpu_var(kidled_thread));
return HRTIMER_NORESTART;
} else {
- next_timer = get_next_timer(data);
+ old_state = __get_cpu_var(ici_state);
+ next_timer = get_next_timer(data, &__get_cpu_var(ici_state));
+
+ if (__get_cpu_var(ici_state) != old_state)
+ set_tsk_need_resched(current);
+
+ if (ici_in_eager_mode() && should_eager_inject() &&
+ !kidled_running())
+ wake_up_process(__get_cpu_var(kidled_thread));

hrtimer_forward_now(timer, ktime_set(0, next_timer));
return HRTIMER_RESTART;
}
}

+struct task_struct *get_kidled_task(int cpu)
+{
+ return per_cpu(kidled_thread, cpu);
+}
+
/*
* Allow other processes to use CPU for up to max_clock_time
* clock time, and max_cpu_time CPU time.
*
* Accurate only up to resolution of hrtimers.
*
+ * Invariant: This function should return with ici_state == ICI_LAZY.
+ *
* @return: Clock time left
*/
static unsigned long monitor_cpu(long max_clock_time, long max_cpu_time,
@@ -232,12 +337,14 @@ static unsigned long monitor_cpu(long max_clock_time, long max_cpu_time,
data->clock_time = 0;
data->cpu_time = 0;
data->cpu = smp_processor_id();
+ data->eager_inject_goal = max_clock_time - max_cpu_time;

- first_timer = get_next_timer(data);
+ first_timer = get_next_timer(data, &__get_cpu_var(ici_state));
if (first_timer <= 0) {
if (left_cpu_time)
*left_cpu_time = max_cpu_time;

+ __get_cpu_var(ici_state) = ICI_LAZY;
return max_clock_time;
}

@@ -247,11 +354,19 @@ static unsigned long monitor_cpu(long max_clock_time, long max_cpu_time,
sleep_timer.function = monitor_cpu_timer_func;
hrtimer_start(&sleep_timer, ktime_set(0, first_timer),
HRTIMER_MODE_REL);
- while (1) {
- set_current_state(TASK_INTERRUPTIBLE);
- if (!__get_cpu_var(still_monitoring))
- break;
- schedule();
+
+ while (__get_cpu_var(still_monitoring)) {
+ while (1) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (!__get_cpu_var(still_monitoring) ||
+ (ici_in_eager_mode() && should_eager_inject())) {
+ set_current_state(TASK_RUNNING);
+ break;
+ }
+ schedule();
+ }
+
+ eager_inject();
}

__get_cpu_var(still_monitoring) = 0;
@@ -345,6 +460,25 @@ static void set_kidled_interval(int cpu, long interval)
spin_unlock(&per_cpu(kidled_inputs, cpu).lock);
}

+static int proc_ici_lb_prio(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int ret;
+ int cpu;
+ struct sched_param param = { .sched_priority = KIDLED_PRIO };
+ ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+ if (!ret && write) {
+ /* Make the scheduler set the load weight again */
+ for_each_online_cpu(cpu) {
+ sched_setscheduler(per_cpu(kidled_thread, cpu),
+ SCHED_FIFO, &param);
+ }
+ }
+
+ return ret;
+}
+
static int proc_min_idle_percent(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
@@ -427,6 +561,7 @@ static void getstats(void *info)
stats[0] = current_cpu_idle_count();
stats[1] = current_cpu_busy_count();
stats[2] = current_cpu_lazy_inject_count();
+ stats[3] = current_cpu_eager_inject_count();
}

@@ -434,7 +569,7 @@ static int proc_stats(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos)
{
int ret;
- unsigned long stats[3];
+ unsigned long stats[4];
int cpu = (int)((long)table->extra1);
struct ctl_table fake = {};

@@ -442,7 +577,7 @@ static int proc_stats(struct ctl_table *table, int write,
return -EINVAL;

fake.data = stats;
- fake.maxlen = 3*sizeof(unsigned long);
+ fake.maxlen = 4*sizeof(unsigned long);

ret = smp_call_function_single(cpu, getstats, &stats, 1);
if (ret)
@@ -487,6 +622,15 @@ static int zero;

struct ctl_table kidled_table[] = {
{
+ .procname = "lb_prio",
+ .data = &sysctl_ici_lb_prio,
+ .maxlen = sizeof(int),
+ .proc_handler = proc_ici_lb_prio,
+ .extra1 = &zero,
+ .extra2 = &ici_lb_prio_max,
+ .mode = 0644,
+ },
+ {
.procname = "cpu",
.mode = 0555,
.child = kidled_cpu_table,
diff --git a/kernel/sched.c b/kernel/sched.c
index 3a8fb30..486cab2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -71,6 +71,7 @@
#include <linux/debugfs.h>
#include <linux/ctype.h>
#include <linux/ftrace.h>
+#include <linux/kidled.h>

#include <asm/tlb.h>
#include <asm/irq_regs.h>
@@ -257,6 +258,9 @@ struct task_group {
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
unsigned long shares;
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ int power_interactive;
+#endif
#endif

#ifdef CONFIG_RT_GROUP_SCHED
@@ -626,6 +630,10 @@ struct rq {
/* BKL stats */
unsigned int bkl_count;
#endif
+
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ unsigned int nr_interactive;
+#endif
};

static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
@@ -1888,6 +1896,13 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
if (wakeup)
p->se.start_runtime = p->se.sum_exec_runtime;

+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ if (!p->se.on_rq) {
+ p->power_interactive = task_group(p)->power_interactive;
+ rq->nr_interactive += p->power_interactive;
+ }
+#endif
+
sched_info_queued(p);
p->sched_class->enqueue_task(rq, p, wakeup);
p->se.on_rq = 1;
@@ -1906,6 +1921,11 @@ static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
}
}

+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ if (p->se.on_rq)
+ rq->nr_interactive -= p->power_interactive;
+#endif
+
sched_info_dequeued(p);
p->sched_class->dequeue_task(rq, p, sleep);
p->se.on_rq = 0;
@@ -5443,6 +5463,19 @@ static void put_prev_task(struct rq *rq, struct task_struct *prev)
prev->sched_class->put_prev_task(rq, prev);
}

+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+int curr_rq_has_interactive(void)
+{
+ return (this_rq()->nr_interactive > 0);
+}
+
+int should_eager_inject(void)
+{
+ return !curr_rq_has_interactive() && (!this_rq()->rt.rt_nr_running
+ || ((this_rq()->rt.rt_nr_running == 1) && kidled_running()));
+}
+#endif
+
/*
* Pick up the highest-prio task:
*/
@@ -5452,6 +5485,23 @@ pick_next_task(struct rq *rq)
const struct sched_class *class;
struct task_struct *p;

+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ if (ici_in_eager_mode() && should_eager_inject() &&
+ !kidled_running()) {
+ p = get_kidled_task(cpu_of(rq));
+
+ current->se.last_wakeup = current->se.sum_exec_runtime;
+
+#if defined(CONFIG_SMP) && defined(CONFIG_SCHEDSTATS)
+ schedstat_inc(rq, ttwu_count);
+ schedstat_inc(rq, ttwu_local);
+#endif
+
+ set_task_state(p, TASK_RUNNING);
+ activate_task(rq, p, 1);
+ }
+#endif
+
/*
* Optimization: we know that if all tasks are in
* the fair class we can call that function directly:
@@ -9567,6 +9617,9 @@ void __init sched_init(void)
rq = cpu_rq(i);
raw_spin_lock_init(&rq->lock);
rq->nr_running = 0;
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ rq->nr_interactive = 0;
+#endif
rq->calc_load_active = 0;
rq->calc_load_update = jiffies + LOAD_FREQ;
init_cfs_rq(&rq->cfs, rq);
@@ -10604,6 +10657,26 @@ static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)

return (u64) tg->shares;
}
+
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+static u64 cpu_power_interactive_read_u64(struct cgroup *cgrp,
+ struct cftype *cft)
+{
+ struct task_group *tg = cgroup_tg(cgrp);
+ return (u64) tg->power_interactive;
+}
+
+static int cpu_power_interactive_write_u64(struct cgroup *cgrp,
+ struct cftype *cft, u64 interactive)
+{
+ struct task_group *tg = cgroup_tg(cgrp);
+ if ((interactive < 0) || (interactive > 1))
+ return -EINVAL;
+
+ tg->power_interactive = interactive;
+ return 0;
+}
+#endif /* CONFIG_IDLE_CYCLE_INJECTOR */
#endif /* CONFIG_FAIR_GROUP_SCHED */

#ifdef CONFIG_RT_GROUP_SCHED
@@ -10637,6 +10710,13 @@ static struct cftype cpu_files[] = {
.read_u64 = cpu_shares_read_u64,
.write_u64 = cpu_shares_write_u64,
},
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ {
+ .name = "power_interactive",
+ .read_u64 = cpu_power_interactive_read_u64,
+ .write_u64 = cpu_power_interactive_write_u64,
+ },
+#endif
#endif
#ifdef CONFIG_RT_GROUP_SCHED
{

2010-04-14 00:11:05

[permalink] [raw]

Subject: [PATCH 3/3] [kidled]: Introduce power capping priority and LB awareness.

From: Salman Qazi <[email protected]>

0) Power Capping Priority:

After we finish a lazy injection, we look at the task groups in the order
of increasing priority. For each task group, we attempt to assign
as much vruntime as possible, to cover the time that was spent doing
the lazy injection. Within each priority, we round-robin between the
task group between different invocations to make sure that we don't
consistently penalize the same one.

The priorities themselves are specified through the value
cpu.power_capping_priority in the parent CPU cgroup of the tasks.

1) Load balancer awareness

Idle cycle injector is an RT thread. A consequence is that from the load
balancer's point of view, it is a particularly heavy thread. While
we appreciate the ability to preempt any CFS threads, it is useful
to have a lesser weight: as a heavy weight makes an injected CPU
disproportionately less desirable than other CPUs. We provide this
by faking the weight of the idle cycle injector to be equivalent to
a CFS thread of a user controllable nice value.

Signed-off-by: Salman Qazi <[email protected]>
---
Documentation/kidled.txt | 38 ++++++++++++++++++++++-
include/linux/kidled.h | 6 ++++
kernel/kidled.c | 2 +
kernel/sched.c | 75 +++++++++++++++++++++++++++++++++++++++++++--
kernel/sched_fair.c | 77 +++++++++++++++++++++++++++++++++++++++++++++-
5 files changed, 192 insertions(+), 6 deletions(-)

diff --git a/Documentation/kidled.txt b/Documentation/kidled.txt
index 564aa00..400b97b 100644
--- a/Documentation/kidled.txt
+++ b/Documentation/kidled.txt
@@ -6,7 +6,7 @@ Overview:
Provides a kernel interface for causing the CPUs to have some
minimum percentage of the idle time.

-Interfaces:
+Basic Interfaces:

Under /proc/sys/kernel/kidled/, we can find the following files:

@@ -51,3 +51,39 @@ tasks become runnable, they are more likely to fall in an interval when we
aren't forcing the CPU idle.

+Power Capping Priority:
+
+The time taken up by the idle cycle injector normally affects all of the
+interactive processes in the same way. Essentially, that length of time
+disappears from CF's decisions.
+
+However, this isn't always desirable. Ideally, we want
+to be able to shield some tasks from the consequences of power capping, while
+letting other tasks take the brunt of the impact. We accomplish this by
+stealing time from tasks, as if they were running while we were lazy
+injecting. We do this in a user specified priority order. The priorities
+are specified as power_capping_priority in the parent CPU cgroup of the tasks.
+The higher the priority, the better it is for the task. The run delay
+introduced by power capping is first given to the lower priority task, but
+if they aren't able to absorb it (i.e. it exceeds the time that they would
+have available to run), then it is passed to the higher priorities. In
+case of a tie, we round robin the order of the tasks for this penalty.
+
+Note that we reserve the power capping priority treatment for lazy injections
+only. Eagerly injected cycles are distributed equally among all the
+tasks. Since interactive tasks are unaffected by eager injection, this
+is fine.
+
+Pretending to be a CFS thread for the LB:
+
+The kidled is an RT thread so that it can preempt almost anything.
+As such, it would normally have the weight associated with an RT thread.
+However, this makes a CPU recieving an idle cycle injection,
+suddenly much much less desirable than other CPUs with just CFS tasks.
+To provide a way to remedy this, we allow the setting of a fake nice value
+for the kidled thread. Normally these threads are nice -19. But the value
+can be adjusted by the user with /proc/sys/kernel/kidled/lb_prio. This is
+specified as a non-negative integer. 0 corresponds to nice -19 (default)
+and 39 corresponds to nice 20.
+
+
diff --git a/include/linux/kidled.h b/include/linux/kidled.h
index 05c4ae5..199915a 100644
--- a/include/linux/kidled.h
+++ b/include/linux/kidled.h
@@ -69,9 +69,15 @@ static inline int ici_in_eager_mode(void)

int kidled_running(void);
struct task_struct *get_kidled_task(int cpu);
+int get_ici_lb_prio(void);
int is_ici_thread(struct task_struct *p);
void kidled_interrupt_enter(void);
void set_cpu_idle_ratio(int cpu, long idle_time, long busy_time);
void get_cpu_idle_ratio(int cpu, long *idle_time, long *busy_time);
extern int should_eager_inject(void);
+void power_capping_reshuffle_runqueue(long injected, long period);
+extern int should_eager_inject(void);
+
+#define MAX_POWER_CAPPING_PRIORITY (48)
+
#endif
diff --git a/kernel/kidled.c b/kernel/kidled.c
index 4e7aff3..5cd6911 100644
--- a/kernel/kidled.c
+++ b/kernel/kidled.c
@@ -218,6 +218,8 @@ static void lazy_inject(long nsecs, long interval)
}
__get_cpu_var(still_lazy_injecting) = 0;
hrtimer_cancel(&halt_timer);
+
+ power_capping_reshuffle_runqueue(nsecs, interval);
}

static DEFINE_PER_CPU(int, still_monitoring);
diff --git a/kernel/sched.c b/kernel/sched.c
index 486cab2..f2e89cd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -260,6 +260,8 @@ struct task_group {
unsigned long shares;
#ifdef CONFIG_IDLE_CYCLE_INJECTOR
int power_interactive;
+ int power_capping_priority;
+ struct list_head pcp_queue_list[NR_CPUS];
#endif
#endif

@@ -552,6 +554,9 @@ struct rq {
#ifdef CONFIG_FAIR_GROUP_SCHED
/* list of leaf cfs_rq on this cpu: */
struct list_head leaf_cfs_rq_list;
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ struct list_head pwrcap_prio_queue[MAX_POWER_CAPPING_PRIORITY];
+#endif
#endif
#ifdef CONFIG_RT_GROUP_SCHED
struct list_head leaf_rt_rq_list;
@@ -1867,8 +1872,20 @@ static void dec_nr_running(struct rq *rq)
static void set_load_weight(struct task_struct *p)
{
if (task_has_rt_policy(p)) {
- p->se.load.weight = prio_to_weight[0] * 2;
- p->se.load.inv_weight = prio_to_wmult[0] >> 1;
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ if (!is_ici_thread(p)) {
+#endif
+ p->se.load.weight = prio_to_weight[0] * 2;
+ p->se.load.inv_weight = prio_to_wmult[0] >> 1;
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ } else {
+ int lb_prio = get_ici_lb_prio();
+ p->se.load.weight =
+ prio_to_weight[lb_prio];
+ p->se.load.inv_weight =
+ prio_to_wmult[lb_prio];
+ }
+#endif
return;
}

@@ -9599,7 +9616,12 @@ void __init sched_init(void)
#ifdef CONFIG_GROUP_SCHED
list_add(&init_task_group.list, &task_groups);
INIT_LIST_HEAD(&init_task_group.children);
-
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ for_each_possible_cpu(i)
+ INIT_LIST_HEAD(&init_task_group.pcp_queue_list[i]);
+#endif
+#endif
#ifdef CONFIG_USER_SCHED
INIT_LIST_HEAD(&root_task_group.children);
init_task_group.parent = &root_task_group;
@@ -9627,6 +9649,10 @@ void __init sched_init(void)
#ifdef CONFIG_FAIR_GROUP_SCHED
init_task_group.shares = init_task_group_load;
INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ for (j = 0; j < MAX_POWER_CAPPING_PRIORITY; j++)
+ INIT_LIST_HEAD(&rq->pwrcap_prio_queue[j]);
+#endif
#ifdef CONFIG_CGROUP_SCHED
/*
* How much cpu bandwidth does init_task_group get?
@@ -10110,6 +10136,11 @@ struct task_group *sched_create_group(struct task_group *parent)

WARN_ON(!parent); /* root should already exist */

+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ for_each_possible_cpu(i)
+ INIT_LIST_HEAD(&tg->pcp_queue_list[i]);
+#endif
+
tg->parent = parent;
INIT_LIST_HEAD(&tg->children);
list_add_rcu(&tg->siblings, &parent->children);
@@ -10676,6 +10707,39 @@ static int cpu_power_interactive_write_u64(struct cgroup *cgrp,
tg->power_interactive = interactive;
return 0;
}
+
+static u64 cpu_power_capping_priority_read_u64(struct cgroup *cgrp,
+ struct cftype *cft)
+{
+ struct task_group *tg = cgroup_tg(cgrp);
+ return (u64) tg->power_capping_priority;
+}
+
+static int cpu_power_capping_priority_write_u64(struct cgroup *cgrp,
+ struct cftype *cftype,
+ u64 priority)
+{
+ struct task_group *tg = cgroup_tg(cgrp);
+ int i;
+
+ if (priority >= MAX_POWER_CAPPING_PRIORITY)
+ return -EINVAL;
+
+ tg->power_capping_priority = priority;
+
+ for_each_online_cpu(i) {
+ struct rq *rq = cpu_rq(i);
+
+ raw_spin_lock_irq(&rq->lock);
+ if (!list_empty(&tg->pcp_queue_list[i])) {
+ list_move_tail(&tg->pcp_queue_list[i],
+ &rq->pwrcap_prio_queue[priority]);
+ }
+ raw_spin_unlock_irq(&rq->lock);
+ }
+
+ return 0;
+}
#endif /* CONFIG_IDLE_CYCLE_INJECTOR */
#endif /* CONFIG_FAIR_GROUP_SCHED */

@@ -10712,6 +10776,11 @@ static struct cftype cpu_files[] = {
},
#ifdef CONFIG_IDLE_CYCLE_INJECTOR
{
+ .name = "power_capping_priority",
+ .read_u64 = cpu_power_capping_priority_read_u64,
+ .write_u64 = cpu_power_capping_priority_write_u64,
+ },
+ {
.name = "power_interactive",
.read_u64 = cpu_power_interactive_read_u64,
.write_u64 = cpu_power_interactive_write_u64,
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 8fe7ee8..715a3ae 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -625,8 +625,23 @@ static void
account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
update_load_add(&cfs_rq->load, se->load.weight);
- if (!parent_entity(se))
+ if (!parent_entity(se)) {
+
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ struct task_group *tg = NULL;
+
+ if (group_cfs_rq(se))
+ tg = group_cfs_rq(se)->tg;
+ if (tg && tg->parent) {
+ int cpu = cpu_of(rq_of(cfs_rq));
+ int pcp_prio = tg->power_capping_priority;
+ list_add_tail(&tg->pcp_queue_list[cpu],
+ &rq_of(cfs_rq)->pwrcap_prio_queue[pcp_prio]);
+ }
+#endif
+
inc_cpu_load(rq_of(cfs_rq), se->load.weight);
+ }
if (entity_is_task(se)) {
add_cfs_task_weight(cfs_rq, se->load.weight);
list_add(&se->group_node, &cfs_rq->tasks);
@@ -639,8 +654,19 @@ static void
account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
update_load_sub(&cfs_rq->load, se->load.weight);
- if (!parent_entity(se))
+ if (!parent_entity(se)) {
+
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ struct task_group *tg = NULL;
+
+ if (group_cfs_rq(se))
+ tg = group_cfs_rq(se)->tg;
+ if (tg && tg->parent)
+ list_del_init(&tg->pcp_queue_list[cfs_rq->rq->cpu]);
+#endif
+
dec_cpu_load(rq_of(cfs_rq), se->load.weight);
+ }
if (entity_is_task(se)) {
add_cfs_task_weight(cfs_rq, -se->load.weight);
list_del_init(&se->group_node);
@@ -988,6 +1014,53 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
check_preempt_tick(cfs_rq, curr);
}

+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+/* reshuffle run queue order base on power capping priority */
+void power_capping_reshuffle_runqueue(long injected, long ici_period)
+{
+ int i;
+ int cpu = smp_processor_id();
+ struct rq *rq = this_rq_lock();
+ struct task_group *tg;
+ struct task_group *next;
+
+ for (i = 0; i < MAX_POWER_CAPPING_PRIORITY; i++) {
+ struct list_head tmp_list;
+ INIT_LIST_HEAD(&tmp_list);
+ list_for_each_entry_safe(tg, next, &rq->pwrcap_prio_queue[i],
+ pcp_queue_list[cpu]) {
+ struct sched_entity *se;
+ struct cfs_rq *cfs_rq;
+ long slice, charge;
+
+ se = tg->se[cpu];
+ cfs_rq = se->cfs_rq;
+
+ slice = sched_slice(cfs_rq, se) * ici_period /
+ __sched_period(cfs_rq->nr_running);
+ charge = min(slice, injected);
+
+ __dequeue_entity(cfs_rq, se);
+ se->vruntime += calc_delta_fair(charge, se);
+ __enqueue_entity(cfs_rq, se);
+
+ injected -= charge;
+ list_del(&tg->pcp_queue_list[cpu]);
+ list_add_tail(&tg->pcp_queue_list[cpu], &tmp_list);
+ if (injected <= 0) {
+ list_splice(&tmp_list,
+ rq->pwrcap_prio_queue[i].prev);
+ goto done;
+ }
+ }
+ list_splice(&tmp_list, &rq->pwrcap_prio_queue[i]);
+ }
+done:
+ raw_spin_unlock_irq(&rq->lock);
+ return;
+}
+#endif
+
/**************************************************
* CFS operations on tasks:
*/

2010-04-14 09:50:04

by Andi Kleen

[permalink] [raw]

Subject: Re: [PATCH 1/3] [kidled]: introduce kidled.

Salman <[email protected]> writes:
> +
> +static int proc_stats(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + int ret;
> + unsigned long stats[3];
> + int cpu = (int)((long)table->extra1);
> + struct ctl_table fake = {};
> +
> + if (write)
> + return -EINVAL;
> +
> + fake.data = stats;
> + fake.maxlen = 3*sizeof(unsigned long);
> +
> + ret = smp_call_function_single(cpu, getstats, &stats, 1);
> + if (ret)
> + return ret;

Haven't read the whole thing, but do any of these stats really
need to execute on the target CPU? They seem to be just readable
fields.

Or does it simply not matter because this proc call is too infrequent?

Anyways global broadcasts are discouraged, there is typically
always someone who feels their RT latency be messed up by them.

-Andi

--
[email protected] -- Speaking for myself only.

2010-04-14 15:41:39

[permalink] [raw]

Subject: Re: [PATCH 1/3] [kidled]: introduce kidled.

On Wed, Apr 14, 2010 at 2:49 AM, Andi Kleen <[email protected]> wrote:
> Salman <[email protected]> writes:
>> +
>> +static int proc_stats(struct ctl_table *table, int write,
>> + ? ? ? ? ? ? ? ? ? void __user *buffer, size_t *lenp, loff_t *ppos)
>> +{
>> + ? ? int ret;
>> + ? ? unsigned long stats[3];
>> + ? ? int cpu = (int)((long)table->extra1);
>> + ? ? struct ctl_table fake = {};
>> +
>> + ? ? if (write)
>> + ? ? ? ? ? ? return -EINVAL;
>> +
>> + ? ? fake.data = stats;
>> + ? ? fake.maxlen = 3*sizeof(unsigned long);
>> +
>> + ? ? ret = smp_call_function_single(cpu, getstats, &stats, 1);
>> + ? ? if (ret)
>> + ? ? ? ? ? ? return ret;
>
> Haven't read the whole thing, but do any of these stats really
> need to execute on the target CPU? They seem to be just readable
> fields.

To capture all the quantities for a CPU atomically, they must be read
on the CPU. Basically, reading them on that CPU prevents them from
changing as we read them.

Also, if the CPU is idle (injected or otherwise), the quantities won't
get updated.

>
> Or does it simply not matter because this proc call is too infrequent?

It should be infrequent. The idle cycle injector does all the hard
work. These interfaces are for monitoring.

>
> Anyways global broadcasts are discouraged, there is typically
> always someone who feels their RT latency be messed up by them.

I will look at it one more time to see if there is something else that
can be done.

>
> -Andi
>
>
> --
> [email protected] -- Speaking for myself only.
>

2010-04-15 07:46:56

[permalink] [raw]

Subject: Re: [PATCH 1/3] [kidled]: introduce kidled.

On Wed, 2010-04-14 at 08:41 -0700, Salman Qazi wrote:
>
> To capture all the quantities for a CPU atomically, they must be read
> on the CPU. Basically, reading them on that CPU prevents them from
> changing as we read them.
>
> Also, if the CPU is idle (injected or otherwise), the quantities won't
> get updated.

Who cares? by the time they reach userspace they've changed anyway.

2010-04-15 07:51:31

[permalink] [raw]

Subject: Re: [PATCH 0/3] [idled]: Idle Cycle Injector for power capping

On Tue, 2010-04-13 at 17:08 -0700, Salman wrote:
> As we discussed earlier this year, Google has an implementation that it
> would like to share. I have finally gotten around to porting it to
> v2.6.33 and cleaning up the interfaces. It is provided in the following
> messages for your review. I realize that when we first discussed this
> idea, a lot of ideas were presented for enhancing it. Thanks alot for
> your suggestions. I haven't gotten around to implementing any of them.

.33 is way too old to submit patches against.

That said, I really really dislike this approach, I would much rather
see it tie in with power aware scheduling.

2010-04-17 16:38:40

[permalink] [raw]

Subject: Re: [PATCH 0/3] [idled]: Idle Cycle Injector for power capping

On Tue, 13 Apr 2010 17:08:18 -0700
Salman <[email protected]> wrote:

> As we discussed earlier this year, Google has an implementation that
> it would like to share. I have finally gotten around to porting it to
> v2.6.33 and cleaning up the interfaces. It is provided in the
> following messages for your review. I realize that when we first
> discussed this idea, a lot of ideas were presented for enhancing it.
> Thanks alot for your suggestions. I haven't gotten around to
> implementing any of them.

again I'll chime in to support this effort; it's the right thing to do
for power limiting (as opposed to taking cores offline), and I'm happy
to see progress being made.

I'll start playing with your patches and use timechart to see how well
it works, including to see how well things align...

> The ones that I still find appealing are:
>
> 0. Providing approximate synchronization between cores, regardless
> of their independant settings in order to improve power savings. We
> have to balance this with eager injection (i.e. avoiding injection
> when an interactive task needs to run).

I still would like to see this ;-)
It's a *HUGE* instant power delta.

But it does not have to be perfect. As long as "on average" we align
we're good enough.

the easiest way is to round the time of the start of idle injection
up to, say, double the duration of the injection period...
and maybe to whole seconds or some round value of jiffies as well.

It could even be done by "creeping" towards an aligned situation...
rather than forcing instant alignment, as long as each time we inject
idle time we get a step closer to being aligned.. very soon we WILL be
aligned.
(for example, if a cpu notices it's on the late side of an alignment
window, it could inject a little shorter than usual, while if it
notices it's a little early, it can inject a little longer)

> A stricter synchronization between cores is needed to make idle cycle
> injector work on hyperthreaded systems. This is a some what separate
> issue, as there should only be one idle cycle injector minimum idle
> setting per physical core.

actually... while the HT case is clearly required to be solved to get
actual power limits, ideally we can solve it using the same tricks we
use for the above, just with a stronger bias...

I don't think we need to force the admin to set the same value per se,
it's something that's just a matter of having the policy guy do this
right... (but if you want to do "effective injection %age is minimum of
the two" or so, I can live with that)

>

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2010-04-17 17:08:16

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH 0/3] [idled]: Idle Cycle Injector for power capping

On Thu, Apr 15, 2010 at 09:51:26AM +0200, Peter Zijlstra wrote:
> On Tue, 2010-04-13 at 17:08 -0700, Salman wrote:
> > As we discussed earlier this year, Google has an implementation that it
> > would like to share. I have finally gotten around to porting it to
> > v2.6.33 and cleaning up the interfaces. It is provided in the following
> > messages for your review. I realize that when we first discussed this
> > idea, a lot of ideas were presented for enhancing it. Thanks alot for
> > your suggestions. I haven't gotten around to implementing any of them.
>
> .33 is way too old to submit patches against.

But it's not too old for review purposes; as Salman said, they were
sent to LKML for comments and review. I think it's well understood
that when these patches are ready to be merged, they need to be
submitted right before the merge window opens, against a recent -rc
kernel.

- Ted

2010-04-17 17:55:59

[permalink] [raw]

Subject: Re: [PATCH 0/3] [idled]: Idle Cycle Injector for power capping

On Sat, 17 Apr 2010 13:08:08 -0400
[email protected] wrote:

> On Thu, Apr 15, 2010 at 09:51:26AM +0200, Peter Zijlstra wrote:
> > On Tue, 2010-04-13 at 17:08 -0700, Salman wrote:
> > > As we discussed earlier this year, Google has an implementation
> > > that it would like to share. I have finally gotten around to
> > > porting it to v2.6.33 and cleaning up the interfaces. It is
> > > provided in the following messages for your review. I realize
> > > that when we first discussed this idea, a lot of ideas were
> > > presented for enhancing it. Thanks alot for your suggestions. I
> > > haven't gotten around to implementing any of them.
> >
> > .33 is way too old to submit patches against.
>
> But it's not too old for review purposes; as Salman said, they were
> sent to LKML for comments and review. I think it's well understood
> that when these patches are ready to be merged, they need to be
> submitted right before the merge window opens, against a recent -rc
> kernel.

s/submitted/refreshed/ ;-)

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2010-04-17 19:55:22

[permalink] [raw]

Subject: Re: [PATCH 0/3] [idled]: Idle Cycle Injector for power capping

On Sat, 2010-04-17 at 13:08 -0400, [email protected] wrote:
> I think it's well understood
> that when these patches are ready to be merged, they need to be
> submitted right before the merge window opens, against a recent -rc
> kernel.

No, they need to be in the relevant subsystem tree by then, patch
submissions to subsystem trees right before the merge window opens are
bound to get delayed another cycle.

2010-04-19 17:20:21

[permalink] [raw]

Subject: Re: [PATCH 0/3] [idled]: Idle Cycle Injector for power capping

On Thu, Apr 15, 2010 at 12:51 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2010-04-13 at 17:08 -0700, Salman wrote:
>> As we discussed earlier this year, Google has an implementation that it
>> would like to share. ?I have finally gotten around to porting it to
>> v2.6.33 and cleaning up the interfaces. ?It is provided in the following
>> messages for your review. ?I realize that when we first discussed this
>> idea, a lot of ideas were presented for enhancing it. ?Thanks alot for
>> your suggestions. ?I haven't gotten around to implementing any of them.
>
> .33 is way too old to submit patches against.

Will bump up the version when I refresh the change.

>
> That said, I really really dislike this approach, I would much rather
> see it tie in with power aware scheduling.

I think I can see your point: there is potentially better information
about the power consumption of the CPU beyond the time it was busy.
But please clarify: is your complaint the lack of use of this
information or are you arguing for a deeper integration into the
scheduler (I.e. implementing it as part of the scheduler rather than
an independent thread) or both?

>
>

2010-04-19 19:01:47

[permalink] [raw]

Subject: Re: [PATCH 0/3] [idled]: Idle Cycle Injector for power capping

On Mon, 2010-04-19 at 10:20 -0700, Salman Qazi wrote:
> On Thu, Apr 15, 2010 at 12:51 AM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, 2010-04-13 at 17:08 -0700, Salman wrote:
> >> As we discussed earlier this year, Google has an implementation that it
> >> would like to share. I have finally gotten around to porting it to
> >> v2.6.33 and cleaning up the interfaces. It is provided in the following
> >> messages for your review. I realize that when we first discussed this
> >> idea, a lot of ideas were presented for enhancing it. Thanks alot for
> >> your suggestions. I haven't gotten around to implementing any of them.
> >
> > .33 is way too old to submit patches against.
>
> Will bump up the version when I refresh the change.
>
> >
> > That said, I really really dislike this approach, I would much rather
> > see it tie in with power aware scheduling.
>
> I think I can see your point: there is potentially better information
> about the power consumption of the CPU beyond the time it was busy.
> But please clarify: is your complaint the lack of use of this
> information or are you arguing for a deeper integration into the
> scheduler (I.e. implementing it as part of the scheduler rather than
> an independent thread) or both?

Right, so the IBM folks who were looking at power aware scheduling were
working on an interface to quantify the amount of power to save.

But their approach, was an extension of the regular power aware
load-balancer, which basically groups tasks onto sockets so that whole
sockets can go idle.

However Arjan explained to me that your approach, which idles the whole
machine, has the advantage that also memory banks can go into idle mode
and save power.

Still in the interest to cut back on power-saving interfaces it would be
nice to see if there is anything we can do to merge these things, but I
really haven't thought much about that yet.

2010-04-20 00:58:56

[permalink] [raw]

Subject: Re: [PATCH 0/3] [idled]: Idle Cycle Injector for power capping

On Mon, 19 Apr 2010 21:01:41 +0200
Peter Zijlstra <[email protected]> wrote:

> Right, so the IBM folks who were looking at power aware scheduling
> were working on an interface to quantify the amount of power to save.
>
> But their approach, was an extension of the regular power aware
> load-balancer, which basically groups tasks onto sockets so that whole
> sockets can go idle.
>
> However Arjan explained to me that your approach, which idles the
> whole machine, has the advantage that also memory banks can go into
> idle mode and save power.
>
> Still in the interest to cut back on power-saving interfaces it would
> be nice to see if there is anything we can do to merge these things,
> but I really haven't thought much about that yet.

one correction, this is not about power *saving*, it is about power
*capping*. Power capping is pretty much energy inefficient by
definition (and surely in practice), but it's about dealing with
reality about underdimensioned airconditioning or voltage rails....

Due to the reality that socket offlining isn't as good as idle
insertion.. I rather focus on the later...

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2010-04-20 04:51:14

by Vaidyanathan Srinivasan

[permalink] [raw]

Subject: Re: [PATCH 0/3] [idled]: Idle Cycle Injector for power capping

* Peter Zijlstra <[email protected]> [2010-04-19 21:01:41]:

> On Mon, 2010-04-19 at 10:20 -0700, Salman Qazi wrote:
> > On Thu, Apr 15, 2010 at 12:51 AM, Peter Zijlstra <[email protected]> wrote:
> > > On Tue, 2010-04-13 at 17:08 -0700, Salman wrote:
> > >> As we discussed earlier this year, Google has an implementation that it
> > >> would like to share. I have finally gotten around to porting it to
> > >> v2.6.33 and cleaning up the interfaces. It is provided in the following
> > >> messages for your review. I realize that when we first discussed this
> > >> idea, a lot of ideas were presented for enhancing it. Thanks alot for
> > >> your suggestions. I haven't gotten around to implementing any of them.
> > >
> > > .33 is way too old to submit patches against.
> >
> > Will bump up the version when I refresh the change.
> >
> > >
> > > That said, I really really dislike this approach, I would much rather
> > > see it tie in with power aware scheduling.
> >
> > I think I can see your point: there is potentially better information
> > about the power consumption of the CPU beyond the time it was busy.
> > But please clarify: is your complaint the lack of use of this
> > information or are you arguing for a deeper integration into the
> > scheduler (I.e. implementing it as part of the scheduler rather than
> > an independent thread) or both?
>
> Right, so the IBM folks who were looking at power aware scheduling were
> working on an interface to quantify the amount of power to save.

Indicating required system capacity to the loadbalance and using that
information to evacuate cores or socket was the basic idea.

Ref: http://lkml.org/lkml/2009/5/13/173

The challenges with that approach is the predictable evacuation or
forced idleness is not guaranteed.

> But their approach, was an extension of the regular power aware
> load-balancer, which basically groups tasks onto sockets so that whole
> sockets can go idle.

Integrating with the load balancer will make the design cleaner and
avoid forcefully running an idle thread. The scheduler should
schedule 'nothing' so that idleness can happen and cpuidle governor
can take care of idle states.

> However Arjan explained to me that your approach, which idles the whole
> machine, has the advantage that also memory banks can go into idle mode
> and save power.

Well, this is an ideal goal. Injecting some amount of idle time
across all cores/threads preferably with overlapping time window will
save quite a lot of power on x86. But atleast overlapping idle times
among sibling threads are required to get any power savings.

This proposed approach does not yet have the ability to do overlapping
idle times, though they may randomly occur.

> Still in the interest to cut back on power-saving interfaces it would be
> nice to see if there is anything we can do to merge these things, but I
> really haven't thought much about that yet.

Atleast integrating this with ACPI cpu aggregation driver can be a good
first step. Both the drivers and code are for the same power capping
purpose using idle time injection and running an high priority idle
thread for short duration.

ACPI Processor Aggregator Driver for 2.6.32-rc1
Ref: http://lkml.org/lkml/2009/10/3/13

--Vaidy

2010-04-20 05:00:46

by Vaidyanathan Srinivasan

[permalink] [raw]

Subject: Re: [PATCH 0/3] [idled]: Idle Cycle Injector for power capping

* Arjan van de Ven <[email protected]> [2010-04-19 18:00:32]:

> On Mon, 19 Apr 2010 21:01:41 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > Right, so the IBM folks who were looking at power aware scheduling
> > were working on an interface to quantify the amount of power to save.
> >
> > But their approach, was an extension of the regular power aware
> > load-balancer, which basically groups tasks onto sockets so that whole
> > sockets can go idle.
> >
> > However Arjan explained to me that your approach, which idles the
> > whole machine, has the advantage that also memory banks can go into
> > idle mode and save power.
> >
> > Still in the interest to cut back on power-saving interfaces it would
> > be nice to see if there is anything we can do to merge these things,
> > but I really haven't thought much about that yet.
>
> one correction, this is not about power *saving*, it is about power
> *capping*. Power capping is pretty much energy inefficient by
> definition (and surely in practice), but it's about dealing with
> reality about underdimensioned airconditioning or voltage rails....
>
> Due to the reality that socket offlining isn't as good as idle
> insertion.. I rather focus on the later...

The power reduction benefit is architecture and topology dependent.
Like on POWER platform, socket offlining could provide better power
reduction than idle injection.

As mentioned by Arjan, these approaches help reduce average power
consumption to meet power and cooling limitation over a short
interval. These are not general optimizations to improve operating
efficiency, however when use at certain workload and utilization
levels, these can potentially provide overall energy savings.

Having the SMP load balancer pull jobs away form a core or socket to
allow it to remain idle for short burst of time will be an good
implementation.

--Vaidy

2010-04-20 17:53:07

[permalink] [raw]

Subject: Re: [PATCH 0/3] [idled]: Idle Cycle Injector for power capping

On Mon, Apr 19, 2010 at 9:50 PM, Vaidyanathan Srinivasan
<[email protected]> wrote:
> * Peter Zijlstra <[email protected]> [2010-04-19 21:01:41]:
>
>> On Mon, 2010-04-19 at 10:20 -0700, Salman Qazi wrote:
>> > On Thu, Apr 15, 2010 at 12:51 AM, Peter Zijlstra <[email protected]> wrote:
>> > > On Tue, 2010-04-13 at 17:08 -0700, Salman wrote:
>> > >> As we discussed earlier this year, Google has an implementation that it
>> > >> would like to share. ?I have finally gotten around to porting it to
>> > >> v2.6.33 and cleaning up the interfaces. ?It is provided in the following
>> > >> messages for your review. ?I realize that when we first discussed this
>> > >> idea, a lot of ideas were presented for enhancing it. ?Thanks alot for
>> > >> your suggestions. ?I haven't gotten around to implementing any of them.
>> > >
>> > > .33 is way too old to submit patches against.
>> >
>> > Will bump up the version when I refresh the change.
>> >
>> > >
>> > > That said, I really really dislike this approach, I would much rather
>> > > see it tie in with power aware scheduling.
>> >
>> > I think I can see your point: ?there is potentially better information
>> > about the power consumption of the CPU beyond the time it was busy.
>> > But please clarify: is your complaint the lack of use of this
>> > information or are you arguing for a deeper integration into the
>> > scheduler (I.e. implementing it as part of the scheduler rather than
>> > an independent thread) or both?
>>
>> Right, so the IBM folks who were looking at power aware scheduling were
>> working on an interface to quantify the amount of power to save.
>
> Indicating required system capacity to the loadbalance and using that
> information to evacuate cores or socket was the basic idea.
>
> Ref: http://lkml.org/lkml/2009/5/13/173
>
> The challenges with that approach is the predictable evacuation or
> forced idleness is not guaranteed.
>
>> But their approach, was an extension of the regular power aware
>> load-balancer, which basically groups tasks onto sockets so that whole
>> sockets can go idle.
>
> Integrating with the load balancer will make the design cleaner and
> avoid forcefully running an idle thread. ?The scheduler should
> schedule 'nothing' so that idleness can happen and cpuidle governor
> can take care of idle states.

I am actually not sure which one would be more aesthetically pleasing.
Putting it into the scheduler would
also place a lot of complexity (basically, the same set of timers) in
the scheduler.

>
>> However Arjan explained to me that your approach, which idles the whole
>> machine, has the advantage that also memory banks can go into idle mode
>> and save power.
>
> Well, this is an ideal goal. ?Injecting some amount of idle time
> across all cores/threads preferably with overlapping time window will
> save quite a lot of power on x86. ?But atleast overlapping idle times
> among sibling threads are required to get any power savings.

Agreed. For sibling threads, we need a hard guarantee of simultaneous
injection, which is best achieved by using a single timer for all the
siblings. It is in my list of things to do. Is it necessary for the
first cut of idle cycle injector?

For improving power savings in the non-SMT case, as Arjan suggested, I
will make the changes for heuristically aligning the injection on
multiple cores. This will not be perfect, but then because it's a
power optimization, it doesn't have to always work. I presume that
this works best when done according to the CPU hierarchy? That is, it
is more beneficial to idle an entire socket than the same number of
cores on different sockets?

>
> This proposed approach does not yet have the ability to do overlapping
> idle times, though they may randomly occur.
>
>> Still in the interest to cut back on power-saving interfaces it would be
>> nice to see if there is anything we can do to merge these things, but I
>> really haven't thought much about that yet.
>
> Atleast integrating this with ACPI cpu aggregation driver can be a good
> first step. ?Both the drivers and code are for the same power capping
> purpose using idle time injection and running an high priority idle
> thread for short duration.
>
> ACPI Processor Aggregator Driver for 2.6.32-rc1
> Ref: http://lkml.org/lkml/2009/10/3/13

This is reasonable. I could merge the two implementations. Are there
features in that implementation that our implementation is missing?
>From a cursory glance, the driver is a naive idle cycle injector, in
that it doesn't take existing idle time or scheduler issues into
account. But if it did, it won't harm its original purpose.

>
> --Vaidy
>
>

2010-04-21 05:07:30

[permalink] [raw]

Subject: Re: [PATCH 0/3] [idled]: Idle Cycle Injector for power capping

On Tue, 20 Apr 2010 10:52:58 -0700
Salman Qazi <[email protected]> wrote:

> For improving power savings in the non-SMT case, as Arjan suggested, I
> will make the changes for heuristically aligning the injection on
> multiple cores. This will not be perfect, but then because it's a
> power optimization, it doesn't have to always work. I presume that
> this works best when done according to the CPU hierarchy? That is, it
> is more beneficial to idle an entire socket than the same number of
> cores on different sockets?

not really; at least not for Intel CPUs.
The problem is that due to the cache coherency, as long as one cpu in
the system is awake, the memory controllers etc cannot go into a sleep
mode...

I would not be surprised if AMD has the same behavior... or anyone else
with an integrated memory controller for that matter.

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2010-04-22 01:32:25

by Mike Chan

[permalink] [raw]

Subject: Re: [PATCH 0/3] [idled]: Idle Cycle Injector for power capping

On Thu, Apr 15, 2010 at 12:51 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2010-04-13 at 17:08 -0700, Salman wrote:
>> As we discussed earlier this year, Google has an implementation that it
>> would like to share. ?I have finally gotten around to porting it to
>> v2.6.33 and cleaning up the interfaces. ?It is provided in the following
>> messages for your review. ?I realize that when we first discussed this
>> idea, a lot of ideas were presented for enhancing it. ?Thanks alot for
>> your suggestions. ?I haven't gotten around to implementing any of them.
>
> .33 is way too old to submit patches against.
>
> That said, I really really dislike this approach, I would much rather
> see it tie in with power aware scheduling.

I may have missed this on lkml but are there any on-going community
efforts to power aware scheduling?

-- Mike

>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at ?http://www.tux.org/lkml/
>

2010-04-22 08:30:32