2009-10-08 09:49:14

by Arun R Bharadwaj

Subject: [v8 PATCH 0/8]: cpuidle: Cleanup cpuidle/ Introduce cpuidle to POWER.

Hi

Please consider this for inclusion into the testing tree.

This patchset introduces the cpuidle infrastructure to POWER, with a
prototype implementation for pSeries. It also does a major refactoring
of the current x86 idle power management code and a cleanup of the
cpuidle infrastructure.

Earlier discussions on this patchset can be found at:

v7 --> http://lkml.org/lkml/2009/10/6/278
v6 --> http://lkml.org/lkml/2009/9/22/180
v5 --> http://lkml.org/lkml/2009/9/22/26
v4 --> http://lkml.org/lkml/2009/9/1/133
v3 --> http://lkml.org/lkml/2009/8/27/124
v2 --> http://lkml.org/lkml/2009/8/26/233
v1 --> http://lkml.org/lkml/2009/8/19/150


Changes in this version:
--------------------------------------

* Remove redundant poll_idle definition in arch/x86/kernel/process.c

* Prevent acpi_driver from registering when boot_option_idle_override
is set, and let the cpuidle_default driver take over in this case.

* Enable default_idle when power_save=off on POWER.

thanks,
arun


2009-10-08 09:50:36

by Arun R Bharadwaj

Subject: [v8 PATCH 1/8]: cpuidle: cleanup drivers/cpuidle/cpuidle.c

* Arun R Bharadwaj <[email protected]> [2009-10-08 15:18:28]:

This patch cleans up drivers/cpuidle/cpuidle.c.
Earlier, cpuidle assumed pm_idle to be the default idle loop. Break
that assumption and make it more generic: cpuidle_idle_call(), the
main idle loop of cpuidle, is now called directly by architectures
which have registered with cpuidle.

Remove the routines cpuidle_install/uninstall_idle_handler(), which
are no longer needed.
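The resulting control flow can be modeled in plain userspace C. This is
only an illustrative sketch: the struct fields, counters, and helper
names below are stand-ins, not the real kernel API. The point is that
the architecture's idle loop calls cpuidle_idle_call() unconditionally,
and cpuidle itself falls back to a default routine when no enabled
device is present, instead of restoring a saved pm_idle_old pointer.

```c
#include <stddef.h>

/* Userspace model of the reworked dispatch; all names here are
 * illustrative stand-ins, not the real kernel data structures. */

struct cpuidle_device {
	int enabled;
	int entered;		/* counts entries into the registered idle state */
};

static struct cpuidle_device *this_cpu_device;	/* stands in for the per-CPU pointer */
static int default_idle_calls;

static void default_idle(void)
{
	default_idle_calls++;	/* in the kernel: halt, or local_irq_enable() */
}

void cpuidle_idle_call(void)
{
	struct cpuidle_device *dev = this_cpu_device;

	/* No registered/enabled device: fall back directly; there is no
	 * saved pm_idle_old pointer to chase any more. */
	if (!dev || !dev->enabled) {
		default_idle();
		return;
	}
	dev->entered++;		/* in the kernel: governor select + state enter */
}

/* Scenario helpers so the two paths are easy to check. */
int idle_without_device(void)
{
	this_cpu_device = NULL;
	default_idle_calls = 0;
	cpuidle_idle_call();
	return default_idle_calls;	/* the fallback should run exactly once */
}

static struct cpuidle_device dev0;

int idle_with_device(void)
{
	dev0.enabled = 1;
	dev0.entered = 0;
	this_cpu_device = &dev0;
	cpuidle_idle_call();
	return dev0.entered;		/* the device path should run exactly once */
}
```

The same shape holds on real hardware: whether the fallback is
default_idle() or local_irq_enable() is a compile-time choice
(CONFIG_ARCH_HAS_DEFAULT_IDLE in the patch), not a runtime pointer swap.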

Signed-off-by: Arun R Bharadwaj <[email protected]>
---
drivers/cpuidle/cpuidle.c | 62 +++++----------------------------------------
drivers/cpuidle/cpuidle.h | 6 +---
drivers/cpuidle/driver.c | 4 --
drivers/cpuidle/governor.c | 13 +++------
drivers/cpuidle/sysfs.c | 34 +++++++++++++-----------
include/linux/cpuidle.h | 4 ++
6 files changed, 37 insertions(+), 86 deletions(-)

Index: linux.trees.git/drivers/cpuidle/cpuidle.c
===================================================================
--- linux.trees.git.orig/drivers/cpuidle/cpuidle.c
+++ linux.trees.git/drivers/cpuidle/cpuidle.c
@@ -24,10 +24,6 @@
DEFINE_PER_CPU(struct cpuidle_device *, cpuidle_devices);

DEFINE_MUTEX(cpuidle_lock);
-LIST_HEAD(cpuidle_detected_devices);
-static void (*pm_idle_old)(void);
-
-static int enabled_devices;

#if defined(CONFIG_ARCH_HAS_CPU_IDLE_WAIT)
static void cpuidle_kick_cpus(void)
@@ -47,7 +43,7 @@ static int __cpuidle_register_device(str
*
* NOTE: no locks or semaphores should be used here
*/
-static void cpuidle_idle_call(void)
+void cpuidle_idle_call(void)
{
struct cpuidle_device *dev = __get_cpu_var(cpuidle_devices);
struct cpuidle_state *target_state;
@@ -55,13 +51,10 @@ static void cpuidle_idle_call(void)

/* check if the device is ready */
if (!dev || !dev->enabled) {
- if (pm_idle_old)
- pm_idle_old();
- else
#if defined(CONFIG_ARCH_HAS_DEFAULT_IDLE)
- default_idle();
+ default_idle();
#else
- local_irq_enable();
+ local_irq_enable();
#endif
return;
}
@@ -75,7 +68,11 @@ static void cpuidle_idle_call(void)
hrtimer_peek_ahead_timers();
#endif
/* ask the governor for the next state */
- next_state = cpuidle_curr_governor->select(dev);
+ if (dev->state_count > 1)
+ next_state = cpuidle_curr_governor->select(dev);
+ else
+ next_state = 0;
+
if (need_resched())
return;
target_state = &dev->states[next_state];
@@ -96,35 +93,11 @@ static void cpuidle_idle_call(void)
}

/**
- * cpuidle_install_idle_handler - installs the cpuidle idle loop handler
- */
-void cpuidle_install_idle_handler(void)
-{
- if (enabled_devices && (pm_idle != cpuidle_idle_call)) {
- /* Make sure all changes finished before we switch to new idle */
- smp_wmb();
- pm_idle = cpuidle_idle_call;
- }
-}
-
-/**
- * cpuidle_uninstall_idle_handler - uninstalls the cpuidle idle loop handler
- */
-void cpuidle_uninstall_idle_handler(void)
-{
- if (enabled_devices && pm_idle_old && (pm_idle != pm_idle_old)) {
- pm_idle = pm_idle_old;
- cpuidle_kick_cpus();
- }
-}
-
-/**
* cpuidle_pause_and_lock - temporarily disables CPUIDLE
*/
void cpuidle_pause_and_lock(void)
{
mutex_lock(&cpuidle_lock);
- cpuidle_uninstall_idle_handler();
}

EXPORT_SYMBOL_GPL(cpuidle_pause_and_lock);
@@ -134,7 +107,6 @@ EXPORT_SYMBOL_GPL(cpuidle_pause_and_lock
*/
void cpuidle_resume_and_unlock(void)
{
- cpuidle_install_idle_handler();
mutex_unlock(&cpuidle_lock);
}

@@ -182,7 +154,6 @@ int cpuidle_enable_device(struct cpuidle

dev->enabled = 1;

- enabled_devices++;
return 0;

fail_sysfs:
@@ -213,7 +184,6 @@ void cpuidle_disable_device(struct cpuid
cpuidle_curr_governor->disable(dev);

cpuidle_remove_state_sysfs(dev);
- enabled_devices--;
}

EXPORT_SYMBOL_GPL(cpuidle_disable_device);
@@ -266,7 +236,6 @@ static void poll_idle_init(struct cpuidl
*/
static int __cpuidle_register_device(struct cpuidle_device *dev)
{
- int ret;
struct sys_device *sys_dev = get_cpu_sysdev((unsigned long)dev->cpu);

if (!sys_dev)
@@ -274,16 +243,9 @@ static int __cpuidle_register_device(str
if (!try_module_get(cpuidle_curr_driver->owner))
return -EINVAL;

- init_completion(&dev->kobj_unregister);
-
poll_idle_init(dev);

per_cpu(cpuidle_devices, dev->cpu) = dev;
- list_add(&dev->device_list, &cpuidle_detected_devices);
- if ((ret = cpuidle_add_sysfs(sys_dev))) {
- module_put(cpuidle_curr_driver->owner);
- return ret;
- }

dev->registered = 1;
return 0;
@@ -305,7 +267,6 @@ int cpuidle_register_device(struct cpuid
}

cpuidle_enable_device(dev);
- cpuidle_install_idle_handler();

mutex_unlock(&cpuidle_lock);

@@ -321,8 +282,6 @@ EXPORT_SYMBOL_GPL(cpuidle_register_devic
*/
void cpuidle_unregister_device(struct cpuidle_device *dev)
{
- struct sys_device *sys_dev = get_cpu_sysdev((unsigned long)dev->cpu);
-
if (dev->registered == 0)
return;

@@ -330,9 +289,6 @@ void cpuidle_unregister_device(struct cp

cpuidle_disable_device(dev);

- cpuidle_remove_sysfs(sys_dev);
- list_del(&dev->device_list);
- wait_for_completion(&dev->kobj_unregister);
per_cpu(cpuidle_devices, dev->cpu) = NULL;

cpuidle_resume_and_unlock();
@@ -384,8 +340,6 @@ static int __init cpuidle_init(void)
{
int ret;

- pm_idle_old = pm_idle;
-
ret = cpuidle_add_class_sysfs(&cpu_sysdev_class);
if (ret)
return ret;
Index: linux.trees.git/drivers/cpuidle/governor.c
===================================================================
--- linux.trees.git.orig/drivers/cpuidle/governor.c
+++ linux.trees.git/drivers/cpuidle/governor.c
@@ -43,16 +43,14 @@ static struct cpuidle_governor * __cpuid
*/
int cpuidle_switch_governor(struct cpuidle_governor *gov)
{
- struct cpuidle_device *dev;
+ int cpu;

if (gov == cpuidle_curr_governor)
return 0;

- cpuidle_uninstall_idle_handler();
-
if (cpuidle_curr_governor) {
- list_for_each_entry(dev, &cpuidle_detected_devices, device_list)
- cpuidle_disable_device(dev);
+ for_each_online_cpu(cpu)
+ cpuidle_disable_device(per_cpu(cpuidle_devices, cpu));
module_put(cpuidle_curr_governor->owner);
}

@@ -61,9 +59,8 @@ int cpuidle_switch_governor(struct cpuid
if (gov) {
if (!try_module_get(cpuidle_curr_governor->owner))
return -EINVAL;
- list_for_each_entry(dev, &cpuidle_detected_devices, device_list)
- cpuidle_enable_device(dev);
- cpuidle_install_idle_handler();
+ for_each_online_cpu(cpu)
+ cpuidle_enable_device(per_cpu(cpuidle_devices, cpu));
printk(KERN_INFO "cpuidle: using governor %s\n", gov->name);
}

Index: linux.trees.git/include/linux/cpuidle.h
===================================================================
--- linux.trees.git.orig/include/linux/cpuidle.h
+++ linux.trees.git/include/linux/cpuidle.h
@@ -92,7 +92,6 @@ struct cpuidle_device {
struct cpuidle_state_kobj *kobjs[CPUIDLE_STATE_MAX];
struct cpuidle_state *last_state;

- struct list_head device_list;
struct kobject kobj;
struct completion kobj_unregister;
void *governor_data;
@@ -112,6 +111,9 @@ static inline int cpuidle_get_last_resid
return dev->last_residency;
}

+extern void cpuidle_idle_call(void);
+extern struct cpuidle_driver *cpuidle_curr_driver;
+

/****************************
* CPUIDLE DRIVER INTERFACE *
Index: linux.trees.git/drivers/cpuidle/cpuidle.h
===================================================================
--- linux.trees.git.orig/drivers/cpuidle/cpuidle.h
+++ linux.trees.git/drivers/cpuidle/cpuidle.h
@@ -9,9 +9,7 @@

/* For internal use only */
extern struct cpuidle_governor *cpuidle_curr_governor;
-extern struct cpuidle_driver *cpuidle_curr_driver;
extern struct list_head cpuidle_governors;
-extern struct list_head cpuidle_detected_devices;
extern struct mutex cpuidle_lock;
extern spinlock_t cpuidle_driver_lock;

@@ -27,7 +25,7 @@ extern int cpuidle_add_class_sysfs(struc
extern void cpuidle_remove_class_sysfs(struct sysdev_class *cls);
extern int cpuidle_add_state_sysfs(struct cpuidle_device *device);
extern void cpuidle_remove_state_sysfs(struct cpuidle_device *device);
-extern int cpuidle_add_sysfs(struct sys_device *sysdev);
-extern void cpuidle_remove_sysfs(struct sys_device *sysdev);
+extern int cpuidle_add_sysfs(struct cpuidle_device *device);
+extern void cpuidle_remove_sysfs(struct cpuidle_device *device);

#endif /* __DRIVER_CPUIDLE_H */
Index: linux.trees.git/drivers/cpuidle/sysfs.c
===================================================================
--- linux.trees.git.orig/drivers/cpuidle/sysfs.c
+++ linux.trees.git/drivers/cpuidle/sysfs.c
@@ -311,6 +311,13 @@ int cpuidle_add_state_sysfs(struct cpuid
int i, ret = -ENOMEM;
struct cpuidle_state_kobj *kobj;

+ init_completion(&device->kobj_unregister);
+
+ ret = cpuidle_add_sysfs(device);
+ if (ret) {
+ module_put(cpuidle_curr_driver->owner);
+ return ret;
+ }
/* state statistics */
for (i = 0; i < device->state_count; i++) {
kobj = kzalloc(sizeof(struct cpuidle_state_kobj), GFP_KERNEL);
@@ -347,35 +354,32 @@ void cpuidle_remove_state_sysfs(struct c

for (i = 0; i < device->state_count; i++)
cpuidle_free_state_kobj(device, i);
+
+ cpuidle_remove_sysfs(device);
}

/**
* cpuidle_add_sysfs - creates a sysfs instance for the target device
- * @sysdev: the target device
+ * @device: the target device
*/
-int cpuidle_add_sysfs(struct sys_device *sysdev)
+int cpuidle_add_sysfs(struct cpuidle_device *device)
{
- int cpu = sysdev->id;
- struct cpuidle_device *dev;
int error;
+ struct sys_device *sysdev = get_cpu_sysdev((unsigned long)device->cpu);

- dev = per_cpu(cpuidle_devices, cpu);
- error = kobject_init_and_add(&dev->kobj, &ktype_cpuidle, &sysdev->kobj,
- "cpuidle");
+ error = kobject_init_and_add(&device->kobj, &ktype_cpuidle,
+ &sysdev->kobj, "cpuidle");
if (!error)
- kobject_uevent(&dev->kobj, KOBJ_ADD);
+ kobject_uevent(&device->kobj, KOBJ_ADD);
return error;
}

/**
* cpuidle_remove_sysfs - deletes a sysfs instance on the target device
- * @sysdev: the target device
+ * @device: the target device
*/
-void cpuidle_remove_sysfs(struct sys_device *sysdev)
+void cpuidle_remove_sysfs(struct cpuidle_device *device)
{
- int cpu = sysdev->id;
- struct cpuidle_device *dev;
-
- dev = per_cpu(cpuidle_devices, cpu);
- kobject_put(&dev->kobj);
+ kobject_put(&device->kobj);
+ wait_for_completion(&device->kobj_unregister);
}
Index: linux.trees.git/drivers/cpuidle/driver.c
===================================================================
--- linux.trees.git.orig/drivers/cpuidle/driver.c
+++ linux.trees.git/drivers/cpuidle/driver.c
@@ -27,10 +27,6 @@ int cpuidle_register_driver(struct cpuid
return -EINVAL;

spin_lock(&cpuidle_driver_lock);
- if (cpuidle_curr_driver) {
- spin_unlock(&cpuidle_driver_lock);
- return -EBUSY;
- }
cpuidle_curr_driver = drv;
spin_unlock(&cpuidle_driver_lock);

2009-10-08 09:51:12

by Arun R Bharadwaj

Subject: [v8 PATCH 2/8]: cpuidle: implement a list based approach to register a set of idle routines.

* Arun R Bharadwaj <[email protected]> [2009-10-08 15:18:28]:

Implement a list-based registering mechanism for architectures which
have multiple sets of idle routines to register.

Currently, x86 does this by merely setting pm_idle = idle_routine, and
managing this pm_idle pointer is messy.

To give an example of how this mechanism works:
On x86, the initial idle routine is selected from the set of
poll/mwait/c1e/default idle loops, and the selected loop is registered
with cpuidle as a single-state cpuidle device. Once ACPI comes up, it
registers another set of idle states on top of this one. If a module
later registers yet another set of idle loops, that set is added to
the list as well.

This provides a clean way of registering and unregistering idle state
routines.

In the current implementation, pm_idle points to the idle routine in
use, the old routine has to be saved separately, and confusion arises
when a module registers or unregisters an idle routine.
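The stack-like behavior of the per-CPU list can be sketched in userspace
C. This is a hedged model, not the kernel code: a plain singly linked
list stands in for the per-CPU list_head, and "active" stands in for
"has its states exposed in sysfs"; the names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* A singly linked list models the per-CPU cpuidle_devices_list: the
 * most recently added set sits at the head and is the active one. */
struct idle_set {
	const char *name;
	struct idle_set *next;
};

static struct idle_set *idle_list_head;

/* Register: push to the front. In the patch this is the point where the
 * previous head's state sysfs entries are removed and the new head's
 * entries are added. */
void idle_add(struct idle_set *s)
{
	s->next = idle_list_head;
	idle_list_head = s;
}

/* Unregister: unlink the set wherever it sits; if it was the head, the
 * set below it automatically becomes active again. */
void idle_remove(struct idle_set *s)
{
	struct idle_set **p = &idle_list_head;

	while (*p && *p != s)
		p = &(*p)->next;
	if (*p)
		*p = s->next;
}

const char *active_idle_set(void)
{
	return idle_list_head ? idle_list_head->name : "default";
}

/* Walk through the x86 example from the text: a boot-selected loop,
 * then ACPI registering on top, then ACPI unregistering again. */
const char *demo(void)
{
	static struct idle_set boot = { "boot", NULL };
	static struct idle_set acpi = { "acpi", NULL };

	idle_add(&boot);
	idle_add(&acpi);	/* ACPI's states take over */
	idle_remove(&acpi);	/* the boot-selected loop is active again */
	return active_idle_set();
}
```

Unregistering the topmost set restores the one beneath it without any
saved-pointer bookkeeping, which is exactly the confusion the pm_idle
scheme suffered from.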


Signed-off-by: Arun R Bharadwaj <[email protected]>
---
drivers/cpuidle/cpuidle.c | 54 ++++++++++++++++++++++++++++++++++++++++------
include/linux/cpuidle.h | 1
2 files changed, 48 insertions(+), 7 deletions(-)

Index: linux.trees.git/drivers/cpuidle/cpuidle.c
===================================================================
--- linux.trees.git.orig/drivers/cpuidle/cpuidle.c
+++ linux.trees.git/drivers/cpuidle/cpuidle.c
@@ -22,6 +22,7 @@
#include "cpuidle.h"

DEFINE_PER_CPU(struct cpuidle_device *, cpuidle_devices);
+DEFINE_PER_CPU(struct list_head, cpuidle_devices_list);

DEFINE_MUTEX(cpuidle_lock);

@@ -112,6 +113,45 @@ void cpuidle_resume_and_unlock(void)

EXPORT_SYMBOL_GPL(cpuidle_resume_and_unlock);

+int cpuidle_add_to_list(struct cpuidle_device *dev)
+{
+ int ret, cpu = dev->cpu;
+ struct cpuidle_device *old_dev;
+
+ if (!list_empty(&per_cpu(cpuidle_devices_list, cpu))) {
+ old_dev = list_first_entry(&per_cpu(cpuidle_devices_list, cpu),
+ struct cpuidle_device, idle_list);
+ cpuidle_remove_state_sysfs(old_dev);
+ }
+
+ list_add(&dev->idle_list, &per_cpu(cpuidle_devices_list, cpu));
+ ret = cpuidle_add_state_sysfs(dev);
+ return ret;
+}
+
+void cpuidle_remove_from_list(struct cpuidle_device *dev)
+{
+ struct cpuidle_device *temp_dev;
+ struct list_head *pos;
+ int ret, cpu = dev->cpu;
+
+ list_for_each(pos, &per_cpu(cpuidle_devices_list, cpu)) {
+ temp_dev = container_of(pos, struct cpuidle_device, idle_list);
+ if (dev == temp_dev) {
+ list_del(&temp_dev->idle_list);
+ cpuidle_remove_state_sysfs(temp_dev);
+ break;
+ }
+ }
+
+ if (!list_empty(&per_cpu(cpuidle_devices_list, cpu))) {
+ temp_dev = list_first_entry(&per_cpu(cpuidle_devices_list, cpu),
+ struct cpuidle_device, idle_list);
+ ret = cpuidle_add_state_sysfs(temp_dev);
+ }
+ cpuidle_kick_cpus();
+}
+
/**
* cpuidle_enable_device - enables idle PM for a CPU
* @dev: the CPU
@@ -136,9 +176,6 @@ int cpuidle_enable_device(struct cpuidle
return ret;
}

- if ((ret = cpuidle_add_state_sysfs(dev)))
- return ret;
-
if (cpuidle_curr_governor->enable &&
(ret = cpuidle_curr_governor->enable(dev)))
goto fail_sysfs;
@@ -157,7 +194,7 @@ int cpuidle_enable_device(struct cpuidle
return 0;

fail_sysfs:
- cpuidle_remove_state_sysfs(dev);
+ cpuidle_remove_from_list(dev);

return ret;
}
@@ -182,8 +219,6 @@ void cpuidle_disable_device(struct cpuid

if (cpuidle_curr_governor->disable)
cpuidle_curr_governor->disable(dev);
-
- cpuidle_remove_state_sysfs(dev);
}

EXPORT_SYMBOL_GPL(cpuidle_disable_device);
@@ -267,6 +302,7 @@ int cpuidle_register_device(struct cpuid
}

cpuidle_enable_device(dev);
+ cpuidle_add_to_list(dev);

mutex_unlock(&cpuidle_lock);

@@ -288,6 +324,7 @@ void cpuidle_unregister_device(struct cp
cpuidle_pause_and_lock();

cpuidle_disable_device(dev);
+ cpuidle_remove_from_list(dev);

per_cpu(cpuidle_devices, dev->cpu) = NULL;

@@ -338,12 +375,15 @@ static inline void latency_notifier_init
*/
static int __init cpuidle_init(void)
{
- int ret;
+ int ret, cpu;

ret = cpuidle_add_class_sysfs(&cpu_sysdev_class);
if (ret)
return ret;

+ for_each_possible_cpu(cpu)
+ INIT_LIST_HEAD(&per_cpu(cpuidle_devices_list, cpu));
+
latency_notifier_init(&cpuidle_latency_notifier);

return 0;
Index: linux.trees.git/include/linux/cpuidle.h
===================================================================
--- linux.trees.git.orig/include/linux/cpuidle.h
+++ linux.trees.git/include/linux/cpuidle.h
@@ -92,6 +92,7 @@ struct cpuidle_device {
struct cpuidle_state_kobj *kobjs[CPUIDLE_STATE_MAX];
struct cpuidle_state *last_state;

+ struct list_head idle_list;
struct kobject kobj;
struct completion kobj_unregister;
void *governor_data;

2009-10-08 09:52:14

by Arun R Bharadwaj

Subject: [v8 PATCH 3/8]: x86: refactor x86 idle power management code and remove all instances of pm_idle.

* Arun R Bharadwaj <[email protected]> [2009-10-08 15:18:28]:

This patch removes all instances of pm_idle from x86.

pm_idle, which was earlier called from the cpu_idle() idle loop,
is replaced by cpuidle_idle_call().

x86 now also registers with cpuidle when the idle routine is selected,
by populating the cpuidle_device data structure for each cpu.

This is replicated for the apm module and for xen, which also used
pm_idle.


Signed-off-by: Arun R Bharadwaj <[email protected]>
---
arch/x86/kernel/apm_32.c | 55 +++++++++++++++++++++++-
arch/x86/kernel/process.c | 93 ++++++++++++++++++++++++++----------------
arch/x86/kernel/process_32.c | 3 -
arch/x86/kernel/process_64.c | 3 -
arch/x86/xen/setup.c | 40 +++++++++++++++++-
drivers/acpi/processor_core.c | 9 ++--
drivers/cpuidle/cpuidle.c | 16 ++++---
include/linux/cpuidle.h | 1
8 files changed, 172 insertions(+), 48 deletions(-)

Index: linux.trees.git/arch/x86/kernel/process.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/process.c
+++ linux.trees.git/arch/x86/kernel/process.c
@@ -9,6 +9,7 @@
#include <linux/pm.h>
#include <linux/clockchips.h>
#include <linux/random.h>
+#include <linux/cpuidle.h>
#include <trace/events/power.h>
#include <asm/system.h>
#include <asm/apic.h>
@@ -244,12 +245,6 @@ int sys_vfork(struct pt_regs *regs)
unsigned long boot_option_idle_override = 0;
EXPORT_SYMBOL(boot_option_idle_override);

-/*
- * Powermanagement idle function, if any..
- */
-void (*pm_idle)(void);
-EXPORT_SYMBOL(pm_idle);
-
#ifdef CONFIG_X86_32
/*
* This halt magic was a workaround for ancient floppy DMA
@@ -329,17 +324,15 @@ static void do_nothing(void *unused)
}

/*
- * cpu_idle_wait - Used to ensure that all the CPUs discard old value of
- * pm_idle and update to new pm_idle value. Required while changing pm_idle
- * handler on SMP systems.
+ * cpu_idle_wait - Required while changing idle routine handler on SMP systems.
*
- * Caller must have changed pm_idle to the new value before the call. Old
- * pm_idle value will not be used by any CPU after the return of this function.
+ * Caller must have changed idle routine to the new value before the call. Old
+ * value will not be used by any CPU after the return of this function.
*/
void cpu_idle_wait(void)
{
smp_mb();
- /* kick all the CPUs so that they exit out of pm_idle */
+ /* kick all the CPUs so that they exit out of idle loop */
smp_call_function(do_nothing, NULL, 1);
}
EXPORT_SYMBOL_GPL(cpu_idle_wait);
@@ -387,20 +380,6 @@ static void mwait_idle(void)
}

/*
- * On SMP it's slightly faster (but much more power-consuming!)
- * to poll the ->work.need_resched flag instead of waiting for the
- * cross-CPU IPI to arrive. Use this option with caution.
- */
-static void poll_idle(void)
-{
- trace_power_start(POWER_CSTATE, 0);
- local_irq_enable();
- while (!need_resched())
- cpu_relax();
- trace_power_end(0);
-}
-
-/*
* mwait selection logic:
*
* It depends on the CPU. For AMD CPUs that support MWAIT this is
@@ -518,15 +497,59 @@ static void c1e_idle(void)
default_idle();
}

+static void (*local_idle)(void);
+DEFINE_PER_CPU(struct cpuidle_device, idle_devices);
+
+struct cpuidle_driver cpuidle_default_driver = {
+ .name = "cpuidle_default",
+};
+
+static int local_idle_loop(struct cpuidle_device *dev, struct cpuidle_state *st)
+{
+ ktime_t t1, t2;
+ s64 diff;
+ int ret;
+
+ t1 = ktime_get();
+ local_idle();
+ t2 = ktime_get();
+
+ diff = ktime_to_us(ktime_sub(t2, t1));
+ if (diff > INT_MAX)
+ diff = INT_MAX;
+ ret = (int) diff;
+
+ return ret;
+}
+
+static int setup_cpuidle_simple(void)
+{
+ struct cpuidle_device *dev;
+ int cpu;
+
+ if (!cpuidle_curr_driver)
+ cpuidle_register_driver(&cpuidle_default_driver);
+
+ for_each_online_cpu(cpu) {
+ dev = &per_cpu(idle_devices, cpu);
+ dev->cpu = cpu;
+ dev->states[0].enter = local_idle_loop;
+ dev->state_count = 1;
+ cpuidle_register_device(dev);
+ }
+ return 0;
+}
+device_initcall(setup_cpuidle_simple);
+
void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c)
{
#ifdef CONFIG_SMP
- if (pm_idle == poll_idle && smp_num_siblings > 1) {
+ if (local_idle == poll_idle && smp_num_siblings > 1) {
printk(KERN_WARNING "WARNING: polling idle and HT enabled,"
" performance may degrade.\n");
}
#endif
- if (pm_idle)
+ if (local_idle)
return;

if (cpu_has(c, X86_FEATURE_MWAIT) && mwait_usable(c)) {
@@ -534,18 +557,20 @@ void __cpuinit select_idle_routine(const
* One CPU supports mwait => All CPUs supports mwait
*/
printk(KERN_INFO "using mwait in idle threads.\n");
- pm_idle = mwait_idle;
+ local_idle = mwait_idle;
} else if (check_c1e_idle(c)) {
printk(KERN_INFO "using C1E aware idle routine\n");
- pm_idle = c1e_idle;
+ local_idle = c1e_idle;
} else
- pm_idle = default_idle;
+ local_idle = default_idle;
+
+ return;
}

void __init init_c1e_mask(void)
{
/* If we're using c1e_idle, we need to allocate c1e_mask. */
- if (pm_idle == c1e_idle)
+ if (local_idle == c1e_idle)
zalloc_cpumask_var(&c1e_mask, GFP_KERNEL);
}

@@ -556,7 +581,7 @@ static int __init idle_setup(char *str)

if (!strcmp(str, "poll")) {
printk("using polling idle threads.\n");
- pm_idle = poll_idle;
+ local_idle = poll_idle;
} else if (!strcmp(str, "mwait"))
force_mwait = 1;
else if (!strcmp(str, "halt")) {
@@ -567,7 +592,7 @@ static int __init idle_setup(char *str)
* To continue to load the CPU idle driver, don't touch
* the boot_option_idle_override.
*/
- pm_idle = default_idle;
+ local_idle = default_idle;
idle_halt = 1;
return 0;
} else if (!strcmp(str, "nomwait")) {
Index: linux.trees.git/arch/x86/kernel/process_32.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/process_32.c
+++ linux.trees.git/arch/x86/kernel/process_32.c
@@ -40,6 +40,7 @@
#include <linux/uaccess.h>
#include <linux/io.h>
#include <linux/kdebug.h>
+#include <linux/cpuidle.h>

#include <asm/pgtable.h>
#include <asm/system.h>
@@ -113,7 +114,7 @@ void cpu_idle(void)
local_irq_disable();
/* Don't trace irqs off for idle */
stop_critical_timings();
- pm_idle();
+ cpuidle_idle_call();
start_critical_timings();
}
tick_nohz_restart_sched_tick();
Index: linux.trees.git/arch/x86/kernel/process_64.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/process_64.c
+++ linux.trees.git/arch/x86/kernel/process_64.c
@@ -39,6 +39,7 @@
#include <linux/io.h>
#include <linux/ftrace.h>
#include <linux/dmi.h>
+#include <linux/cpuidle.h>

#include <asm/pgtable.h>
#include <asm/system.h>
@@ -142,7 +143,7 @@ void cpu_idle(void)
enter_idle();
/* Don't trace irqs off for idle */
stop_critical_timings();
- pm_idle();
+ cpuidle_idle_call();
start_critical_timings();
/* In many cases the interrupt that ended idle
has already called exit_idle. But some idle
Index: linux.trees.git/arch/x86/kernel/apm_32.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/apm_32.c
+++ linux.trees.git/arch/x86/kernel/apm_32.c
@@ -2257,6 +2257,56 @@ static struct dmi_system_id __initdata a
{ }
};

+DEFINE_PER_CPU(struct cpuidle_device, apm_idle_devices);
+
+struct cpuidle_driver cpuidle_apm_driver = {
+ .name = "cpuidle_apm",
+};
+
+static int apm_idle_loop(struct cpuidle_device *dev, struct cpuidle_state *st)
+{
+ ktime_t t1, t2;
+ s64 diff;
+ int ret;
+
+ t1 = ktime_get();
+ apm_cpu_idle();
+ t2 = ktime_get();
+
+ diff = ktime_to_us(ktime_sub(t2, t1));
+ if (diff > INT_MAX)
+ diff = INT_MAX;
+ ret = (int) diff;
+
+ return ret;
+}
+
+void __cpuinit setup_cpuidle_apm(void)
+{
+ struct cpuidle_device *dev;
+
+ if (!cpuidle_curr_driver)
+ cpuidle_register_driver(&cpuidle_apm_driver);
+
+ dev = &per_cpu(apm_idle_devices, smp_processor_id());
+ dev->cpu = smp_processor_id();
+ dev->states[0].enter = apm_idle_loop;
+ dev->state_count = 1;
+ cpuidle_register_device(dev);
+}
+
+void exit_cpuidle_apm(void)
+{
+ struct cpuidle_device *dev;
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ dev = &per_cpu(apm_idle_devices, cpu);
+ cpuidle_unregister_device(dev);
+ }
+}
+
+
/*
* Just start the APM thread. We do NOT want to do APM BIOS
* calls from anything but the APM thread, if for no other reason
@@ -2394,8 +2444,7 @@ static int __init apm_init(void)
if (HZ != 100)
idle_period = (idle_period * HZ) / 100;
if (idle_threshold < 100) {
- original_pm_idle = pm_idle;
- pm_idle = apm_cpu_idle;
+ setup_cpuidle_apm();
set_pm_idle = 1;
}

@@ -2407,7 +2456,7 @@ static void __exit apm_exit(void)
int error;

if (set_pm_idle) {
- pm_idle = original_pm_idle;
+ exit_cpuidle_apm();
/*
* We are about to unload the current idle thread pm callback
* (pm_idle), Wait for all processors to update cached/local
Index: linux.trees.git/arch/x86/xen/setup.c
===================================================================
--- linux.trees.git.orig/arch/x86/xen/setup.c
+++ linux.trees.git/arch/x86/xen/setup.c
@@ -8,6 +8,7 @@
#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/pm.h>
+#include <linux/cpuidle.h>

#include <asm/elf.h>
#include <asm/vdso.h>
@@ -151,6 +152,43 @@ void __cpuinit xen_enable_syscall(void)
#endif /* CONFIG_X86_64 */
}

+DEFINE_PER_CPU(struct cpuidle_device, xen_idle_devices);
+struct cpuidle_driver cpuidle_xen_driver = {
+ .name = "cpuidle_xen",
+};
+
+static int xen_idle_loop(struct cpuidle_device *dev, struct cpuidle_state *st)
+{
+ ktime_t t1, t2;
+ s64 diff;
+ int ret;
+
+ t1 = ktime_get();
+ xen_idle();
+ t2 = ktime_get();
+
+ diff = ktime_to_us(ktime_sub(t2, t1));
+ if (diff > INT_MAX)
+ diff = INT_MAX;
+ ret = (int) diff;
+
+ return ret;
+}
+
+void __cpuinit setup_cpuidle_xen(void)
+{
+ struct cpuidle_device *dev;
+
+ if (!cpuidle_curr_driver)
+ cpuidle_register_driver(&cpuidle_xen_driver);
+
+ dev = &per_cpu(xen_idle_devices, smp_processor_id());
+ dev->cpu = smp_processor_id();
+ dev->states[0].enter = xen_idle_loop;
+ dev->state_count = 1;
+ cpuidle_register_device(dev);
+}
+
void __init xen_arch_setup(void)
{
struct physdev_set_iopl set_iopl;
@@ -186,7 +224,7 @@ void __init xen_arch_setup(void)
MAX_GUEST_CMDLINE > COMMAND_LINE_SIZE ?
COMMAND_LINE_SIZE : MAX_GUEST_CMDLINE);

- pm_idle = xen_idle;
+ setup_cpuidle_xen();

paravirt_disable_iospace();

Index: linux.trees.git/drivers/acpi/processor_core.c
===================================================================
--- linux.trees.git.orig/drivers/acpi/processor_core.c
+++ linux.trees.git/drivers/acpi/processor_core.c
@@ -1150,9 +1150,12 @@ static int __init acpi_processor_init(vo
* should not use mwait for CPU-states.
*/
dmi_check_system(processor_idle_dmi_table);
- result = cpuidle_register_driver(&acpi_idle_driver);
- if (result < 0)
- goto out_proc;
+
+ if (!boot_option_idle_override) {
+ result = cpuidle_register_driver(&acpi_idle_driver);
+ if (result < 0)
+ goto out_proc;
+ }

result = acpi_bus_register_driver(&acpi_processor_driver);
if (result < 0)
Index: linux.trees.git/drivers/cpuidle/cpuidle.c
===================================================================
--- linux.trees.git.orig/drivers/cpuidle/cpuidle.c
+++ linux.trees.git/drivers/cpuidle/cpuidle.c
@@ -224,16 +224,22 @@ void cpuidle_disable_device(struct cpuid
EXPORT_SYMBOL_GPL(cpuidle_disable_device);

#ifdef CONFIG_ARCH_HAS_CPU_RELAX
-static int poll_idle(struct cpuidle_device *dev, struct cpuidle_state *st)
+void poll_idle(void)
+{
+ local_irq_enable();
+ while (!need_resched())
+ cpu_relax();
+}
+
+static int poll_idle_loop(struct cpuidle_device *dev, struct cpuidle_state *st)
{
ktime_t t1, t2;
s64 diff;
int ret;

t1 = ktime_get();
- local_irq_enable();
- while (!need_resched())
- cpu_relax();
+
+ poll_idle();

t2 = ktime_get();
diff = ktime_to_us(ktime_sub(t2, t1));
@@ -256,7 +262,7 @@ static void poll_idle_init(struct cpuidl
state->target_residency = 0;
state->power_usage = -1;
state->flags = CPUIDLE_FLAG_POLL;
- state->enter = poll_idle;
+ state->enter = poll_idle_loop;
}
#else
static void poll_idle_init(struct cpuidle_device *dev) {}
Index: linux.trees.git/include/linux/cpuidle.h
===================================================================
--- linux.trees.git.orig/include/linux/cpuidle.h
+++ linux.trees.git/include/linux/cpuidle.h
@@ -187,6 +187,7 @@ static inline void cpuidle_unregister_go

#ifdef CONFIG_ARCH_HAS_CPU_RELAX
#define CPUIDLE_DRIVER_STATE_START 1
+extern void poll_idle(void);
#else
#define CPUIDLE_DRIVER_STATE_START 0
#endif

2009-10-08 09:52:59

by Arun R Bharadwaj

Subject: [v8 PATCH 4/8]: POWER: enable cpuidle for POWER.

* Arun R Bharadwaj <[email protected]> [2009-10-08 15:18:28]:

This patch enables the cpuidle option in Kconfig for pSeries.

Currently the cpuidle infrastructure is enabled only for x86 and ARM.
This code is borrowed almost completely from x86 to enable cpuidle
for pSeries.

Signed-off-by: Arun R Bharadwaj <[email protected]>
---
arch/powerpc/Kconfig | 17 +++++++++++++++++
arch/powerpc/include/asm/system.h | 2 ++
arch/powerpc/kernel/idle.c | 19 +++++++++++++++++++
3 files changed, 38 insertions(+)

Index: linux.trees.git/arch/powerpc/Kconfig
===================================================================
--- linux.trees.git.orig/arch/powerpc/Kconfig
+++ linux.trees.git/arch/powerpc/Kconfig
@@ -91,6 +91,9 @@ config ARCH_HAS_ILOG2_U64
bool
default y if 64BIT

+config ARCH_HAS_CPU_IDLE_WAIT
+ def_bool y
+
config GENERIC_HWEIGHT
bool
default y
@@ -247,6 +250,20 @@ source "kernel/Kconfig.freezer"
source "arch/powerpc/sysdev/Kconfig"
source "arch/powerpc/platforms/Kconfig"

+menu "Power management options"
+
+source "drivers/cpuidle/Kconfig"
+
+config PSERIES_PROCESSOR_IDLE
+ bool "Idle Power Management Support for pSeries"
+ depends on PPC_PSERIES && CPU_IDLE
+ default y
+ help
+ Idle Power Management Support for pSeries. This hooks onto cpuidle
+ infrastructure to help in idle cpu power management.
+
+endmenu
+
menu "Kernel options"

config HIGHMEM
Index: linux.trees.git/arch/powerpc/include/asm/system.h
===================================================================
--- linux.trees.git.orig/arch/powerpc/include/asm/system.h
+++ linux.trees.git/arch/powerpc/include/asm/system.h
@@ -546,5 +546,7 @@ extern void account_system_vtime(struct

extern struct dentry *powerpc_debugfs_root;

+void cpu_idle_wait(void);
+
#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_SYSTEM_H */
Index: linux.trees.git/arch/powerpc/kernel/idle.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/kernel/idle.c
+++ linux.trees.git/arch/powerpc/kernel/idle.c
@@ -102,6 +102,25 @@ void cpu_idle(void)
}
}

+static void do_nothing(void *unused)
+{
+}
+
+/*
+ * cpu_idle_wait - Used to ensure that all the CPUs come out of the old
+ * idle loop and start using the new idle loop.
+ * Required while changing idle handler on SMP systems.
+ * Caller must have changed idle handler to the new value before the call.
+ */
+void cpu_idle_wait(void)
+{
+ /* Ensure that new value of idle is set */
+ smp_mb();
+ /* kick all the CPUs so that they exit out of old idle routine */
+ smp_call_function(do_nothing, NULL, 1);
+}
+EXPORT_SYMBOL_GPL(cpu_idle_wait);
+
int powersave_nap;

#ifdef CONFIG_SYSCTL

2009-10-08 09:53:54

by Arun R Bharadwaj

Subject: [v8 PATCH 5/8]: pSeries/cpuidle: remove dedicate/shared idle loops, which will be moved to arch/powerpc/platforms/pseries/processor_idle.c

* Arun R Bharadwaj <[email protected]> [2009-10-08 15:18:28]:

This patch removes the routines pseries_shared_idle_sleep and
pseries_dedicated_idle_sleep, since this functionality is now
implemented as part of arch/powerpc/platforms/pseries/processor_idle.c.

Also, similar to x86, cpuidle_idle_call() is called from the cpu_idle()
idle loop instead of ppc_md.power_save.


Signed-off-by: Arun R Bharadwaj <[email protected]>
---
arch/powerpc/kernel/idle.c | 50 +++++++-----------
arch/powerpc/platforms/pseries/setup.c | 89 ---------------------------------
2 files changed, 22 insertions(+), 117 deletions(-)

Index: linux.trees.git/arch/powerpc/platforms/pseries/setup.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/platforms/pseries/setup.c
+++ linux.trees.git/arch/powerpc/platforms/pseries/setup.c
@@ -75,9 +75,6 @@ EXPORT_SYMBOL(CMO_PageSize);

int fwnmi_active; /* TRUE if an FWNMI handler is present */

-static void pseries_shared_idle_sleep(void);
-static void pseries_dedicated_idle_sleep(void);
-
static struct device_node *pSeries_mpic_node;

static void pSeries_show_cpuinfo(struct seq_file *m)
@@ -297,18 +294,8 @@ static void __init pSeries_setup_arch(vo
pSeries_nvram_init();

/* Choose an idle loop */
- if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
+ if (firmware_has_feature(FW_FEATURE_SPLPAR))
vpa_init(boot_cpuid);
- if (get_lppaca()->shared_proc) {
- printk(KERN_DEBUG "Using shared processor idle loop\n");
- ppc_md.power_save = pseries_shared_idle_sleep;
- } else {
- printk(KERN_DEBUG "Using dedicated idle loop\n");
- ppc_md.power_save = pseries_dedicated_idle_sleep;
- }
- } else {
- printk(KERN_DEBUG "Using default idle loop\n");
- }

if (firmware_has_feature(FW_FEATURE_LPAR))
ppc_md.enable_pmcs = pseries_lpar_enable_pmcs;
@@ -496,80 +483,6 @@ static int __init pSeries_probe(void)
return 1;
}

-
-DECLARE_PER_CPU(unsigned long, smt_snooze_delay);
-
-static void pseries_dedicated_idle_sleep(void)
-{
- unsigned int cpu = smp_processor_id();
- unsigned long start_snooze;
- unsigned long in_purr, out_purr;
-
- /*
- * Indicate to the HV that we are idle. Now would be
- * a good time to find other work to dispatch.
- */
- get_lppaca()->idle = 1;
- get_lppaca()->donate_dedicated_cpu = 1;
- in_purr = mfspr(SPRN_PURR);
-
- /*
- * We come in with interrupts disabled, and need_resched()
- * has been checked recently. If we should poll for a little
- * while, do so.
- */
- if (__get_cpu_var(smt_snooze_delay)) {
- start_snooze = get_tb() +
- __get_cpu_var(smt_snooze_delay) * tb_ticks_per_usec;
- local_irq_enable();
- set_thread_flag(TIF_POLLING_NRFLAG);
-
- while (get_tb() < start_snooze) {
- if (need_resched() || cpu_is_offline(cpu))
- goto out;
- ppc64_runlatch_off();
- HMT_low();
- HMT_very_low();
- }
-
- HMT_medium();
- clear_thread_flag(TIF_POLLING_NRFLAG);
- smp_mb();
- local_irq_disable();
- if (need_resched() || cpu_is_offline(cpu))
- goto out;
- }
-
- cede_processor();
-
-out:
- HMT_medium();
- out_purr = mfspr(SPRN_PURR);
- get_lppaca()->wait_state_cycles += out_purr - in_purr;
- get_lppaca()->donate_dedicated_cpu = 0;
- get_lppaca()->idle = 0;
-}
-
-static void pseries_shared_idle_sleep(void)
-{
- /*
- * Indicate to the HV that we are idle. Now would be
- * a good time to find other work to dispatch.
- */
- get_lppaca()->idle = 1;
-
- /*
- * Yield the processor to the hypervisor. We return if
- * an external interrupt occurs (which are driven prior
- * to returning here) or if a prod occurs from another
- * processor. When returning here, external interrupts
- * are enabled.
- */
- cede_processor();
-
- get_lppaca()->idle = 0;
-}
-
static int pSeries_pci_probe_mode(struct pci_bus *bus)
{
if (firmware_has_feature(FW_FEATURE_LPAR))
Index: linux.trees.git/arch/powerpc/kernel/idle.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/kernel/idle.c
+++ linux.trees.git/arch/powerpc/kernel/idle.c
@@ -25,6 +25,7 @@
#include <linux/cpu.h>
#include <linux/sysctl.h>
#include <linux/tick.h>
+#include <linux/cpuidle.h>

#include <asm/system.h>
#include <asm/processor.h>
@@ -60,35 +61,26 @@ void cpu_idle(void)
while (!need_resched() && !cpu_should_die()) {
ppc64_runlatch_off();

- if (ppc_md.power_save) {
- clear_thread_flag(TIF_POLLING_NRFLAG);
- /*
- * smp_mb is so clearing of TIF_POLLING_NRFLAG
- * is ordered w.r.t. need_resched() test.
- */
- smp_mb();
- local_irq_disable();
-
- /* Don't trace irqs off for idle */
- stop_critical_timings();
-
- /* check again after disabling irqs */
- if (!need_resched() && !cpu_should_die())
- ppc_md.power_save();
-
- start_critical_timings();
-
- local_irq_enable();
- set_thread_flag(TIF_POLLING_NRFLAG);
-
- } else {
- /*
- * Go into low thread priority and possibly
- * low power mode.
- */
- HMT_low();
- HMT_very_low();
- }
+ clear_thread_flag(TIF_POLLING_NRFLAG);
+ /*
+ * smp_mb is so clearing of TIF_POLLING_NRFLAG
+ * is ordered w.r.t. need_resched() test.
+ */
+ smp_mb();
+ local_irq_disable();
+
+ /* Don't trace irqs off for idle */
+ stop_critical_timings();
+
+ /* check again after disabling irqs */
+ if (!need_resched() && !cpu_should_die())
+ cpuidle_idle_call();
+
+ start_critical_timings();
+
+ local_irq_enable();
+ set_thread_flag(TIF_POLLING_NRFLAG);
+
}

HMT_medium();

2009-10-08 09:54:41

by Arun R Bharadwaj

[permalink] [raw]
Subject: [v8 PATCH 6/8]: POWER: add a default_idle idle loop for POWER.

* Arun R Bharadwaj <[email protected]> [2009-10-08 15:18:28]:

In arch/powerpc/kernel/idle.c, create a default_idle() routine by moving
the fallback path out of the cpu_idle() idle loop. The cpuidle
infrastructure needs this in order to call default_idle() when no other
idle routines have been registered yet. Functionality remains the same;
the code is only slightly rearranged.


Signed-off-by: Arun R Bharadwaj <[email protected]>
---
arch/powerpc/Kconfig | 3 +++
arch/powerpc/include/asm/system.h | 1 +
arch/powerpc/kernel/idle.c | 6 ++++++
3 files changed, 10 insertions(+)

Index: linux.trees.git/arch/powerpc/Kconfig
===================================================================
--- linux.trees.git.orig/arch/powerpc/Kconfig
+++ linux.trees.git/arch/powerpc/Kconfig
@@ -94,6 +94,9 @@ config ARCH_HAS_ILOG2_U64
config ARCH_HAS_CPU_IDLE_WAIT
def_bool y

+config ARCH_HAS_DEFAULT_IDLE
+ def_bool y
+
config GENERIC_HWEIGHT
bool
default y
Index: linux.trees.git/arch/powerpc/include/asm/system.h
===================================================================
--- linux.trees.git.orig/arch/powerpc/include/asm/system.h
+++ linux.trees.git/arch/powerpc/include/asm/system.h
@@ -218,6 +218,7 @@ extern unsigned long klimit;
extern void *alloc_maybe_bootmem(size_t size, gfp_t mask);
extern void *zalloc_maybe_bootmem(size_t size, gfp_t mask);

+extern void default_idle(void);
extern int powersave_nap; /* set if nap mode can be used in idle loop */

/*
Index: linux.trees.git/arch/powerpc/kernel/idle.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/kernel/idle.c
+++ linux.trees.git/arch/powerpc/kernel/idle.c
@@ -113,6 +113,12 @@ void cpu_idle_wait(void)
}
EXPORT_SYMBOL_GPL(cpu_idle_wait);

+void default_idle(void)
+{
+ HMT_low();
+ HMT_very_low();
+}
+
int powersave_nap;

#ifdef CONFIG_SYSCTL

2009-10-08 09:55:27

by Arun R Bharadwaj

[permalink] [raw]
Subject: [v8 PATCH 7/8]: pSeries: implement pSeries processor idle module.

* Arun R Bharadwaj <[email protected]> [2009-10-08 15:18:28]:

This patch creates arch/powerpc/platforms/pseries/processor_idle.c,
which implements the cpuidle infrastructure for pSeries.
It implements pseries_cpuidle_loop(), the main idle entry point called
from cpu_idle(). Based on the decision taken by the cpuidle governor, it
enters either dedicated_snooze_loop() or dedicated_cede_loop() on a
dedicated LPAR, or shared_cede_loop() on a shared-processor LPAR.

Signed-off-by: Arun R Bharadwaj <[email protected]>
---
arch/powerpc/include/asm/system.h | 1
arch/powerpc/kernel/sysfs.c | 2
arch/powerpc/platforms/pseries/Makefile | 1
arch/powerpc/platforms/pseries/processor_idle.c | 210 ++++++++++++++++++++++++
arch/powerpc/platforms/pseries/pseries.h | 8
5 files changed, 222 insertions(+)

Index: linux.trees.git/arch/powerpc/platforms/pseries/Makefile
===================================================================
--- linux.trees.git.orig/arch/powerpc/platforms/pseries/Makefile
+++ linux.trees.git/arch/powerpc/platforms/pseries/Makefile
@@ -26,3 +26,4 @@ obj-$(CONFIG_HCALL_STATS) += hvCall_inst
obj-$(CONFIG_PHYP_DUMP) += phyp_dump.o
obj-$(CONFIG_CMM) += cmm.o
obj-$(CONFIG_DTL) += dtl.o
+obj-$(CONFIG_PSERIES_PROCESSOR_IDLE) += processor_idle.o
Index: linux.trees.git/arch/powerpc/platforms/pseries/pseries.h
===================================================================
--- linux.trees.git.orig/arch/powerpc/platforms/pseries/pseries.h
+++ linux.trees.git/arch/powerpc/platforms/pseries/pseries.h
@@ -10,6 +10,8 @@
#ifndef _PSERIES_PSERIES_H
#define _PSERIES_PSERIES_H

+#include <linux/cpuidle.h>
+
extern void __init fw_feature_init(const char *hypertas, unsigned long len);

struct pt_regs;
@@ -40,4 +42,10 @@ extern unsigned long rtas_poweron_auto;

extern void find_udbg_vterm(void);

+DECLARE_PER_CPU(unsigned long, smt_snooze_delay);
+
+#ifdef CONFIG_PSERIES_PROCESSOR_IDLE
+extern struct cpuidle_driver pseries_idle_driver;
+#endif
+
#endif /* _PSERIES_PSERIES_H */
Index: linux.trees.git/arch/powerpc/platforms/pseries/processor_idle.c
===================================================================
--- /dev/null
+++ linux.trees.git/arch/powerpc/platforms/pseries/processor_idle.c
@@ -0,0 +1,210 @@
+/*
+ * processor_idle - idle state cpuidle driver.
+ * Adapted from drivers/acpi/processor_idle.c
+ *
+ * Arun R Bharadwaj <[email protected]>
+ *
+ * Copyright (C) 2009 IBM Corporation.
+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or (at
+ * your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+#include <linux/cpuidle.h>
+
+#include <asm/paca.h>
+#include <asm/reg.h>
+#include <asm/system.h>
+#include <asm/machdep.h>
+#include <asm/firmware.h>
+
+#include "plpar_wrappers.h"
+#include "pseries.h"
+
+MODULE_AUTHOR("Arun R Bharadwaj");
+MODULE_DESCRIPTION("pSeries Idle State Driver");
+MODULE_LICENSE("GPL");
+
+struct cpuidle_driver pseries_idle_driver = {
+ .name = "pseries_idle",
+ .owner = THIS_MODULE,
+};
+
+DEFINE_PER_CPU(struct cpuidle_device, pseries_dev);
+
+#define IDLE_STATE_COUNT 2
+
+/* pSeries Idle state Flags */
+#define PSERIES_DEDICATED_SNOOZE (0x01)
+#define PSERIES_DEDICATED_CEDE (0x02)
+#define PSERIES_SHARED_CEDE (0x03)
+
+static int pseries_idle_init(struct cpuidle_device *dev)
+{
+ return cpuidle_register_device(dev);
+}
+
+static void shared_cede_loop(void)
+{
+ get_lppaca()->idle = 1;
+ cede_processor();
+ get_lppaca()->idle = 0;
+}
+
+static void dedicated_snooze_loop(void)
+{
+ local_irq_enable();
+ set_thread_flag(TIF_POLLING_NRFLAG);
+ while (!need_resched()) {
+ ppc64_runlatch_off();
+ HMT_low();
+ HMT_very_low();
+ }
+ HMT_medium();
+ clear_thread_flag(TIF_POLLING_NRFLAG);
+ smp_mb();
+ local_irq_disable();
+}
+
+static void dedicated_cede_loop(void)
+{
+ ppc64_runlatch_off();
+ HMT_medium();
+ cede_processor();
+}
+
+static int pseries_cpuidle_loop(struct cpuidle_device *dev,
+ struct cpuidle_state *st)
+{
+ ktime_t t1, t2;
+ s64 diff;
+ int ret;
+ unsigned long in_purr, out_purr;
+
+ get_lppaca()->idle = 1;
+ get_lppaca()->donate_dedicated_cpu = 1;
+ in_purr = mfspr(SPRN_PURR);
+
+ t1 = ktime_get();
+
+ if (st->flags == PSERIES_SHARED_CEDE)
+ shared_cede_loop();
+ else if (st->flags == PSERIES_DEDICATED_SNOOZE)
+ dedicated_snooze_loop();
+ else
+ dedicated_cede_loop();
+
+ t2 = ktime_get();
+ diff = ktime_to_us(ktime_sub(t2, t1));
+ if (diff > INT_MAX)
+ diff = INT_MAX;
+
+ ret = (int) diff;
+
+ out_purr = mfspr(SPRN_PURR);
+ get_lppaca()->wait_state_cycles += out_purr - in_purr;
+ get_lppaca()->donate_dedicated_cpu = 0;
+ get_lppaca()->idle = 0;
+
+ return ret;
+}
+
+static int pseries_setup_cpuidle(struct cpuidle_device *dev, int cpu)
+{
+ int i;
+ struct cpuidle_state *state;
+
+ dev->cpu = cpu;
+
+ if (get_lppaca()->shared_proc) {
+ state = &dev->states[0];
+ snprintf(state->name, CPUIDLE_NAME_LEN, "IDLE");
+ state->enter = pseries_cpuidle_loop;
+ strncpy(state->desc, "shared_cede", CPUIDLE_DESC_LEN);
+ state->flags = PSERIES_SHARED_CEDE;
+ state->exit_latency = 0;
+ state->target_residency = 0;
+ return 0;
+ }
+
+ for (i = 0; i < IDLE_STATE_COUNT; i++) {
+ state = &dev->states[i];
+
+ snprintf(state->name, CPUIDLE_NAME_LEN, "CEDE%d", i);
+ state->enter = pseries_cpuidle_loop;
+
+ switch (i) {
+ case 0:
+ strncpy(state->desc, "snooze", CPUIDLE_DESC_LEN);
+ state->flags = PSERIES_DEDICATED_SNOOZE;
+ state->exit_latency = 0;
+ state->target_residency = 0;
+ break;
+
+ case 1:
+ strncpy(state->desc, "cede", CPUIDLE_DESC_LEN);
+ state->flags = PSERIES_DEDICATED_CEDE;
+ state->exit_latency = 1;
+ state->target_residency =
+ __get_cpu_var(smt_snooze_delay);
+ break;
+ }
+ }
+ dev->state_count = IDLE_STATE_COUNT;
+
+ return 0;
+}
+
+void update_smt_snooze_delay(int snooze)
+{
+ int cpu;
+ for_each_online_cpu(cpu)
+ per_cpu(pseries_dev, cpu).states[0].target_residency = snooze;
+}
+
+static int __init pseries_processor_idle_init(void)
+{
+ int cpu;
+ int result;
+
+ result = cpuidle_register_driver(&pseries_idle_driver);
+
+ if (result < 0)
+ return result;
+
+ printk(KERN_DEBUG "pSeries idle driver registered\n");
+
+ if (!firmware_has_feature(FW_FEATURE_SPLPAR)) {
+ printk(KERN_DEBUG "Using default idle\n");
+ return 0;
+ }
+
+ for_each_online_cpu(cpu) {
+ pseries_setup_cpuidle(&per_cpu(pseries_dev, cpu), cpu);
+ pseries_idle_init(&per_cpu(pseries_dev, cpu));
+ }
+
+ printk(KERN_DEBUG "Using cpuidle idle loop\n");
+
+ return 0;
+}
+
+device_initcall(pseries_processor_idle_init);
Index: linux.trees.git/arch/powerpc/include/asm/system.h
===================================================================
--- linux.trees.git.orig/arch/powerpc/include/asm/system.h
+++ linux.trees.git/arch/powerpc/include/asm/system.h
@@ -548,6 +548,7 @@ extern void account_system_vtime(struct
extern struct dentry *powerpc_debugfs_root;

void cpu_idle_wait(void);
+extern void update_smt_snooze_delay(int snooze);

#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_SYSTEM_H */
Index: linux.trees.git/arch/powerpc/kernel/sysfs.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/kernel/sysfs.c
+++ linux.trees.git/arch/powerpc/kernel/sysfs.c
@@ -18,6 +18,7 @@
#include <asm/machdep.h>
#include <asm/smp.h>
#include <asm/pmc.h>
+#include <asm/system.h>

#include "cacheinfo.h"

@@ -51,6 +52,7 @@ static ssize_t store_smt_snooze_delay(st
return -EINVAL;

per_cpu(smt_snooze_delay, cpu->sysdev.id) = snooze;
+ update_smt_snooze_delay(snooze);

return count;
}

2009-10-08 09:56:52

by Arun R Bharadwaj

[permalink] [raw]
Subject: [v8 PATCH 8/8]: POWER: Enable default_idle when powersave=off.

* Arun R Bharadwaj <[email protected]> [2009-10-08 15:18:28]:

This patch enables default_idle when the powersave=off kernel boot
option is specified.

Earlier, this was done by setting ppc_md.power_save = NULL, so that
HMT_low() and HMT_very_low() were called instead. That code now lives in
default_idle(), so setting boot_option_idle_override = 1 skips the
cpuidle registration, and default_idle is chosen in cpuidle_idle_call().

Signed-off-by: Arun R Bharadwaj <[email protected]>
---
arch/powerpc/include/asm/processor.h | 2 ++
arch/powerpc/kernel/idle.c | 4 +++-
arch/powerpc/platforms/pseries/processor_idle.c | 5 +++++
3 files changed, 10 insertions(+), 1 deletion(-)

Index: linux.trees.git/arch/powerpc/include/asm/processor.h
===================================================================
--- linux.trees.git.orig/arch/powerpc/include/asm/processor.h
+++ linux.trees.git/arch/powerpc/include/asm/processor.h
@@ -332,6 +332,8 @@ static inline unsigned long get_clean_sp
}
#endif

+extern int boot_option_idle_override;
+
#endif /* __KERNEL__ */
#endif /* __ASSEMBLY__ */
#endif /* _ASM_POWERPC_PROCESSOR_H */
Index: linux.trees.git/arch/powerpc/kernel/idle.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/kernel/idle.c
+++ linux.trees.git/arch/powerpc/kernel/idle.c
@@ -40,9 +40,11 @@
#define cpu_should_die() 0
#endif

+int boot_option_idle_override = 0;
+
static int __init powersave_off(char *arg)
{
- ppc_md.power_save = NULL;
+ boot_option_idle_override = 1;
return 0;
}
__setup("powersave=off", powersave_off);
Index: linux.trees.git/arch/powerpc/platforms/pseries/processor_idle.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/platforms/pseries/processor_idle.c
+++ linux.trees.git/arch/powerpc/platforms/pseries/processor_idle.c
@@ -185,6 +185,11 @@ static int __init pseries_processor_idle
int cpu;
int result;

+ if (boot_option_idle_override) {
+ printk(KERN_DEBUG "Using default idle\n");
+ return 0;
+ }
+
result = cpuidle_register_driver(&pseries_idle_driver);

if (result < 0)

2009-10-08 10:33:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [v8 PATCH 2/8]: cpuidle: implement a list based approach to register a set of idle routines.

On Thu, 2009-10-08 at 15:20 +0530, Arun R Bharadwaj wrote:
> * Arun R Bharadwaj <[email protected]> [2009-10-08 15:18:28]:
>
> Implement a list based registering mechanism for architectures which
> have multiple sets of idle routines which are to be registered.
>
> Currently, in x86 it is done by merely setting pm_idle = idle_routine
> and managing this pm_idle pointer is messy.
>
> To give an example of how this mechanism works:
> In x86, initially, idle routine is selected from the set of poll/mwait/
> c1e/default idle loops. So the selected idle loop is registered in cpuidle
> as one idle state cpuidle devices. Once ACPI comes up, it registers
> another set of idle states on top of this state. Again, suppose a module
> registers another set of idle loops, it is added to this list.
>
> This provides a clean way of registering and unregistering idle state
> routines.

So cpuidle didn't already have a list of idle functions it takes an
appropriate one from?

Then what does this governor do?

Also, does this imply the governor doesn't consider these idle routines?

2009-10-08 10:43:36

by Arun R Bharadwaj

[permalink] [raw]
Subject: Re: [v8 PATCH 2/8]: cpuidle: implement a list based approach to register a set of idle routines.

* Peter Zijlstra <[email protected]> [2009-10-08 12:36:02]:

> On Thu, 2009-10-08 at 15:20 +0530, Arun R Bharadwaj wrote:
> > * Arun R Bharadwaj <[email protected]> [2009-10-08 15:18:28]:
> >
> > Implement a list based registering mechanism for architectures which
> > have multiple sets of idle routines which are to be registered.
> >
> > Currently, in x86 it is done by merely setting pm_idle = idle_routine
> > and managing this pm_idle pointer is messy.
> >
> > To give an example of how this mechanism works:
> > In x86, initially, idle routine is selected from the set of poll/mwait/
> > c1e/default idle loops. So the selected idle loop is registered in cpuidle
> > as one idle state cpuidle devices. Once ACPI comes up, it registers
> > another set of idle states on top of this state. Again, suppose a module
> > registers another set of idle loops, it is added to this list.
> >
> > This provides a clean way of registering and unregistering idle state
> > routines.
>
> So cpuidle didn't already have a list of idle functions it takes an
> appropriate one from?
>

No. As of now, cpuidle supports only one _set_ of idle states that
can be registered, and it chooses the appropriate idle state from within
that one set. This list mechanism (actually a stack) allows for
multiple sets.

This is needed because we have a hierarchy of idle-state discovery
on x86. First, select_idle_routine() selects poll/mwait/default/c1e;
it doesn't know of the existence of ACPI. Later, when ACPI comes up,
it registers a set of routines on top of the earlier set.

> Then what does this governor do?
>

The governor would only select the best idle state available from the
set of states which is at the top of the stack. (In the above case, it
would only consider the states registered by ACPI).

If the top-of-the-stack set of idle states is unregistered, the next
set of states on the stack are considered.

> Also, does this imply the governor doesn't consider these idle routines?
>

As I said above, the governor only considers the idle routines which
are at the top of the stack.

Hope this gives a better idea.

2009-10-08 10:47:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [v8 PATCH 2/8]: cpuidle: implement a list based approach to register a set of idle routines.

On Thu, 2009-10-08 at 16:12 +0530, Arun R Bharadwaj wrote:
>
> > So cpuidle didn't already have a list of idle functions it takes an
> > appropriate one from?
> >
>
> No.. As of now, cpuidle supported only one _set_ of idle states that
> can be registered. So in this one set, it would choose the appropriate
> idle state. But this list mechanism(actually a stack) allows for
> multiple sets.
>
> This is needed because we have a hierarchy of idle states discovery
> in x86. First, select_idle_routine() would select poll/mwait/default/c1e.
> It doesn't know of existance of ACPI. Later when ACPI comes up,
> it registers a set of routines on top of the earlier set.
>
> > Then what does this governor do?
> >
>
> The governor would only select the best idle state available from the
> set of states which is at the top of the stack. (In the above case, it
> would only consider the states registered by ACPI).
>
> If the top-of-the-stack set of idle states is unregistered, the next
> set of states on the stack are considered.
>
> > Also, does this imply the governor doesn't consider these idle routines?
> >
>
> As i said above, governor would only consider the idle routines which
> are at the top of the stack.
>
> Hope this gave a better idea..

So does it make sense to have a set of sets?

Why not integrate them all into one set to be ruled by this governor
thing?

2009-10-08 11:01:53

by Arun R Bharadwaj

[permalink] [raw]
Subject: Re: [v8 PATCH 2/8]: cpuidle: implement a list based approach to register a set of idle routines.

* Peter Zijlstra <[email protected]> [2009-10-08 12:50:33]:

> On Thu, 2009-10-08 at 16:12 +0530, Arun R Bharadwaj wrote:
> >
> > > So cpuidle didn't already have a list of idle functions it takes an
> > > appropriate one from?
> > >
> >
> > No.. As of now, cpuidle supported only one _set_ of idle states that
> > can be registered. So in this one set, it would choose the appropriate
> > idle state. But this list mechanism(actually a stack) allows for
> > multiple sets.
> >
> > This is needed because we have a hierarchy of idle states discovery
> > in x86. First, select_idle_routine() would select poll/mwait/default/c1e.
> > It doesn't know of existance of ACPI. Later when ACPI comes up,
> > it registers a set of routines on top of the earlier set.
> >
> > > Then what does this governor do?
> > >
> >
> > The governor would only select the best idle state available from the
> > set of states which is at the top of the stack. (In the above case, it
> > would only consider the states registered by ACPI).
> >
> > If the top-of-the-stack set of idle states is unregistered, the next
> > set of states on the stack are considered.
> >
> > > Also, does this imply the governor doesn't consider these idle routines?
> > >
> >
> > As i said above, governor would only consider the idle routines which
> > are at the top of the stack.
> >
> > Hope this gave a better idea..
>
> So does it make sense to have a set of sets?
>
> Why not integrate them all into one set to be ruled by this governor
> thing?
>

Right now there is a clean hierarchy. Breaking that would mean
putting the registration of all idle routines under ACPI. So, if ACPI
fails to come up, or if ACPI is not supported, that would lead to
problems; whereas if that happens now, we can fall back to the
initially registered set.

Also, if a module wants to register a set of routines later on, those
can't be added to the initially registered set. So I think we need this
set of sets.

2009-10-08 11:22:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [v8 PATCH 2/8]: cpuidle: implement a list based approach to register a set of idle routines.

On Thu, 2009-10-08 at 16:31 +0530, Arun R Bharadwaj wrote:
> * Peter Zijlstra <[email protected]> [2009-10-08 12:50:33]:
>
> > On Thu, 2009-10-08 at 16:12 +0530, Arun R Bharadwaj wrote:
> > >
> > > > So cpuidle didn't already have a list of idle functions it takes an
> > > > appropriate one from?
> > > >
> > >
> > > No.. As of now, cpuidle supported only one _set_ of idle states that
> > > can be registered. So in this one set, it would choose the appropriate
> > > idle state. But this list mechanism(actually a stack) allows for
> > > multiple sets.
> > >
> > > This is needed because we have a hierarchy of idle states discovery
> > > in x86. First, select_idle_routine() would select poll/mwait/default/c1e.
> > > It doesn't know of existance of ACPI. Later when ACPI comes up,
> > > it registers a set of routines on top of the earlier set.
> > >
> > > > Then what does this governor do?
> > > >
> > >
> > > The governor would only select the best idle state available from the
> > > set of states which is at the top of the stack. (In the above case, it
> > > would only consider the states registered by ACPI).
> > >
> > > If the top-of-the-stack set of idle states is unregistered, the next
> > > set of states on the stack are considered.
> > >
> > > > Also, does this imply the governor doesn't consider these idle routines?
> > > >
> > >
> > > As i said above, governor would only consider the idle routines which
> > > are at the top of the stack.
> > >
> > > Hope this gave a better idea..
> >
> > So does it make sense to have a set of sets?
> >
> > Why not integrate them all into one set to be ruled by this governor
> > thing?
> >
>
> Right now there is a clean hierarchy. So breaking that would mean
> putting the registration of all idle routines under ACPI.

Uhm, no, it would mean ACPI putting its idle routines on the same level
as all others.

> So, if ACPI
> fails to come up or if ACPI is not supported, that would lead to
> problems.

I think the problem is that ACPI thinks it's special, and that should be
rectified: it's not.

> Because if that happens now, we can fallback to the
> initially registered set.

I'm thinking it's all daft and we should be having one set of idle
routines; if ACPI fails (a tautology if ever there was one) we simply
wouldn't have its idle routines to pick from.

> Also, if a module wants to register a set of routines later on, that
> cant be added to the initially registered set. So i think we need this
> set of sets.

Sounds like something is wrong alright. If you can register an idle
routine you should be able to unregister it too.

What about making ACPI register its idle routines too, 1 for each C
state, and have the governor make a selection out of the full set?

That also allows you to do away with this default_idle() nonsense and
simply panic the box if there are no registered idle routines when it
wants to go idle.

2009-10-08 12:02:10

by Arun R Bharadwaj

[permalink] [raw]
Subject: Re: [v8 PATCH 2/8]: cpuidle: implement a list based approach to register a set of idle routines.

* Peter Zijlstra <[email protected]> [2009-10-08 13:25:10]:

> On Thu, 2009-10-08 at 16:31 +0530, Arun R Bharadwaj wrote:
> > * Peter Zijlstra <[email protected]> [2009-10-08 12:50:33]:
> >
> > > On Thu, 2009-10-08 at 16:12 +0530, Arun R Bharadwaj wrote:
> > > >
> > > > > So cpuidle didn't already have a list of idle functions it takes an
> > > > > appropriate one from?
> > > > >
> > > >
> > > > No.. As of now, cpuidle supported only one _set_ of idle states that
> > > > can be registered. So in this one set, it would choose the appropriate
> > > > idle state. But this list mechanism(actually a stack) allows for
> > > > multiple sets.
> > > >
> > > > This is needed because we have a hierarchy of idle states discovery
> > > > in x86. First, select_idle_routine() would select poll/mwait/default/c1e.
> > > > It doesn't know of existance of ACPI. Later when ACPI comes up,
> > > > it registers a set of routines on top of the earlier set.
> > > >
> > > > > Then what does this governor do?
> > > > >
> > > >
> > > > The governor would only select the best idle state available from the
> > > > set of states which is at the top of the stack. (In the above case, it
> > > > would only consider the states registered by ACPI).
> > > >
> > > > If the top-of-the-stack set of idle states is unregistered, the next
> > > > set of states on the stack are considered.
> > > >
> > > > > Also, does this imply the governor doesn't consider these idle routines?
> > > > >
> > > >
> > > > As i said above, governor would only consider the idle routines which
> > > > are at the top of the stack.
> > > >
> > > > Hope this gave a better idea..
> > >
> > > So does it make sense to have a set of sets?
> > >
> > > Why not integrate them all into one set to be ruled by this governor
> > > thing?
> > >
> >
> > Right now there is a clean hierarchy. So breaking that would mean
> > putting the registration of all idle routines under ACPI.
>
> Uhm, no, it would mean ACPI putting its idle routines on the same level
> as all others.
>

Putting them all on the same level would mean we need an
enable/disable routine to enable only the currently active routines.

Also, the governor assumes that idle routines are indexed in
increasing order of the power benefit that can be got out of each
state. So this ordering would get messed up.

> > So, if ACPI
> > fails to come up or if ACPI is not supported, that would lead to
> > problems.
>
> I think the problem is that ACPI is thinking its special, that should be
> rectified, its not.
>
> > Because if that happens now, we can fallback to the
> > initially registered set.
>
> I'm thinking its all daft and we should be having one set of idle
> routines, if ACPI fails (a tautology if ever there was one) we simply
> wouldn't have its idle routines to pick from.
>
> > Also, if a module wants to register a set of routines later on, that
> > cant be added to the initially registered set. So i think we need this
> > set of sets.
>
> Sounds like something is wrong alright. If you can register an idle
> routine you should be able to unregister it too.
>

Yes, we can register and unregister in a clean way now.
Consider this: we have a set of routines A, B, C currently registered.
Now a module comes along and registers D and E, and later on wants to
unregister them. How do you keep track of which idle routines the
module registered, and unregister only those?
The best way to do that is a stack, which is how I have currently
implemented it.

> What about making ACPI register its idle routines too, 1 for each C
> state, and have the governor make a selection out of the full set?
>
> That also allows you to do away with this default_idle() nonsense and
> simply panic the box when there are no registered idle routines when the
> box wants to go idle.
>

2009-10-08 12:22:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [v8 PATCH 2/8]: cpuidle: implement a list based approach to register a set of idle routines.

On Thu, 2009-10-08 at 17:31 +0530, Arun R Bharadwaj wrote:
>
> > Uhm, no, it would mean ACPI putting its idle routines on the same level
> > as all others.
> >
>
> Putting them all on the same level would mean, we need an
> enable/disable routine to enable only the currently active routines.

What's this enable/disable stuff about?

> Also, the way governor works is that, it assumes that idle routines
> are indexed in the increasing order of power benefit that can be got
> out of the state. So this would get messed up.

Right, which is why I initially had a power-savings field in my
proposal, so it could weight the power savings vs the wakeup latency.

http://lkml.org/lkml/2009/8/27/159

There it was said that was exactly what these governors were doing;
seems it's not.

> > Sounds like something is wrong alright. If you can register an idle
> > routine you should be able to unregister it too.
> >
>
> Yes, we can register and unregister in a clean way now.
> Consider this. We have a set of routines A, B, C currently registered.
> Now a module comes and registers D and E, and later on at some point
> of time wants to unregister. So how do you keep track of what all idle
> routines the module registered and unregister only those?
> Best way to do that is a stack, which is how I have currently
> implemented.

Right, so destroy that inner set thing, that way we only have one
left ;-)

2009-10-08 13:11:06

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [v8 PATCH 2/8]: cpuidle: implement a list based approach to register a set of idle routines.

* Peter Zijlstra <[email protected]> [2009-10-08 14:25:37]:

> On Thu, 2009-10-08 at 17:31 +0530, Arun R Bharadwaj wrote:
> >
> > > Uhm, no, it would mean ACPI putting its idle routines on the same level
> > > as all others.
> > >
> >
> > Putting them all on the same level would mean, we need an
> > enable/disable routine to enable only the currently active routines.
>
> What's this enable/disable stuff about?

Don't we need an explicit 'don't use this routine' mechanism, apart
from weights based on power consumption and latency? With multiple
registrations, the assumption is that the topmost 'set' has all the
necessary routines and we do not need any other idle routines for
optimum operation.

Otherwise the governor has to select from a larger 'set' which could
have redundant or conflicting idle routines.

For example, we now have c1e_idle to start with, and then a set of
ACPI C1, C2, C3 routines. The expectation is that once we have ACPI's
routines, we no longer need the previously used c1e_idle, because the
new set is self-contained and picking one from it based on latency is
good for power savings.

> > Also, the way governor works is that, it assumes that idle routines
> > are indexed in the increasing order of power benefit that can be got
> > out of the state. So this would get messed up.
>
> Right, which is why I initially had a power-savings field in my
> proposal, so it could weight the power savings vs the wakeup latency.
>
> http://lkml.org/lkml/2009/8/27/159

The proposal you suggested there is being used for the 'set' of idle
routines. The patch treats the idle routines as 'sets' and does not
mix routines between two registrations.

> There it was said that was exactly what these governors were doing,
> seems its not.

Governors select from a set of idle routines based on latency, but
any of the routines in the set has some probability of being
selected.
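That selection step can be sketched as a latency-bound pick over one
table of states, indexed in increasing order of power benefit as the
thread assumes (hypothetical field names, not the real struct
cpuidle_state or the menu/ladder governor code):

```c
#include <stddef.h>

/* Hypothetical model of a cpuidle state table. */
struct idle_state {
	const char *name;
	unsigned int exit_latency_us;	/* wakeup cost */
	unsigned int power_rank;	/* higher = more power saved */
};

/* States indexed in increasing order of power benefit. */
static const struct idle_state states[] = {
	{ "C1", 1, 1 },
	{ "C2", 20, 2 },
	{ "C3", 100, 3 },
};

/* Pick the deepest state whose exit latency fits the tolerance.
 * Returns an index into states[], or -1 if nothing fits. */
static int pick_state(unsigned int latency_limit_us)
{
	int i, best = -1;

	for (i = 0; i < (int)(sizeof(states) / sizeof(states[0])); i++)
		if (states[i].exit_latency_us <= latency_limit_us)
			best = i;	/* later entries save more power */
	return best;
}
```

Because every state under the latency bound is a candidate, any
routine in the set can in principle be chosen, which is the point
being made above.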

> > > Sounds like something is wrong alright. If you can register an idle
> > > routine you should be able to unregister it too.
> > >
> >
> > Yes, we can register and unregister in a clean way now.
> > Consider this. We have a set of routines A, B, C currently registered.
> > Now a module comes and registers D and E, and later on at some point
> > of time wants to unregister. So how do you keep track of what all idle
> > routines the module registered and unregister only those?
> > Best way to do that is a stack, which is how I have currently
> > implemented.
>
> Right, so destroy that inner set thing, that way we only have one
> left ;-)

If un-registration is not needed, then this framework can easily
override the current set with the new one and not worry about the set
of sets.

Ideally, during system boot, we could wait until we know enough about
the idle states and then do a single registration. The boot process
can stay in poll-idle until this decision happens. On x86, for
instance, we can register either c1e_idle or ACPI's routines at a
stage where we know for sure whether ACPI is enabled.
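That boot-time flow could look roughly like this (hypothetical helper
names; a sketch of the ordering only, not real kernel code): stay in
poll-idle until we know whether ACPI came up, then do exactly one
registration.

```c
#include <stdbool.h>

enum idle_backend { IDLE_POLL, IDLE_C1E, IDLE_ACPI };

static enum idle_backend active = IDLE_POLL;	/* boot default */
static bool registered;

/* One-shot registration once we know whether ACPI is enabled.
 * Returns false if a registration already happened. */
static bool finalize_idle_registration(bool acpi_ok)
{
	if (registered)
		return false;	/* single registration only */
	active = acpi_ok ? IDLE_ACPI : IDLE_C1E;
	registered = true;
	return true;
}
```

A single late registration like this would make the set-of-sets
bookkeeping unnecessary, at the cost of spending early boot in
poll-idle.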

--Vaidy

2009-10-09 09:40:22

by Arun R Bharadwaj

[permalink] [raw]
Subject: Re: [v8 PATCH 2/8]: cpuidle: implement a list based approach to register a set of idle routines.

* Peter Zijlstra <[email protected]> [2009-10-08 14:25:37]:

> On Thu, 2009-10-08 at 17:31 +0530, Arun R Bharadwaj wrote:
> >
> > > Uhm, no, it would mean ACPI putting its idle routines on the same level
> > > as all others.
> > >
> >
> > Putting them all on the same level would mean, we need an
> > enable/disable routine to enable only the currently active routines.
>
> What's this enable/disable stuff about?
>
> > Also, the way governor works is that, it assumes that idle routines
> > are indexed in the increasing order of power benefit that can be got
> > out of the state. So this would get messed up.
>
> Right, which is why I initially had a power-savings field in my
> proposal, so it could weight the power savings vs the wakeup latency.
>
> http://lkml.org/lkml/2009/8/27/159
>
> There it was said that was exactly what these governors were doing,
> seems its not.
>
> > > Sounds like something is wrong alright. If you can register an idle
> > > routine you should be able to unregister it too.
> > >
> >
> > Yes, we can register and unregister in a clean way now.
> > Consider this. We have a set of routines A, B, C currently registered.
> > Now a module comes and registers D and E, and later on at some point
> > of time wants to unregister. So how do you keep track of what all idle
> > routines the module registered and unregister only those?
> > Best way to do that is a stack, which is how I have currently
> > implemented.
>
> Right, so destroy that inner set thing, that way we only have one
> left ;-)
>

I'm not convinced by your argument. Why don't we do this
incrementally? Right now, the set-of-sets mechanism works fine and
doesn't have any obvious flaws. We have a clean register/unregister
mechanism which solves all the problems we originally set out to
solve.

We can gradually build on this and try to come up with a single set
of idle states for a governor to choose from.

thanks,
arun

2009-10-12 10:07:56

by Balbir Singh

[permalink] [raw]
Subject: Re: [v8 PATCH 0/8]: cpuidle: Cleanup cpuidle/ Introduce cpuidle to POWER

* Arun R B <[email protected]> [2009-10-08 15:18:28]:

> Hi
>
> Please consider this for inclusion into the testing tree.
>
> This patchset introduces cpuidle infrastructure to POWER, prototyping
> for pSeries, and also does a major refactoring of current x86 idle
> power management and a cleanup of cpuidle infrastructure.
>
> Earlier discussions on the same can be found at:
>
> v7 --> http://lkml.org/lkml/2009/10/6/278
> v6 --> http://lkml.org/lkml/2009/9/22/180
> v5 --> http://lkml.org/lkml/2009/9/22/26
> v4 --> http://lkml.org/lkml/2009/9/1/133
> v3 --> http://lkml.org/lkml/2009/8/27/124
> v2 --> http://lkml.org/lkml/2009/8/26/233
> v1 --> http://lkml.org/lkml/2009/8/19/150
>
>

I looked at the changes and they are beginning to look very good. I
have some minor comments about the comments in the individual
patches.

--
Balbir

2009-10-12 11:36:49

by Balbir Singh

[permalink] [raw]
Subject: Re: [v8 PATCH 1/8]: cpuidle: cleanup drivers/cpuidle/cpuidle.c

* Arun R B <[email protected]> [2009-10-08 15:19:42]:

> * Arun R Bharadwaj <[email protected]> [2009-10-08 15:18:28]:
>
> This patch cleans up drivers/cpuidle/cpuidle.c
> Earlier cpuidle assumed pm_idle as the default idle loop. Break that
> assumption and make it more generic. cpuidle_idle_call() which is the
> main idle loop of cpuidle is to be called by architectures which have
> registered to cpuidle.
>
> Remove routines cpuidle_install/uninstall_idle_handler() which are not
> needed anymore.
>
>

[snip]

/**
> - * cpuidle_install_idle_handler - installs the cpuidle idle loop handler
> - */
> -void cpuidle_install_idle_handler(void)
> -{
> - if (enabled_devices && (pm_idle != cpuidle_idle_call)) {
> - /* Make sure all changes finished before we switch to new idle */
> - smp_wmb();
> - pm_idle = cpuidle_idle_call;
> - }
> -}
> -
> -/**
> - * cpuidle_uninstall_idle_handler - uninstalls the cpuidle idle loop handler
> - */
> -void cpuidle_uninstall_idle_handler(void)
> -{
> - if (enabled_devices && pm_idle_old && (pm_idle != pm_idle_old)) {
> - pm_idle = pm_idle_old;
> - cpuidle_kick_cpus();
> - }
> -}
> -

I see the routines above being called from cpuidle_pause_and_lock()
and cpuidle_resume_and_unlock() below, and those are entered from ACPI
on ACPI_PROCESSOR_NOTIFY_POWER and from the hotplug path; could you
test them to make sure they are not broken? We also seem to be missing
a cpuidle_kick_cpus() in cpuidle_pause_and_lock().

[snip]

--
Balbir

2009-10-12 18:00:47

by Andi Kleen

[permalink] [raw]
Subject: Re: [v8 PATCH 2/8]: cpuidle: implement a list based approach to register a set of idle routines.

Peter Zijlstra <[email protected]> writes:
>
> So does it make sense to have a set of sets?
>
> Why not integrate them all into one set to be ruled by this governor
> thing?

cpuidle is currently optional; that is why the two-level hierarchy is
there, so that you can still have simple idle selection without it.

% size drivers/cpuidle/*.o
text data bss dec hex filename
5514 1416 44 6974 1b3e drivers/cpuidle/built-in.o

Adding it unconditionally would add ~7k to everyone who wants idle functions.

I think making it unconditional would require putting it on a serious
diet first.

-Andi
--
[email protected] -- Speaking for myself only.

2009-10-14 06:18:11

by Arun R Bharadwaj

[permalink] [raw]
Subject: Re: [v8 PATCH 2/8]: cpuidle: implement a list based approach to register a set of idle routines.

* Andi Kleen <[email protected]> [2009-10-12 20:00:05]:

> Peter Zijlstra <[email protected]> writes:
> >
> > So does it make sense to have a set of sets?
> >
> > Why not integrate them all into one set to be ruled by this governor
> > thing?
>
> cpuidle is currently optional, that is why the two level hierarchy
> is there so that you can still have simple idle selection without it.
>
> % size drivers/cpuidle/*.o
> text data bss dec hex filename
> 5514 1416 44 6974 1b3e drivers/cpuidle/built-in.o
>
> Adding it unconditionally would add ~7k to everyone who wants idle functions.
>
> I think making it unconditional would require putting it on a serious
> diet first.
>

Hi Andi,

Yes, this is a valid point.

How about something like this: if the arch does not enable
CONFIG_CPU_IDLE, the cpuidle_idle_call() which is called from
cpu_idle() should call default_idle() without involving the cpuidle
registration steps. This should prevent bloating the kernel for archs
which don't want to use cpuidle.

--arun
> -Andi
> --
> [email protected] -- Speaking for myself only.

2009-10-14 06:24:56

by Arun R Bharadwaj

[permalink] [raw]
Subject: Re: [v8 PATCH 1/8]: cpuidle: cleanup drivers/cpuidle/cpuidle.c

* Balbir Singh <[email protected]> [2009-10-12 17:06:02]:

> * Arun R B <[email protected]> [2009-10-08 15:19:42]:
>
> > * Arun R Bharadwaj <[email protected]> [2009-10-08 15:18:28]:
> >
> > This patch cleans up drivers/cpuidle/cpuidle.c
> > Earlier cpuidle assumed pm_idle as the default idle loop. Break that
> > assumption and make it more generic. cpuidle_idle_call() which is the
> > main idle loop of cpuidle is to be called by architectures which have
> > registered to cpuidle.
> >
> > Remove routines cpuidle_install/uninstall_idle_handler() which are not
> > needed anymore.
> >
> >
>
> [snip]
>
> /**
> > - * cpuidle_install_idle_handler - installs the cpuidle idle loop handler
> > - */
> > -void cpuidle_install_idle_handler(void)
> > -{
> > - if (enabled_devices && (pm_idle != cpuidle_idle_call)) {
> > - /* Make sure all changes finished before we switch to new idle */
> > - smp_wmb();
> > - pm_idle = cpuidle_idle_call;
> > - }
> > -}
> > -
> > -/**
> > - * cpuidle_uninstall_idle_handler - uninstalls the cpuidle idle loop handler
> > - */
> > -void cpuidle_uninstall_idle_handler(void)
> > -{
> > - if (enabled_devices && pm_idle_old && (pm_idle != pm_idle_old)) {
> > - pm_idle = pm_idle_old;
> > - cpuidle_kick_cpus();
> > - }
> > -}
> > -
>
> I see the routines above being called in from
> cpuidle_pause/resume_and_lock/unlock below and they are entries from
> ACPI on ACPI_PROCESSOR_NOTIFY_POWER and from the hotplug path, could
> you test them to make sure they are not broken. We also seem to be
> missing a cpuidle_kick_cpus() in cpuidle_pause_and_lock()
>
> [snip]
>

Hi Balbir,

Yes, we definitely need a cpuidle_kick_cpus() in
cpuidle_pause_and_lock(), since it is used while disabling the
cpuidle_device and the cpus need to be kicked out of their idle
states. I will test the modified code and see if it breaks hotplug.

thanks,
arun

> --
> Balbir

2009-10-14 07:19:19

by Andi Kleen

[permalink] [raw]
Subject: Re: [v8 PATCH 2/8]: cpuidle: implement a list based approach to register a set of idle routines.

> How about something like this..
> If the arch does not enable CONFIG_CPU_IDLE, the cpuidle_idle_call
> which is called from cpu_idle() should call default_idle without
> involving the registering cpuidle steps. This should prevent bloating
> up of the kernel for archs which dont want to use cpuidle.

On x86 some people want a small kernel too, so selecting it at
architecture granularity is not good. Also, you can make it the
default; you just need to slim it down first.

-Andi

--
[email protected] -- Speaking for myself only.

2009-10-15 06:07:37

by Arun R Bharadwaj

[permalink] [raw]
Subject: Re: [v8 PATCH 2/8]: cpuidle: implement a list based approach to register a set of idle routines.

* Andi Kleen <[email protected]> [2009-10-14 09:18:38]:

> > How about something like this..
> > If the arch does not enable CONFIG_CPU_IDLE, the cpuidle_idle_call
> > which is called from cpu_idle() should call default_idle without
> > involving the registering cpuidle steps. This should prevent bloating
> > up of the kernel for archs which dont want to use cpuidle.
>
> On x86 some people want small kernel too, so selecting it on a architecture
> granuality is not good. Also you can make it default, you just need
> to slim it down first.
>

No, I don't mean selecting it at an architecture granularity.

At compile time, if CONFIG_CPU_IDLE is disabled, the arch can redefine
cpuidle_idle_call(). For example, in arch/x86/kernel/process.c:

#ifndef CONFIG_CPU_IDLE
/* Fallback when cpuidle is compiled out: use the locally selected
 * idle routine, or default_idle() if none was set. */
void cpuidle_idle_call(void)
{
	if (local_idle)
		local_idle();
	else
		default_idle();
}
#endif

where local_idle points to the idle routine selected by
select_idle_routine(), which can be poll, mwait or c1e.

This way, we preserve exactly the same functionality as before, we
remove the ugly exported pm_idle function pointer, and we avoid
unnecessary code bloat for platforms that do not want to use cpuidle.

--arun