2011-03-22 12:32:25

by Trinabh Gupta

[permalink] [raw]
Subject: [RFC PATCH V4 0/5] cpuidle: Cleanup pm_idle and include driver/cpuidle.c in-kernel

Changes in V4:
* Implemented a cpuidle driver for Xen. Earlier, the pm_idle
function pointer was set to default_idle. Now a cpuidle_driver
encapsulating the idle routine is registered cleanly through
the cpuidle_register_driver() API.

* Implemented a cpuidle driver for apm_cpu_idle() as part of
the APM BIOS driver. Earlier, the APM BIOS driver would flip
the pm_idle function pointer to apm_cpu_idle.

* This patch series applies on top of 2.6.38 and was tested on an x86
system with multiple sleep states.

Goal:
This patch series aims to have cpuidle manage all idle routines
for x86. It removes the pm_idle function pointer, which causes the
problems discussed at http://lkml.org/lkml/2009/8/28/43 and
http://lkml.org/lkml/2009/8/28/50.

V1 of this series is at https://lkml.org/lkml/2010/10/19/449,
V2 is at https://lkml.org/lkml/2011/1/13/98 and
V3 is at https://lkml.org/lkml/2011/2/8/73

---

Trinabh Gupta (5):
cpuidle: cpuidle driver for apm
cpuidle: driver for xen
cpuidle: default idle driver for x86
cpuidle: list based cpuidle driver registration and selection
cpuidle: Remove pm_idle pointer for x86


arch/x86/Kconfig | 2
arch/x86/kernel/apm_32.c | 75 ++++++-
arch/x86/kernel/process.c | 339 --------------------------------
arch/x86/kernel/process_32.c | 4
arch/x86/kernel/process_64.c | 4
arch/x86/xen/setup.c | 42 ++++
drivers/acpi/processor_idle.c | 2
drivers/cpuidle/Kconfig | 9 +
drivers/cpuidle/cpuidle.c | 50 ++++-
drivers/cpuidle/driver.c | 114 ++++++++++-
drivers/idle/Makefile | 1
drivers/idle/default_driver.c | 437 +++++++++++++++++++++++++++++++++++++++++
include/linux/cpuidle.h | 3
13 files changed, 707 insertions(+), 375 deletions(-)
create mode 100644 drivers/idle/default_driver.c

--
-Trinabh


2011-03-22 12:32:44

by Trinabh Gupta

[permalink] [raw]
Subject: [RFC PATCH V4 1/5] cpuidle: Remove pm_idle pointer for x86

This patch removes the pm_idle function pointer and directly calls
cpuidle_idle_call() from the idle loop on x86. Hence cpuidle has
to be built into the kernel for x86, while it remains optional for
other archs.

Archs that still use pm_idle can continue to set
pm_idle = cpuidle_idle_call and co-exist. The
cpuidle_(un)install_idle_handler() routines in cpuidle.c keep
setting the pm_idle pointer on non-x86 architectures until those
are incrementally converted to call cpuidle_idle_call() directly
and stop using the pm_idle pointer.
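
For reference, on the architectures that keep using pm_idle the install
path stays roughly as it is today (sketch only, not part of this patch):

void cpuidle_install_idle_handler(void)
{
        if (enabled_devices && (pm_idle != cpuidle_idle_call)) {
                /* Make sure all changes finished before we switch to new idle */
                smp_wmb();
                pm_idle = cpuidle_idle_call;
        }
}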

Signed-off-by: Trinabh Gupta <[email protected]>
---

arch/x86/kernel/process_32.c | 4 +++-
arch/x86/kernel/process_64.c | 4 +++-
drivers/cpuidle/Kconfig | 3 ++-
drivers/cpuidle/cpuidle.c | 2 +-
4 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 8d12878..17b7101 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -74,6 +74,8 @@ static inline void play_dead(void)
}
#endif

+extern void cpuidle_idle_call(void);
+
/*
* The idle thread. There's no useful work to be
* done, so just try to conserve power and have a
@@ -109,7 +111,7 @@ void cpu_idle(void)
local_irq_disable();
/* Don't trace irqs off for idle */
stop_critical_timings();
- pm_idle();
+ cpuidle_idle_call();
start_critical_timings();
}
tick_nohz_restart_sched_tick();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index bd387e8..c736bf3 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -99,6 +99,8 @@ static inline void play_dead(void)
}
#endif

+extern void cpuidle_idle_call(void);
+
/*
* The idle thread. There's no useful work to be
* done, so just try to conserve power and have a
@@ -136,7 +138,7 @@ void cpu_idle(void)
enter_idle();
/* Don't trace irqs off for idle */
stop_critical_timings();
- pm_idle();
+ cpuidle_idle_call();
start_critical_timings();

/* In many cases the interrupt that ended idle
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index 7dbc4a8..e67c258 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -1,7 +1,8 @@

config CPU_IDLE
bool "CPU idle PM support"
- default ACPI
+ default y if X86
+ default y if ACPI
help
CPU idle is a generic framework for supporting software-controlled
idle processor power management. It includes modular cross-platform
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index bf50924..8baaa04 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -47,7 +47,7 @@ static int __cpuidle_register_device(struct cpuidle_device *dev);
*
* NOTE: no locks or semaphores should be used here
*/
-static void cpuidle_idle_call(void)
+void cpuidle_idle_call(void)
{
struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
struct cpuidle_state *target_state;

2011-03-22 12:32:51

by Trinabh Gupta

[permalink] [raw]
Subject: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection

A cpuidle_driver structure represents a cpuidle driver such as
acpi_idle or intel_idle that provides the low level idle routines.
A cpuidle_driver is global in nature, as it provides routines
for all the CPUs. Each CPU registered with the cpuidle subsystem is
represented as a cpuidle_device. A cpuidle_device structure
points to the low level idle routines for that CPU provided by
a certain driver. In other words, a cpuidle driver creates a
cpuidle_device structure for each CPU that it registers with the
cpuidle subsystem. Whenever the cpuidle idle loop is called, the cpuidle
subsystem picks the cpuidle_device structure for that CPU and
calls one of the low level idle routines through that structure.

In the current design, only one cpuidle_driver may be registered,
and registration of any subsequent driver fails. The same registered
driver provides the low level idle routines for every cpuidle_device.

List based registration of cpuidle drivers provides a clean
mechanism for multiple subsystems/modules to register their own idle
routines and thus avoids the need for pm_idle.

This patch implements list based registration for cpuidle
drivers. Different drivers can be registered, and the driver to
be used is selected based on a per-driver priority. On cpuidle
driver registration or unregistration, the cpuidle_device
structure for each CPU is switched accordingly. Each cpuidle_device
points to its driver to facilitate this registration and unregistration.
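
As an illustration, a driver registering under this scheme only needs to
fill in the new priority field; the driver name below is made up (sketch):

static struct cpuidle_driver my_idle_driver = {
        .name           = "my_idle",
        .owner          = THIS_MODULE,
        .priority       = 50,   /* lower value is preferred by select_cpuidle_driver() */
};

static int __init my_idle_init(void)
{
        return cpuidle_register_driver(&my_idle_driver);
}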

      ---------                        ---------
      |cpuidle|                        |cpuidle|
      |driver |------------------------|driver |
      |default|                        | acpi  |
      ---------                        ---------
          ^                                ^
          |                                |
    -------------                    -------------
    ^           ^                    ^           ^
    |           |                    |           |
    |           |                    |           |
---------   ---------            ---------   ---------
|cpuidle|   |cpuidle|            |cpuidle|   |cpuidle|
|device |   |device |   ....     |device |   |device |   .....
| CPU0  |   | CPU1  |            | CPU0  |   | CPU1  |
---------   ---------            ---------   ---------

Signed-off-by: Trinabh Gupta <[email protected]>
---

drivers/acpi/processor_idle.c | 2 +
drivers/cpuidle/Kconfig | 6 ++
drivers/cpuidle/cpuidle.c | 46 ++++++++++++++---
drivers/cpuidle/driver.c | 114 ++++++++++++++++++++++++++++++++++++++---
include/linux/cpuidle.h | 3 +
5 files changed, 157 insertions(+), 14 deletions(-)

diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index d615b7d..65dde00 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -971,6 +971,7 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
struct cpuidle_driver acpi_idle_driver = {
.name = "acpi_idle",
.owner = THIS_MODULE,
+ .priority = 10,
};

/**
@@ -992,6 +993,7 @@ static int acpi_processor_setup_cpuidle(struct acpi_processor *pr)
}

dev->cpu = pr->id;
+ dev->drv = &acpi_idle_driver;
for (i = 0; i < CPUIDLE_STATE_MAX; i++) {
dev->states[i].name[0] = '\0';
dev->states[i].desc[0] = '\0';
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index e67c258..8cc73c8 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -19,3 +19,9 @@ config CPU_IDLE_GOV_MENU
bool
depends on CPU_IDLE && NO_HZ
default y
+
+config ARCH_USES_PMIDLE
+ bool
+ depends on CPUIDLE
+ depends on !X86
+ default y
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 8baaa04..91ef1bf 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -116,6 +116,7 @@ void cpuidle_idle_call(void)
cpuidle_curr_governor->reflect(dev);
}

+#ifdef CONFIG_ARCH_USES_PMIDLE
/**
* cpuidle_install_idle_handler - installs the cpuidle idle loop handler
*/
@@ -138,6 +139,10 @@ void cpuidle_uninstall_idle_handler(void)
cpuidle_kick_cpus();
}
}
+#else
+void cpuidle_install_idle_handler(void) {}
+void cpuidle_uninstall_idle_handler(void) {}
+#endif

/**
* cpuidle_pause_and_lock - temporarily disables CPUIDLE
@@ -293,9 +298,21 @@ static int __cpuidle_register_device(struct cpuidle_device *dev)
struct sys_device *sys_dev = get_cpu_sysdev((unsigned long)dev->cpu);
struct cpuidle_driver *cpuidle_driver = cpuidle_get_driver();

+ if (dev->registered)
+ return 0;
if (!sys_dev)
return -EINVAL;
- if (!try_module_get(cpuidle_driver->owner))
+ /*
+ * This will maintain compatibility with old cpuidle_device
+ * structure where driver pointer is not set.
+ *
+ * To Do: Change all call sites to set dev->drv pointer.
+ * This can be changed to BUG() later in case somebody
+ * registers without a driver pointer.
+ */
+ if (!dev->drv)
+ dev->drv = cpuidle_driver;
+ if (!try_module_get(dev->drv->owner))
return -EINVAL;

init_completion(&dev->kobj_unregister);
@@ -320,10 +337,11 @@ static int __cpuidle_register_device(struct cpuidle_device *dev)
dev->states[i].power_usage = -1 - i;
}

- per_cpu(cpuidle_devices, dev->cpu) = dev;
+ if (cpuidle_driver == dev->drv)
+ per_cpu(cpuidle_devices, dev->cpu) = dev;
list_add(&dev->device_list, &cpuidle_detected_devices);
if ((ret = cpuidle_add_sysfs(sys_dev))) {
- module_put(cpuidle_driver->owner);
+ module_put(dev->drv->owner);
return ret;
}

@@ -374,13 +392,18 @@ void cpuidle_unregister_device(struct cpuidle_device *dev)
cpuidle_disable_device(dev);

cpuidle_remove_sysfs(sys_dev);
- list_del(&dev->device_list);
wait_for_completion(&dev->kobj_unregister);
- per_cpu(cpuidle_devices, dev->cpu) = NULL;
+ if (cpuidle_driver == dev->drv)
+ per_cpu(cpuidle_devices, dev->cpu) = NULL;

+ /*
+ * To Do: Ensure each CPU has exited the current
+ * idle routine which may belong to this module/driver.
+ * cpuidle_kick_cpus() equivalent is required.
+ */
cpuidle_resume_and_unlock();

- module_put(cpuidle_driver->owner);
+ module_put(dev->drv->owner);
}

EXPORT_SYMBOL_GPL(cpuidle_unregister_device);
@@ -420,6 +443,15 @@ static inline void latency_notifier_init(struct notifier_block *n)

#endif /* CONFIG_SMP */

+#ifdef CONFIG_ARCH_USES_PMIDLE
+static inline void save_pm_idle(void)
+{
+ pm_idle_old = pm_idle;
+}
+#else
+static inline void save_pm_idle(void) {}
+#endif
+
/**
* cpuidle_init - core initializer
*/
@@ -427,7 +459,7 @@ static int __init cpuidle_init(void)
{
int ret;

- pm_idle_old = pm_idle;
+ save_pm_idle();

ret = cpuidle_add_class_sysfs(&cpu_sysdev_class);
if (ret)
diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c
index fd1601e..c235286 100644
--- a/drivers/cpuidle/driver.c
+++ b/drivers/cpuidle/driver.c
@@ -14,8 +14,83 @@

#include "cpuidle.h"

+#define MAX_PRIORITY 1000
+#define DEFAULT_PRIORITY 100
+
static struct cpuidle_driver *cpuidle_curr_driver;
DEFINE_SPINLOCK(cpuidle_driver_lock);
+LIST_HEAD(registered_cpuidle_drivers);
+
+static struct cpuidle_driver *select_cpuidle_driver(void)
+{
+ struct cpuidle_driver *item = NULL, *selected = NULL;
+ unsigned int min_priority = MAX_PRIORITY;
+
+ list_for_each_entry(item, &registered_cpuidle_drivers,
+ driver_list) {
+ if (item->priority <= min_priority) {
+ selected = item;
+ min_priority = item->priority;
+ }
+ }
+ return selected;
+}
+
+static void set_current_cpuidle_driver(struct cpuidle_driver *drv)
+{
+ struct cpuidle_device *item = NULL;
+ if (drv == cpuidle_curr_driver)
+ return;
+
+ /* Unregister the previous drivers devices */
+ /* Do we need to take cpuidle_lock ? {un}register_device takes lock*/
+ if (cpuidle_curr_driver) {
+ list_for_each_entry(item, &cpuidle_detected_devices,
+ device_list) {
+ if (item->drv == cpuidle_curr_driver)
+ cpuidle_unregister_device(item);
+ }
+ }
+
+ cpuidle_curr_driver = drv;
+
+ if (drv == NULL)
+ return;
+ else {
+ /* Register the new driver devices */
+ list_for_each_entry(item, &cpuidle_detected_devices,
+ device_list) {
+ if (item->drv == drv)
+ cpuidle_register_device(item);
+ }
+ }
+}
+
+static inline int arch_allow_register(void)
+{
+#ifdef CONFIG_ARCH_USES_PMIDLE
+ if (cpuidle_curr_driver)
+ return -EPERM;
+ else
+ return 0;
+#else
+ return 0;
+#endif
+}
+
+static inline int arch_allow_unregister(struct cpuidle_driver *drv)
+{
+#ifdef CONFIG_ARCH_USES_PMIDLE
+ if (drv != cpuidle_curr_driver) {
+ WARN(1, "invalid cpuidle_unregister_driver(%s)\n",
+ drv->name);
+ return -EPERM;
+ } else
+ return 0;
+#else
+ return 0;
+#endif
+}

/**
* cpuidle_register_driver - registers a driver
@@ -23,15 +98,29 @@ DEFINE_SPINLOCK(cpuidle_driver_lock);
*/
int cpuidle_register_driver(struct cpuidle_driver *drv)
{
+ struct cpuidle_driver *item = NULL;
if (!drv)
return -EINVAL;
+ if (drv->priority == 0)
+ drv->priority = DEFAULT_PRIORITY;

spin_lock(&cpuidle_driver_lock);
- if (cpuidle_curr_driver) {
+
+ if (arch_allow_register()) {
spin_unlock(&cpuidle_driver_lock);
return -EBUSY;
}
- cpuidle_curr_driver = drv;
+
+ /* Check if driver already registered */
+ list_for_each_entry(item, &registered_cpuidle_drivers, driver_list) {
+ if (item == drv) {
+ spin_unlock(&cpuidle_driver_lock);
+ return -EINVAL;
+ }
+ }
+
+ list_add(&drv->driver_list, &registered_cpuidle_drivers);
+ set_current_cpuidle_driver(select_cpuidle_driver());
spin_unlock(&cpuidle_driver_lock);

return 0;
@@ -54,14 +143,25 @@ EXPORT_SYMBOL_GPL(cpuidle_get_driver);
*/
void cpuidle_unregister_driver(struct cpuidle_driver *drv)
{
- if (drv != cpuidle_curr_driver) {
- WARN(1, "invalid cpuidle_unregister_driver(%s)\n",
- drv->name);
+ struct cpuidle_device *item = NULL;
+
+ if (arch_allow_unregister(drv))
return;
- }

spin_lock(&cpuidle_driver_lock);
- cpuidle_curr_driver = NULL;
+
+ /* Set some other driver as current */
+ list_del(&drv->driver_list);
+ set_current_cpuidle_driver(select_cpuidle_driver());
+
+ /* Delete all devices corresponding to this driver */
+ mutex_lock(&cpuidle_lock);
+ list_for_each_entry(item, &cpuidle_detected_devices, device_list) {
+ if (item->drv == drv)
+ list_del(&item->device_list);
+ }
+ mutex_unlock(&cpuidle_lock);
+
spin_unlock(&cpuidle_driver_lock);
}

diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 36719ea..a5f5fd0 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -90,6 +90,7 @@ struct cpuidle_device {
struct cpuidle_state *last_state;

struct list_head device_list;
+ struct cpuidle_driver *drv;
struct kobject kobj;
struct completion kobj_unregister;
void *governor_data;
@@ -119,6 +120,8 @@ static inline int cpuidle_get_last_residency(struct cpuidle_device *dev)
struct cpuidle_driver {
char name[CPUIDLE_NAME_LEN];
struct module *owner;
+ unsigned int priority;
+ struct list_head driver_list;
};

#ifdef CONFIG_CPU_IDLE

2011-03-22 12:33:28

by Trinabh Gupta

[permalink] [raw]
Subject: [RFC PATCH V4 3/5] cpuidle: default idle driver for x86

This default cpuidle driver parses the idle= boot parameter, selects
the optimal idle routine for x86 during bootup and registers it with
cpuidle. The code for the idle routines and the selection of the
optimal routine is moved out of arch/x86/kernel/process.c. At
module_init this default driver is registered with cpuidle, and on
non-ACPI platforms it continues to be used. On ACPI platforms the
acpi_idle driver replaces it at a later point during bootup. Until
this driver is registered, the architecture-supplied compile-time
default idle routine is called from within cpuidle_idle_call().
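
The fallback mentioned above would look roughly like this inside
cpuidle_idle_call(); this is only a sketch, the exact check and the name
of the compile-time default may differ:

void cpuidle_idle_call(void)
{
        struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);

        /* no driver has registered a device for this CPU yet */
        if (!dev || !dev->state_count) {
                default_idle();         /* compile-time arch default */
                return;
        }

        /* ... normal governor select + state->enter() path ... */
}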

Signed-off-by: Trinabh Gupta <[email protected]>
---

arch/x86/Kconfig | 2
arch/x86/kernel/process.c | 339 --------------------------------
drivers/cpuidle/cpuidle.c | 2
drivers/idle/Makefile | 1
drivers/idle/default_driver.c | 437 +++++++++++++++++++++++++++++++++++++++++
5 files changed, 440 insertions(+), 341 deletions(-)
create mode 100644 drivers/idle/default_driver.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d5ed94d..6c03c92 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -148,7 +148,7 @@ config RWSEM_XCHGADD_ALGORITHM
def_bool X86_XADD

config ARCH_HAS_CPU_IDLE_WAIT
- def_bool y
+ def_bool n

config GENERIC_CALIBRATE_DELAY
def_bool y
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index ff45541..27950dc 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -7,7 +7,6 @@
#include <linux/sched.h>
#include <linux/module.h>
#include <linux/pm.h>
-#include <linux/clockchips.h>
#include <linux/random.h>
#include <linux/user-return-notifier.h>
#include <linux/dmi.h>
@@ -330,344 +329,6 @@ long sys_execve(const char __user *name,
return error;
}

-/*
- * Idle related variables and functions
- */
-unsigned long boot_option_idle_override = IDLE_NO_OVERRIDE;
-EXPORT_SYMBOL(boot_option_idle_override);
-
-/*
- * Powermanagement idle function, if any..
- */
-void (*pm_idle)(void);
-EXPORT_SYMBOL(pm_idle);
-
-#ifdef CONFIG_X86_32
-/*
- * This halt magic was a workaround for ancient floppy DMA
- * wreckage. It should be safe to remove.
- */
-static int hlt_counter;
-void disable_hlt(void)
-{
- hlt_counter++;
-}
-EXPORT_SYMBOL(disable_hlt);
-
-void enable_hlt(void)
-{
- hlt_counter--;
-}
-EXPORT_SYMBOL(enable_hlt);
-
-static inline int hlt_use_halt(void)
-{
- return (!hlt_counter && boot_cpu_data.hlt_works_ok);
-}
-#else
-static inline int hlt_use_halt(void)
-{
- return 1;
-}
-#endif
-
-/*
- * We use this if we don't have any better
- * idle routine..
- */
-void default_idle(void)
-{
- if (hlt_use_halt()) {
- trace_power_start(POWER_CSTATE, 1, smp_processor_id());
- trace_cpu_idle(1, smp_processor_id());
- current_thread_info()->status &= ~TS_POLLING;
- /*
- * TS_POLLING-cleared state must be visible before we
- * test NEED_RESCHED:
- */
- smp_mb();
-
- if (!need_resched())
- safe_halt(); /* enables interrupts racelessly */
- else
- local_irq_enable();
- current_thread_info()->status |= TS_POLLING;
- trace_power_end(smp_processor_id());
- trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
- } else {
- local_irq_enable();
- /* loop is done by the caller */
- cpu_relax();
- }
-}
-#ifdef CONFIG_APM_MODULE
-EXPORT_SYMBOL(default_idle);
-#endif
-
-void stop_this_cpu(void *dummy)
-{
- local_irq_disable();
- /*
- * Remove this CPU:
- */
- set_cpu_online(smp_processor_id(), false);
- disable_local_APIC();
-
- for (;;) {
- if (hlt_works(smp_processor_id()))
- halt();
- }
-}
-
-static void do_nothing(void *unused)
-{
-}
-
-/*
- * cpu_idle_wait - Used to ensure that all the CPUs discard old value of
- * pm_idle and update to new pm_idle value. Required while changing pm_idle
- * handler on SMP systems.
- *
- * Caller must have changed pm_idle to the new value before the call. Old
- * pm_idle value will not be used by any CPU after the return of this function.
- */
-void cpu_idle_wait(void)
-{
- smp_mb();
- /* kick all the CPUs so that they exit out of pm_idle */
- smp_call_function(do_nothing, NULL, 1);
-}
-EXPORT_SYMBOL_GPL(cpu_idle_wait);
-
-/*
- * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
- * which can obviate IPI to trigger checking of need_resched.
- * We execute MONITOR against need_resched and enter optimized wait state
- * through MWAIT. Whenever someone changes need_resched, we would be woken
- * up from MWAIT (without an IPI).
- *
- * New with Core Duo processors, MWAIT can take some hints based on CPU
- * capability.
- */
-void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
-{
- if (!need_resched()) {
- if (cpu_has(__this_cpu_ptr(&cpu_info), X86_FEATURE_CLFLUSH_MONITOR))
- clflush((void *)&current_thread_info()->flags);
-
- __monitor((void *)&current_thread_info()->flags, 0, 0);
- smp_mb();
- if (!need_resched())
- __mwait(ax, cx);
- }
-}
-
-/* Default MONITOR/MWAIT with no hints, used for default C1 state */
-static void mwait_idle(void)
-{
- if (!need_resched()) {
- trace_power_start(POWER_CSTATE, 1, smp_processor_id());
- trace_cpu_idle(1, smp_processor_id());
- if (cpu_has(__this_cpu_ptr(&cpu_info), X86_FEATURE_CLFLUSH_MONITOR))
- clflush((void *)&current_thread_info()->flags);
-
- __monitor((void *)&current_thread_info()->flags, 0, 0);
- smp_mb();
- if (!need_resched())
- __sti_mwait(0, 0);
- else
- local_irq_enable();
- trace_power_end(smp_processor_id());
- trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
- } else
- local_irq_enable();
-}
-
-/*
- * On SMP it's slightly faster (but much more power-consuming!)
- * to poll the ->work.need_resched flag instead of waiting for the
- * cross-CPU IPI to arrive. Use this option with caution.
- */
-static void poll_idle(void)
-{
- trace_power_start(POWER_CSTATE, 0, smp_processor_id());
- trace_cpu_idle(0, smp_processor_id());
- local_irq_enable();
- while (!need_resched())
- cpu_relax();
- trace_power_end(smp_processor_id());
- trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
-}
-
-/*
- * mwait selection logic:
- *
- * It depends on the CPU. For AMD CPUs that support MWAIT this is
- * wrong. Family 0x10 and 0x11 CPUs will enter C1 on HLT. Powersavings
- * then depend on a clock divisor and current Pstate of the core. If
- * all cores of a processor are in halt state (C1) the processor can
- * enter the C1E (C1 enhanced) state. If mwait is used this will never
- * happen.
- *
- * idle=mwait overrides this decision and forces the usage of mwait.
- */
-
-#define MWAIT_INFO 0x05
-#define MWAIT_ECX_EXTENDED_INFO 0x01
-#define MWAIT_EDX_C1 0xf0
-
-int mwait_usable(const struct cpuinfo_x86 *c)
-{
- u32 eax, ebx, ecx, edx;
-
- if (boot_option_idle_override == IDLE_FORCE_MWAIT)
- return 1;
-
- if (c->cpuid_level < MWAIT_INFO)
- return 0;
-
- cpuid(MWAIT_INFO, &eax, &ebx, &ecx, &edx);
- /* Check, whether EDX has extended info about MWAIT */
- if (!(ecx & MWAIT_ECX_EXTENDED_INFO))
- return 1;
-
- /*
- * edx enumeratios MONITOR/MWAIT extensions. Check, whether
- * C1 supports MWAIT
- */
- return (edx & MWAIT_EDX_C1);
-}
-
-bool c1e_detected;
-EXPORT_SYMBOL(c1e_detected);
-
-static cpumask_var_t c1e_mask;
-
-void c1e_remove_cpu(int cpu)
-{
- if (c1e_mask != NULL)
- cpumask_clear_cpu(cpu, c1e_mask);
-}
-
-/*
- * C1E aware idle routine. We check for C1E active in the interrupt
- * pending message MSR. If we detect C1E, then we handle it the same
- * way as C3 power states (local apic timer and TSC stop)
- */
-static void c1e_idle(void)
-{
- if (need_resched())
- return;
-
- if (!c1e_detected) {
- u32 lo, hi;
-
- rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
-
- if (lo & K8_INTP_C1E_ACTIVE_MASK) {
- c1e_detected = true;
- if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
- mark_tsc_unstable("TSC halt in AMD C1E");
- printk(KERN_INFO "System has AMD C1E enabled\n");
- }
- }
-
- if (c1e_detected) {
- int cpu = smp_processor_id();
-
- if (!cpumask_test_cpu(cpu, c1e_mask)) {
- cpumask_set_cpu(cpu, c1e_mask);
- /*
- * Force broadcast so ACPI can not interfere.
- */
- clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_FORCE,
- &cpu);
- printk(KERN_INFO "Switch to broadcast mode on CPU%d\n",
- cpu);
- }
- clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
-
- default_idle();
-
- /*
- * The switch back from broadcast mode needs to be
- * called with interrupts disabled.
- */
- local_irq_disable();
- clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
- local_irq_enable();
- } else
- default_idle();
-}
-
-void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c)
-{
-#ifdef CONFIG_SMP
- if (pm_idle == poll_idle && smp_num_siblings > 1) {
- printk_once(KERN_WARNING "WARNING: polling idle and HT enabled,"
- " performance may degrade.\n");
- }
-#endif
- if (pm_idle)
- return;
-
- if (cpu_has(c, X86_FEATURE_MWAIT) && mwait_usable(c)) {
- /*
- * One CPU supports mwait => All CPUs supports mwait
- */
- printk(KERN_INFO "using mwait in idle threads.\n");
- pm_idle = mwait_idle;
- } else if (cpu_has_amd_erratum(amd_erratum_400)) {
- /* E400: APIC timer interrupt does not wake up CPU from C1e */
- printk(KERN_INFO "using C1E aware idle routine\n");
- pm_idle = c1e_idle;
- } else
- pm_idle = default_idle;
-}
-
-void __init init_c1e_mask(void)
-{
- /* If we're using c1e_idle, we need to allocate c1e_mask. */
- if (pm_idle == c1e_idle)
- zalloc_cpumask_var(&c1e_mask, GFP_KERNEL);
-}
-
-static int __init idle_setup(char *str)
-{
- if (!str)
- return -EINVAL;
-
- if (!strcmp(str, "poll")) {
- printk("using polling idle threads.\n");
- pm_idle = poll_idle;
- boot_option_idle_override = IDLE_POLL;
- } else if (!strcmp(str, "mwait")) {
- boot_option_idle_override = IDLE_FORCE_MWAIT;
- } else if (!strcmp(str, "halt")) {
- /*
- * When the boot option of idle=halt is added, halt is
- * forced to be used for CPU idle. In such case CPU C2/C3
- * won't be used again.
- * To continue to load the CPU idle driver, don't touch
- * the boot_option_idle_override.
- */
- pm_idle = default_idle;
- boot_option_idle_override = IDLE_HALT;
- } else if (!strcmp(str, "nomwait")) {
- /*
- * If the boot option of "idle=nomwait" is added,
- * it means that mwait will be disabled for CPU C2/C3
- * states. In such case it won't touch the variable
- * of boot_option_idle_override.
- */
- boot_option_idle_override = IDLE_NOMWAIT;
- } else
- return -1;
-
- return 0;
-}
-early_param("idle", idle_setup);
-
unsigned long arch_align_stack(unsigned long sp)
{
if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space)
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 91ef1bf..7486e0f 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -34,7 +34,7 @@ static void cpuidle_kick_cpus(void)
{
cpu_idle_wait();
}
-#elif defined(CONFIG_SMP)
+#elif defined(CONFIG_SMP) && defined(CONFIG_ARCH_USES_PMIDLE)
# error "Arch needs cpu_idle_wait() equivalent here"
#else /* !CONFIG_ARCH_HAS_CPU_IDLE_WAIT && !CONFIG_SMP */
static void cpuidle_kick_cpus(void) {}
diff --git a/drivers/idle/Makefile b/drivers/idle/Makefile
index 23d295c..f16e866 100644
--- a/drivers/idle/Makefile
+++ b/drivers/idle/Makefile
@@ -1,3 +1,4 @@
obj-$(CONFIG_I7300_IDLE) += i7300_idle.o
obj-$(CONFIG_INTEL_IDLE) += intel_idle.o
+obj-$(CONFIG_X86) += default_driver.o

diff --git a/drivers/idle/default_driver.c b/drivers/idle/default_driver.c
new file mode 100644
index 0000000..21de286
--- /dev/null
+++ b/drivers/idle/default_driver.c
@@ -0,0 +1,437 @@
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/sched.h>
+#include <linux/cpuidle.h>
+#include <linux/clockchips.h>
+#include <linux/slab.h>
+#include <trace/events/power.h>
+#include <asm/mwait.h>
+
+
+/*
+ * Idle related variables and functions
+ */
+unsigned long boot_option_idle_override = IDLE_NO_OVERRIDE;
+EXPORT_SYMBOL(boot_option_idle_override);
+
+static struct cpuidle_state *opt_state;
+
+#ifdef CONFIG_X86_32
+/*
+ * This halt magic was a workaround for ancient floppy DMA
+ * wreckage. It should be safe to remove.
+ */
+static int hlt_counter;
+void disable_hlt(void)
+{
+ hlt_counter++;
+}
+EXPORT_SYMBOL(disable_hlt);
+
+void enable_hlt(void)
+{
+ hlt_counter--;
+}
+EXPORT_SYMBOL(enable_hlt);
+
+static inline int hlt_use_halt(void)
+{
+ return (!hlt_counter && boot_cpu_data.hlt_works_ok);
+}
+#else
+static inline int hlt_use_halt(void)
+{
+ return 1;
+}
+#endif
+
+/*
+ * We use this if we don't have any better
+ * idle routine..
+ */
+void default_idle(void)
+{
+ if (hlt_use_halt()) {
+ trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+ trace_cpu_idle(1, smp_processor_id());
+ current_thread_info()->status &= ~TS_POLLING;
+ /*
+ * TS_POLLING-cleared state must be visible before we
+ * test NEED_RESCHED:
+ */
+ smp_mb();
+
+ if (!need_resched())
+ safe_halt(); /* enables interrupts racelessly */
+ else
+ local_irq_enable();
+ current_thread_info()->status |= TS_POLLING;
+ trace_power_end(smp_processor_id());
+ trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
+ } else {
+ local_irq_enable();
+ /* loop is done by the caller */
+ cpu_relax();
+ }
+}
+#ifdef CONFIG_APM_MODULE
+EXPORT_SYMBOL(default_idle);
+#endif
+
+void stop_this_cpu(void *dummy)
+{
+ local_irq_disable();
+ /*
+ * Remove this CPU:
+ */
+ set_cpu_online(smp_processor_id(), false);
+ disable_local_APIC();
+
+ for (;;) {
+ if (hlt_works(smp_processor_id()))
+ halt();
+ }
+}
+
+/*
+ * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
+ * which can obviate IPI to trigger checking of need_resched.
+ * We execute MONITOR against need_resched and enter optimized wait state
+ * through MWAIT. Whenever someone changes need_resched, we would be woken
+ * up from MWAIT (without an IPI).
+ *
+ * New with Core Duo processors, MWAIT can take some hints based on CPU
+ * capability.
+ */
+void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
+{
+ if (!need_resched()) {
+ if (cpu_has(__this_cpu_ptr(&cpu_info), X86_FEATURE_CLFLUSH_MONITOR))
+ clflush((void *)&current_thread_info()->flags);
+
+ __monitor((void *)&current_thread_info()->flags, 0, 0);
+ smp_mb();
+ if (!need_resched())
+ __mwait(ax, cx);
+ }
+}
+
+/* Default MONITOR/MWAIT with no hints, used for default C1 state */
+static void mwait_idle(void)
+{
+ if (!need_resched()) {
+ trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+ trace_cpu_idle(1, smp_processor_id());
+ if (cpu_has(__this_cpu_ptr(&cpu_info), X86_FEATURE_CLFLUSH_MONITOR))
+ clflush((void *)&current_thread_info()->flags);
+
+ __monitor((void *)&current_thread_info()->flags, 0, 0);
+ smp_mb();
+ if (!need_resched())
+ __sti_mwait(0, 0);
+ else
+ local_irq_enable();
+ trace_power_end(smp_processor_id());
+ trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
+ } else
+ local_irq_enable();
+}
+
+/*
+ * On SMP it's slightly faster (but much more power-consuming!)
+ * to poll the ->work.need_resched flag instead of waiting for the
+ * cross-CPU IPI to arrive. Use this option with caution.
+ */
+static void poll_idle(void)
+{
+ trace_power_start(POWER_CSTATE, 0, smp_processor_id());
+ trace_cpu_idle(0, smp_processor_id());
+ local_irq_enable();
+ while (!need_resched())
+ cpu_relax();
+ trace_power_end(smp_processor_id());
+ trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
+}
+
+/*
+ * mwait selection logic:
+ *
+ * It depends on the CPU. For AMD CPUs that support MWAIT this is
+ * wrong. Family 0x10 and 0x11 CPUs will enter C1 on HLT. Powersavings
+ * then depend on a clock divisor and current Pstate of the core. If
+ * all cores of a processor are in halt state (C1) the processor can
+ * enter the C1E (C1 enhanced) state. If mwait is used this will never
+ * happen.
+ *
+ * idle=mwait overrides this decision and forces the usage of mwait.
+ */
+
+#define MWAIT_INFO 0x05
+#define MWAIT_ECX_EXTENDED_INFO 0x01
+#define MWAIT_EDX_C1 0xf0
+
+int mwait_usable(const struct cpuinfo_x86 *c)
+{
+ u32 eax, ebx, ecx, edx;
+
+ if (boot_option_idle_override == IDLE_FORCE_MWAIT)
+ return 1;
+
+ if (c->cpuid_level < MWAIT_INFO)
+ return 0;
+
+ cpuid(MWAIT_INFO, &eax, &ebx, &ecx, &edx);
+ /* Check, whether EDX has extended info about MWAIT */
+ if (!(ecx & MWAIT_ECX_EXTENDED_INFO))
+ return 1;
+
+ /*
+ * edx enumeratios MONITOR/MWAIT extensions. Check, whether
+ * C1 supports MWAIT
+ */
+ return (edx & MWAIT_EDX_C1);
+}
+
+bool c1e_detected;
+EXPORT_SYMBOL(c1e_detected);
+
+static cpumask_var_t c1e_mask;
+
+void c1e_remove_cpu(int cpu)
+{
+ if (c1e_mask != NULL)
+ cpumask_clear_cpu(cpu, c1e_mask);
+}
+
+/*
+ * C1E aware idle routine. We check for C1E active in the interrupt
+ * pending message MSR. If we detect C1E, then we handle it the same
+ * way as C3 power states (local apic timer and TSC stop)
+ */
+static void c1e_idle(void)
+{
+ if (need_resched())
+ return;
+
+ if (!c1e_detected) {
+ u32 lo, hi;
+
+ rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
+
+ if (lo & K8_INTP_C1E_ACTIVE_MASK) {
+ c1e_detected = true;
+ if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
+ mark_tsc_unstable("TSC halt in AMD C1E");
+ printk(KERN_INFO "System has AMD C1E enabled\n");
+ }
+ }
+
+ if (c1e_detected) {
+ int cpu = smp_processor_id();
+
+ if (!cpumask_test_cpu(cpu, c1e_mask)) {
+ cpumask_set_cpu(cpu, c1e_mask);
+ /*
+ * Force broadcast so ACPI can not interfere.
+ */
+ clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_FORCE,
+ &cpu);
+ printk(KERN_INFO "Switch to broadcast mode on CPU%d\n",
+ cpu);
+ }
+ clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
+
+ default_idle();
+
+ /*
+ * The switch back from broadcast mode needs to be
+ * called with interrupts disabled.
+ */
+ local_irq_disable();
+ clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
+ local_irq_enable();
+ } else
+ default_idle();
+}
+
+static int poll_idle_wrapper(struct cpuidle_device *dev,
+ struct cpuidle_state *state)
+{
+ poll_idle();
+ return 0;
+}
+
+static int mwait_idle_wrapper(struct cpuidle_device *dev,
+ struct cpuidle_state *state)
+{
+ mwait_idle();
+ return 0;
+}
+
+static int c1e_idle_wrapper(struct cpuidle_device *dev,
+ struct cpuidle_state *state)
+{
+ c1e_idle();
+ return 0;
+}
+
+int default_idle_wrapper(struct cpuidle_device *dev,
+ struct cpuidle_state *state)
+{
+ default_idle();
+ return 0;
+}
+
+static struct cpuidle_state state_poll = {
+ .name = "POLL",
+ .desc = "POLL",
+ .driver_data = (void *) 0x00,
+ .flags = CPUIDLE_FLAG_TIME_VALID,
+ .exit_latency = 1,
+ .target_residency = 1,
+ .enter = &poll_idle_wrapper,
+};
+
+static struct cpuidle_state state_mwait = {
+ .name = "C1",
+ .desc = "MWAIT No Hints",
+ .driver_data = (void *) 0x01,
+ .flags = CPUIDLE_FLAG_TIME_VALID,
+ .exit_latency = 1,
+ .target_residency = 1,
+ .enter = &mwait_idle_wrapper,
+};
+
+static struct cpuidle_state state_c1e = {
+ .name = "C1E",
+ .desc = "C1E",
+ .driver_data = (void *) 0x02,
+ .flags = CPUIDLE_FLAG_TIME_VALID,
+ .exit_latency = 1,
+ .target_residency = 1,
+ .enter = &c1e_idle_wrapper,
+};
+
+struct cpuidle_state state_default_idle = {
+ .name = "DEFAULT-IDLE",
+ .desc = "Default idle routine",
+ .driver_data = (void *) 0x03,
+ .flags = CPUIDLE_FLAG_TIME_VALID,
+ .exit_latency = 1,
+ .target_residency = 1,
+ .enter = &default_idle_wrapper,
+};
+
+void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_SMP
+ if (opt_state == &state_poll && smp_num_siblings > 1) {
+ printk_once(KERN_WARNING "WARNING: polling idle and HT enabled,"
+ " performance may degrade.\n");
+ }
+#endif
+ if (opt_state)
+ return;
+
+ if (cpu_has(c, X86_FEATURE_MWAIT) && mwait_usable(c)) {
+ /*
+ * One CPU supports mwait => All CPUs supports mwait
+ */
+ printk(KERN_INFO "using mwait in idle threads.\n");
+ opt_state = &state_mwait;
+ } else if (cpu_has_amd_erratum(amd_erratum_400)) {
+ /* E400: APIC timer interrupt does not wake up CPU from C1e */
+ printk(KERN_INFO "using C1E aware idle routine\n");
+ opt_state = &state_c1e;
+ } else
+ opt_state = &state_default_idle;
+}
+
+void __init init_c1e_mask(void)
+{
+ /* If we're using c1e_idle, we need to allocate c1e_mask. */
+ if (opt_state == &state_c1e)
+ zalloc_cpumask_var(&c1e_mask, GFP_KERNEL);
+}
+
+static int __init idle_setup(char *str)
+{
+ if (!str)
+ return -EINVAL;
+
+ if (!strcmp(str, "poll")) {
+ printk("using polling idle threads.\n");
+ opt_state = &state_poll;
+ boot_option_idle_override = IDLE_POLL;
+ } else if (!strcmp(str, "mwait")) {
+ boot_option_idle_override = IDLE_FORCE_MWAIT;
+ } else if (!strcmp(str, "halt")) {
+ /*
+ * When the boot option of idle=halt is added, halt is
+ * forced to be used for CPU idle. In such case CPU C2/C3
+ * won't be used again.
+ * To continue to load the CPU idle driver, don't touch
+ * the boot_option_idle_override.
+ */
+ opt_state = &state_default_idle;
+ boot_option_idle_override = IDLE_HALT;
+ } else if (!strcmp(str, "nomwait")) {
+ /*
+ * If the boot option of "idle=nomwait" is added,
+ * it means that mwait will be disabled for CPU C2/C3
+ * states. In such case it won't touch the variable
+ * of boot_option_idle_override.
+ */
+ boot_option_idle_override = IDLE_NOMWAIT;
+ } else
+ return -1;
+
+ return 0;
+}
+early_param("idle", idle_setup);
+
+static struct cpuidle_driver default_idle_driver = {
+ .name = "default_idle",
+ .owner = THIS_MODULE,
+ .priority = 100,
+};
+
+static int setup_cpuidle(int cpu)
+{
+ struct cpuidle_device *dev = kzalloc(sizeof(struct cpuidle_device),
+ GFP_KERNEL);
+ int count = CPUIDLE_DRIVER_STATE_START;
+ dev->cpu = cpu;
+ dev->drv = &default_idle_driver;
+
+ BUG_ON(opt_state == NULL);
+ dev->states[count] = *opt_state;
+ count++;
+
+ dev->state_count = count;
+
+ if (cpuidle_register_device(dev))
+ return -EIO;
+ return 0;
+}
+
+static int __init default_idle_init(void)
+{
+ int retval, i;
+ retval = cpuidle_register_driver(&default_idle_driver);
+
+ for_each_online_cpu(i) {
+ setup_cpuidle(i);
+ }
+
+ return 0;
+}
+
+
+static void __exit default_idle_exit(void)
+{
+ cpuidle_unregister_driver(&default_idle_driver);
+ return;
+}
+
+device_initcall(default_idle_init);

2011-03-22 12:33:37

by Trinabh Gupta

[permalink] [raw]
Subject: [RFC PATCH V4 4/5] cpuidle: driver for xen

This patch implements a default cpuidle driver for Xen.
Earlier, pm_idle was flipped to default_idle. Maybe there
is a better way to ensure default_idle is called
without using this cpuidle driver.

Signed-off-by: Trinabh Gupta <[email protected]>
---

arch/x86/xen/setup.c | 42 +++++++++++++++++++++++++++++++++++++++++-
1 files changed, 41 insertions(+), 1 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index a8a66a5..4fce4cd 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -9,6 +9,8 @@
#include <linux/mm.h>
#include <linux/pm.h>
#include <linux/memblock.h>
+#include <linux/cpuidle.h>
+#include <linux/slab.h>

#include <asm/elf.h>
#include <asm/vdso.h>
@@ -321,6 +323,44 @@ void __cpuinit xen_enable_syscall(void)
#endif /* CONFIG_X86_64 */
}

+static struct cpuidle_driver xen_idle_driver = {
+ .name = "xen_idle",
+ .owner = THIS_MODULE,
+ .priority = 1,
+};
+
+extern struct cpuidle_state state_default_state;
+
+static int setup_cpuidle(int cpu)
+{
+ struct cpuidle_device *dev = kzalloc(sizeof(struct cpuidle_device),
+ GFP_KERNEL);
+ int count = CPUIDLE_DRIVER_STATE_START;
+ dev->cpu = cpu;
+ dev->drv = &xen_idle_driver;
+
+ dev->states[count] = state_default_idle;
+ count++;
+
+ dev->state_count = count;
+
+ if (cpuidle_register_device(dev))
+ return -EIO;
+ return 0;
+}
+
+static int xen_idle_init(void)
+{
+ int retval, i;
+ retval = cpuidle_register_driver(&xen_idle_driver);
+
+ for_each_online_cpu(i) {
+ setup_cpuidle(i);
+ }
+
+ return 0;
+}
+
void __init xen_arch_setup(void)
{
xen_panic_handler_init();
@@ -354,7 +394,7 @@ void __init xen_arch_setup(void)
#ifdef CONFIG_X86_32
boot_cpu_data.hlt_works_ok = 1;
#endif
- pm_idle = default_idle;
+ xen_idle_init();
boot_option_idle_override = IDLE_HALT;

fiddle_vdso();

2011-03-22 12:33:49

by Trinabh Gupta

[permalink] [raw]
Subject: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm

Signed-off-by: Trinabh Gupta <[email protected]>
---

arch/x86/kernel/apm_32.c | 75 +++++++++++++++++++++++++++++++++++++---------
1 files changed, 60 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/apm_32.c b/arch/x86/kernel/apm_32.c
index 0e4f24c..22d40bf 100644
--- a/arch/x86/kernel/apm_32.c
+++ b/arch/x86/kernel/apm_32.c
@@ -882,8 +882,6 @@ static void apm_do_busy(void)
#define IDLE_CALC_LIMIT (HZ * 100)
#define IDLE_LEAKY_MAX 16

-static void (*original_pm_idle)(void) __read_mostly;
-
/**
* apm_cpu_idle - cpu idling for APM capable Linux
*
@@ -947,10 +945,7 @@ recalc:
break;
}
}
- if (original_pm_idle)
- original_pm_idle();
- else
- default_idle();
+ default_idle();
local_irq_disable();
jiffies_since_last_check = jiffies - last_jiffies;
if (jiffies_since_last_check > idle_period)
@@ -2256,6 +2251,63 @@ static struct dmi_system_id __initdata apm_dmi_table[] = {
{ }
};

+int apm_idle_wrapper(struct cpuidle_device *dev,
+ struct cpuidle_state *state)
+{
+ apm_cpu_idle();
+ return 0;
+}
+
+struct cpuidle_state state_apm_idle = {
+ .name = "APM-IDLE",
+ .desc = "APM idle routine",
+ .exit_latency = 1,
+ .target_residency = 1,
+ .enter = &apm_idle_wrapper,
+};
+
+static struct cpuidle_driver apm_idle_driver = {
+ .name = "apm_idle",
+ .owner = THIS_MODULE,
+ .priority = 1,
+};
+
+static int apm_setup_cpuidle(int cpu)
+{
+ struct cpuidle_device *dev = kzalloc(sizeof(struct cpuidle_device),
+ GFP_KERNEL);
+ int count = CPUIDLE_DRIVER_STATE_START;
+ dev->cpu = cpu;
+ dev->drv = &apm_idle_driver;
+
+ dev->states[count] = state_apm_idle;
+ count++;
+
+ dev->state_count = count;
+
+ if (cpuidle_register_device(dev))
+ return -EIO;
+ return 0;
+}
+
+static int apm_idle_init(void)
+{
+ int retval, i;
+ retval = cpuidle_register_driver(&apm_idle_driver);
+
+ for_each_online_cpu(i) {
+ apm_setup_cpuidle(i);
+ }
+
+ return 0;
+}
+
+static void apm_idle_exit(void)
+{
+ cpuidle_unregister_driver(&apm_idle_driver);
+ return;
+}
+
/*
* Just start the APM thread. We do NOT want to do APM BIOS
* calls from anything but the APM thread, if for no other reason
@@ -2393,8 +2445,7 @@ static int __init apm_init(void)
if (HZ != 100)
idle_period = (idle_period * HZ) / 100;
if (idle_threshold < 100) {
- original_pm_idle = pm_idle;
- pm_idle = apm_cpu_idle;
+ apm_idle_init();
set_pm_idle = 1;
}

@@ -2406,13 +2457,7 @@ static void __exit apm_exit(void)
int error;

if (set_pm_idle) {
- pm_idle = original_pm_idle;
- /*
- * We are about to unload the current idle thread pm callback
- * (pm_idle), Wait for all processors to update cached/local
- * copies of pm_idle before proceeding.
- */
- cpu_idle_wait();
+ apm_idle_exit();
}
if (((apm_info.bios.flags & APM_BIOS_DISENGAGED) == 0)
&& (apm_info.connection_version > 0x0100)) {

2011-03-22 14:53:08

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [RFC PATCH V4 4/5] cpuidle: driver for xen

On Tue, Mar 22, 2011 at 06:03:28PM +0530, Trinabh Gupta wrote:
> This patch implements a default cpuidle driver for xen.
> Earlier pm_idle was flipped to default_idle. Maybe there
> is a better way to ensure default_idle is called
> without using this cpuidle driver.

Please also CC the Xen devel mailing list (I did this for you)

I couldn't find it in the description, but I was wondering
what it is that you are trying to fix. What is the problem description?

Two, you mention in your description that it was tested on x86 systems. Did
you test this with Xen? If so, what version?

>
> Signed-off-by: Trinabh Gupta <[email protected]>
> ---
>
> arch/x86/xen/setup.c | 42 +++++++++++++++++++++++++++++++++++++++++-
> 1 files changed, 41 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
> index a8a66a5..4fce4cd 100644
> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -9,6 +9,8 @@
> #include <linux/mm.h>
> #include <linux/pm.h>
> #include <linux/memblock.h>
> +#include <linux/cpuidle.h>
> +#include <linux/slab.h>
>
> #include <asm/elf.h>
> #include <asm/vdso.h>
> @@ -321,6 +323,44 @@ void __cpuinit xen_enable_syscall(void)
> #endif /* CONFIG_X86_64 */
> }
>
> +static struct cpuidle_driver xen_idle_driver = {
> + .name = "xen_idle",
> + .owner = THIS_MODULE,
> + .priority = 1,
> +};
> +
> +extern struct cpuidle_state state_default_state;
> +
> +static int setup_cpuidle(int cpu)
> +{
> + struct cpuidle_device *dev = kzalloc(sizeof(struct cpuidle_device),
> + GFP_KERNEL);

No checking to see if dev == NULL?
> + int count = CPUIDLE_DRIVER_STATE_START;
> + dev->cpu = cpu;
> + dev->drv = &xen_idle_driver;
> +
> + dev->states[count] = state_default_idle;
> + count++;
> +
> + dev->state_count = count;
> +
> + if (cpuidle_register_device(dev))
> + return -EIO;
No cleanup of the 'dev' so that we don't leak memory?

> + return 0;
> +}
> +
> +static int xen_idle_init(void)
> +{
> + int retval, i;
> + retval = cpuidle_register_driver(&xen_idle_driver);
> +
> + for_each_online_cpu(i) {
> + setup_cpuidle(i);
> + }
> +
> + return 0;
> +}
> +
> void __init xen_arch_setup(void)
> {
> xen_panic_handler_init();
> @@ -354,7 +394,7 @@ void __init xen_arch_setup(void)
> #ifdef CONFIG_X86_32
> boot_cpu_data.hlt_works_ok = 1;
> #endif
> - pm_idle = default_idle;
> + xen_idle_init();
> boot_option_idle_override = IDLE_HALT;
>
> fiddle_vdso();
>

2011-03-23 01:00:58

by Stephen Rothwell

[permalink] [raw]
Subject: Re: [RFC PATCH V4 1/5] cpuidle: Remove pm_idle pointer for x86

Hi,

Just some simple comments.

Does having this patch first in the series break APM idle?

On Tue, 22 Mar 2011 18:02:27 +0530 Trinabh Gupta <[email protected]> wrote:
>
> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index 8d12878..17b7101 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -74,6 +74,8 @@ static inline void play_dead(void)
> }
> #endif
>
> +extern void cpuidle_idle_call(void);

Put this declaration in a header file and include that header file here
and in the file that defines that function.
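
i.e. something like this in include/linux/cpuidle.h, next to the other
cpuidle entry points (placement sketch only):

extern void cpuidle_idle_call(void);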

--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/



2011-03-23 01:15:09

by Stephen Rothwell

[permalink] [raw]
Subject: Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm

Hi,

On Tue, 22 Mar 2011 18:03:40 +0530 Trinabh Gupta <[email protected]> wrote:
>
> +static int apm_setup_cpuidle(int cpu)
> +{
> + struct cpuidle_device *dev = kzalloc(sizeof(struct cpuidle_device),
> + GFP_KERNEL);

Same as xen comments: no NULL check.

> + int count = CPUIDLE_DRIVER_STATE_START;
> + dev->cpu = cpu;
> + dev->drv = &apm_idle_driver;

Also wondering why you would ever have a different idle routine on
different cpus?

> +
> + dev->states[count] = state_apm_idle;
> + count++;
> +
> + dev->state_count = count;
> +
> + if (cpuidle_register_device(dev))
> + return -EIO;

And we leak dev.

> +static void apm_idle_exit(void)
> +{
> + cpuidle_unregister_driver(&apm_idle_driver);

Do we leak the per cpu device structure here?

> + return;

Unnecessary return statement.

--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/



2011-03-23 02:59:51

by Len Brown

[permalink] [raw]
Subject: Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection

The original cpuidle prototype supported multiple driver registration,
but no production use for it could be imagined, and so it was deleted.

Subsequently on x86, we added intel_idle to replace acpi_idle
and a typical kernel will have them both built in.
We still don't allow multiple registrations; we just arrange
affairs such that the preferred intel_idle probes before
the backup, acpi_idle. If intel_idle recognizes the platform,
its probe succeeds, else acpi_idle gets a go.
If there is a problem with intel_idle, or a comparison needs to be made,
a bootparam is available to tell intel_idle not to probe.

This mechanism takes approximately 10 lines of code -- the bootparam
to disable the preferred driver.

What is the benefit of all the code to support the feature of run-time
multiple driver registration and switching?

thanks,
Len Brown, Intel Open Source Technology Center

2011-03-23 03:13:53

by Len Brown

[permalink] [raw]
Subject: Re: [RFC PATCH V4 3/5] cpuidle: default idle driver for x86

Why is this patch a step forward?

> +obj-$(CONFIG_X86) += default_driver.o

BTW, that's a pretty generic name for an x86 specific idle driver...

I think that on builds that support intel_idle and acpi_idle,
everything in this file will be unused, unless somebody uses some
debugging cmdline params that should have been deleted ages ago.

thanks,
Len Brown, Intel Open Source Technology Center

2011-03-23 09:22:45

by Trinabh Gupta

[permalink] [raw]
Subject: Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection

Hi Len,

The goal of the patch series is to remove exported pm_idle function
pointer (see http://lkml.org/lkml/2009/8/28/43 and
http://lkml.org/lkml/2009/8/28/50 for problems related to pm_idle).
The first patch in the series removes pm_idle for x86 and we
now directly call cpuidle_idle_call as suggested by Arjan
(https://lkml.org/lkml/2010/10/19/453).

But we also have to replace the functionality provided by pm_idle,
i.e. call default_idle on platforms where no better idle routine
exists, use mwait on pre-Nehalem platforms, use intel_idle or
acpi_idle on Nehalem architectures, etc. To manage all this
we need a registration mechanism, which cpuidle conveniently
provides.

In theory I agree that we can maybe do without list based
registration, i.e. probe and pick the best driver for the platform,
but things may become less predictable and difficult to manage as
we get more and more platforms and drivers.
By calling directly into cpuidle, we already have the arch default
driver in addition to intel_idle and acpi_idle. Then APM and Xen
(though it uses default_idle) also have their own idle routines.
List based management and selection based on priority would provide
a cleaner solution.

Thanks,
-Trinabh

On 03/23/2011 08:29 AM, Len Brown wrote:
> the original cpuidle prototype supported multiple driver registration,
> but no production use for it could be imagined, and so it was deleted.
>
> Subsequently on x86, we added intel_idle to replace acpi_idle
> and a typical kernel will have them both built in.
> We still don't allow mutliple registrations, we just arrange
> affairs such that the preferred intel_idle probes before
> the backup, acpi_idle. If intel_idle recognizes the platform,
> its probe succeeds, else acpi_idle gets a go.
> If there is a problem with intel_idle, or a comparison needs to be made,
> a bootparam is available to tell intel_idle not to probe.
>
> This mechanism takes approximately 10 lines of code -- the bootparam
> to disable the preferred driver.
>
> What is the benefit of all the code to support the feature of run-time
> multiple driver registration and switching?
>
> thanks,
> Len Brown, Intel Open Source Technology Center
>

2011-03-23 09:31:19

by Trinabh Gupta

[permalink] [raw]
Subject: Re: [RFC PATCH V4 3/5] cpuidle: default idle driver for x86


On 03/23/2011 08:43 AM, Len Brown wrote:
> Why is this patch a step forward?

Hi Len,

I have basically moved the code for the arch default and mwait
idle routines from arch/x86/kernel/process.c into a driver. This was
suggested by Venki (https://lkml.org/lkml/2010/10/19/460)
as part of the pm_idle cleanup and the move to calling
cpuidle_idle_call() directly. There is not much new code here.

>
>> +obj-$(CONFIG_X86) += default_driver.o
>
> BTW, that's a pretty generic name for an x86 specific idle driver...
>
> I think that on builds that support intel_idle and acpi_idle,
> everything in this file will be unused, unless somebody uses some
> debugging cmdline params that should have been deleted ages ago.

Yes, I agree that the name has to be x86 specific. I think the
routines would still be used on pre-Nehalem architectures that use
the arch default or mwait.

Thanks,
-Trinabh

>
> thanks,
> Len Brown, Intel Open Source Technology Center
>

2011-03-23 09:57:41

by Trinabh Gupta

[permalink] [raw]
Subject: Re: [RFC PATCH V4 4/5] cpuidle: driver for xen



On 03/22/2011 08:20 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Mar 22, 2011 at 06:03:28PM +0530, Trinabh Gupta wrote:
>> This patch implements a default cpuidle driver for xen.
>> Earlier pm_idle was flipped to default_idle. Maybe there
>> is a better way to ensure default_idle is called
>> without using this cpuidle driver.
>

Hi Konrad,

> Please also CC the Xen devel mailing list (I did this for you)

Yes, I will. Thanks

>
> I couldn't find it in the description, but I was wondering
> what is that you are trying to fix? What is the problem description?

We are trying to remove the exported function pointer pm_idle, which
is set to the desired idle routine. For example, Xen sets it to
default_idle. Having an exported function pointer is not very safe.

The first problem is that this exported pointer can be modified/flipped
by any subsystem; there is no tracking or notification mechanism.
Secondly, and more importantly, various subsystems save the value of
this pointer, flip it and later restore it to the saved value. There is
no guarantee that the saved value is still valid (see
http://lkml.org/lkml/2009/8/28/43 and http://lkml.org/lkml/2009/8/28/50).
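
For example, this is the pattern the APM driver has used until now (the
code removed in patch 5/5); nothing guarantees that original_pm_idle is
still valid by the time it is restored:

        /* on load */
        original_pm_idle = pm_idle;
        pm_idle = apm_cpu_idle;

        /* ... later, on unload ... */
        pm_idle = original_pm_idle;
        cpu_idle_wait();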

The first patch of the series removes pm_idle, and we now call directly
into the cpuidle subsystem. As a result, every subsystem that was using
pm_idle has to implement a driver that can be registered with cpuidle.

>
> Two, you mention in your description that it was tested on x86 systems. did
> you test this with Xen? If so, what version.

The patches are still in RFC stage and haven't been tested with Xen.
Once we are clear on a particular solution, we will carefully
test the patches.

Thanks for the code review. Yes, I have missed a couple of things.
I will look at how to maintain per-CPU dev pointers and free the
memory; a rough sketch is below.
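
Something along these lines, perhaps (untested; the per-CPU variable
name is made up):

static DEFINE_PER_CPU(struct cpuidle_device *, xen_idle_devices);

static int setup_cpuidle(int cpu)
{
        struct cpuidle_device *dev = kzalloc(sizeof(*dev), GFP_KERNEL);

        if (!dev)
                return -ENOMEM;

        dev->cpu = cpu;
        dev->drv = &xen_idle_driver;
        dev->states[CPUIDLE_DRIVER_STATE_START] = state_default_idle;
        dev->state_count = CPUIDLE_DRIVER_STATE_START + 1;

        if (cpuidle_register_device(dev)) {
                kfree(dev);
                return -EIO;
        }

        /* remember the pointer so it can be freed on unregister */
        per_cpu(xen_idle_devices, cpu) = dev;
        return 0;
}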

Thanks,
-Trinabh

>
>>
>> Signed-off-by: Trinabh Gupta<[email protected]>
>> ---
>>
>> arch/x86/xen/setup.c | 42 +++++++++++++++++++++++++++++++++++++++++-
>> 1 files changed, 41 insertions(+), 1 deletions(-)
>>
>> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
>> index a8a66a5..4fce4cd 100644
>> --- a/arch/x86/xen/setup.c
>> +++ b/arch/x86/xen/setup.c
>> @@ -9,6 +9,8 @@
>> #include<linux/mm.h>
>> #include<linux/pm.h>
>> #include<linux/memblock.h>
>> +#include<linux/cpuidle.h>
>> +#include<linux/slab.h>
>>
>> #include<asm/elf.h>
>> #include<asm/vdso.h>
>> @@ -321,6 +323,44 @@ void __cpuinit xen_enable_syscall(void)
>> #endif /* CONFIG_X86_64 */
>> }
>>
>> +static struct cpuidle_driver xen_idle_driver = {
>> + .name = "xen_idle",
>> + .owner = THIS_MODULE,
>> + .priority = 1,
>> +};
>> +
>> +extern struct cpuidle_state state_default_state;
>> +
>> +static int setup_cpuidle(int cpu)
>> +{
>> + struct cpuidle_device *dev = kzalloc(sizeof(struct cpuidle_device),
>> + GFP_KERNEL);
>
> No checking to see if dev == NULL?
>> + int count = CPUIDLE_DRIVER_STATE_START;
>> + dev->cpu = cpu;
>> + dev->drv =&xen_idle_driver;
>> +
>> + dev->states[count] = state_default_idle;
>> + count++;
>> +
>> + dev->state_count = count;
>> +
>> + if (cpuidle_register_device(dev))
>> + return -EIO;
> No cleanup of the 'dev' so that we don't leak memory?
>
>> + return 0;
>> +}
>> +
>> +static int xen_idle_init(void)
>> +{
>> + int retval, i;
>> + retval = cpuidle_register_driver(&xen_idle_driver);
>> +
>> + for_each_online_cpu(i) {
>> + setup_cpuidle(i);
>> + }
>> +
>> + return 0;
>> +}
>> +
>> void __init xen_arch_setup(void)
>> {
>> xen_panic_handler_init();
>> @@ -354,7 +394,7 @@ void __init xen_arch_setup(void)
>> #ifdef CONFIG_X86_32
>> boot_cpu_data.hlt_works_ok = 1;
>> #endif
>> - pm_idle = default_idle;
>> + xen_idle_init();
>> boot_option_idle_override = IDLE_HALT;
>>
>> fiddle_vdso();
>>

2011-03-23 10:10:54

by Trinabh Gupta

[permalink] [raw]
Subject: Re: [RFC PATCH V4 1/5] cpuidle: Remove pm_idle pointer for x86

Hi Stephen,

On 03/23/2011 06:30 AM, Stephen Rothwell wrote:
> Hi,
>
> Just some simple comments.
>
> Does having this patch first in the series break APM idle?

Thanks for reviewing.
Yes, I think the removal of the "pm_idle = default_idle" statement in
APM may have to be moved into this patch.
>
> On Tue, 22 Mar 2011 18:02:27 +0530 Trinabh Gupta<[email protected]> wrote:
>>
>> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
>> index 8d12878..17b7101 100644
>> --- a/arch/x86/kernel/process_32.c
>> +++ b/arch/x86/kernel/process_32.c
>> @@ -74,6 +74,8 @@ static inline void play_dead(void)
>> }
>> #endif
>>
>> +extern void cpuidle_idle_call(void);
>
> Put this declaration in a header file and include that header file here
> and in the file that defines that function.
>

Ok, thanks
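
Something like this in include/linux/cpuidle.h, I suppose (sketch;
the stub for !CONFIG_CPU_IDLE is an assumption, not part of this
series):

#ifdef CONFIG_CPU_IDLE
extern void cpuidle_idle_call(void);
#else
static inline void cpuidle_idle_call(void) { }	/* hypothetical fallback */
#endif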

-Trinabh

2011-03-23 10:25:11

by Trinabh Gupta

[permalink] [raw]
Subject: Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm

Hi Stephen,

Thanks for reviewing.

On 03/23/2011 06:44 AM, Stephen Rothwell wrote:
> Hi,
>
> On Tue, 22 Mar 2011 18:03:40 +0530 Trinabh Gupta<[email protected]> wrote:
>>
>> +static int apm_setup_cpuidle(int cpu)
>> +{
>> + struct cpuidle_device *dev = kzalloc(sizeof(struct cpuidle_device),
>> + GFP_KERNEL);
>
> Same as xen comments: no NULL check.
>
>> + int count = CPUIDLE_DRIVER_STATE_START;
>> + dev->cpu = cpu;
>> + dev->drv = &apm_idle_driver;
>
> Also wondering why you would ever have a different idle routine on
> different cpus?

Yes, this is an ongoing debate. Apparently it is a possibility
because of ACPI bugs. CPUs can have asymmetric C-states and,
overall, different idle routines on different cpus. Please refer to
http://lkml.org/lkml/2009/9/24/132 and
https://lkml.org/lkml/2011/2/10/37 for a discussion around this.

I have posted a patch series that does global registration,
i.e. the same idle routines for each cpu. Please check
http://lkml.org/lkml/2011/3/22/161. That series applies on top of
this series. Global registration significantly simplifies the
design, but we are still not sure about the direction to take.

>
>> +
>> + dev->states[count] = state_apm_idle;
>> + count++;
>> +
>> + dev->state_count = count;
>> +
>> + if (cpuidle_register_device(dev))
>> + return -EIO;
>
> And we leak dev.
>
>> +static void apm_idle_exit(void)
>> +{
>> + cpuidle_unregister_driver(&apm_idle_driver);
>
> Do we leak the per cpu device structure here?

I will see how we can save
per cpu device structure pointers so that we can free them.

>
>> + return;
>
> Unnecessary return statement.
>

Thanks,
-Trinabh

2011-03-23 20:32:18

by Len Brown

[permalink] [raw]
Subject: Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm

> > Also wondering why you would ever have a different idle routine on
> > different cpus?
>
> Yes, this is an ongoing debate. Apparently it is a possibility
> because of ACPI bugs. CPU's can have asymmetric C-states
> and overall different idle routines on different cpus. Please
> refer to http://lkml.org/lkml/2009/9/24/132 and
> https://lkml.org/lkml/2011/2/10/37 for a discussion around this.

Although the ACPI specification allows the BIOS to tell the OS
about different C-states per processor, I know of zero systems
in the field and zero systems in development that require that
capability. That isn't a guarantee that the capability will never
be used, but I'm not holding my breath.

If there are systems with broken tables that make them
appear asymmetric, then we should have a workaround that handles
that case, rather than complicating the normal code for
the broken case.

So I recommend deleting the extra per-cpu registration stuff
unless there is some other architecture that requires it
and can't handle the asymmetry in another way.

> I have posted a patch series that does global registration
> i.e same idle routines for each cpu. Please check
> http://lkml.org/lkml/2011/3/22/161 . That series applies on
> top of this series. Global registration significantly
> simplifies the design, but still we are not sure about the
> direction to take.

I'll review that.

thanks,
Len Brown, Intel Open Source Technology Center

2011-03-23 20:51:44

by Len Brown

[permalink] [raw]
Subject: Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection

> The goal of the patch series is to remove exported pm_idle function
> pointer (see http://lkml.org/lkml/2009/8/28/43 and
> http://lkml.org/lkml/2009/8/28/50 for problems related to pm_idle).
> The first patch in the series removes pm_idle for x86 and we
> now directly call cpuidle_idle_call as suggested by Arjan
> (https://lkml.org/lkml/2010/10/19/453).

So the problem statement with "pm_idle" is that it is visible to
modules and thus potentially racy and unsafe?

Any reason we can't delete this line today to address most of the
concern?

EXPORT_SYMBOL(pm_idle);

> But we also have to replace the functionality provided by pm_idle,
> i.e. call default_idle for platforms where no better idle routine
> exists, call mwait for pre-nehalem platforms, use intel_idle or
> acpi_idle for nehalem architectures etc. To manage all this
> we need a registration mechanism which is conveniently provided
> by cpuidle.

It isn't immediately clear to me that all of these options
need to be preserved.

Are we suggesting that x86 must always build with cpuidle?
I'm sure that somebody someplace will object to that.

OTOH, if cpuidle is included, I'd like to see the
non-cpuidle code excluded, since nobody will run it...

> In theory I agree that we can maybe do without list based
> registration i.e probe and pick the best for the platform, but things
> may become less predictable and difficult to manage as
> we have more and more platforms and drivers.
> By directly calling into cpuidle, we already have arch default
> other than intel_idle and acpi_idle. Then APM and xen (though
> it uses default_idle) also have their own idle routines.
> List based management and selection based on priority would provide

Does anybody actually use the latest kernel in APM mode?
I'm not even sure what the last version of Windows that would talk
to APM was; it was whatever came before Windows 95, I think.

But don't get me wrong, I agree that pm_idle should go.
I agree that cpuidle should have a default other than
the polling loop it currently uses.
I just don't think we should spend a lot of code and time preserving
every conceivable option and feature. We should first
do some spring cleaning to see if we can simplify the problem.

thanks,
-Len Brown, Intel Open Source Technology Center

2011-03-24 04:28:32

by Len Brown

[permalink] [raw]
Subject: Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm

Does anybody still have an APM-capable box that can
run the latest kernel and test this code?

thanks,
Len Brown, Intel Open Source Technology Center

2011-03-24 04:41:51

by Len Brown

[permalink] [raw]
Subject: Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection

> Does anybody actually use the latest kernel in APM mode?

Google tells me that Windows 2000 still supported APM, and it was
still present in Windows XP. APM was not present in Windows Vista,
aka Windows 2006.

MS dropped support in their latest OS 5 years ago.
When will the latest Linux drop APM support?

-Len

2011-03-24 07:18:46

by Len Brown

[permalink] [raw]
Subject: Re: [RFC PATCH V4 4/5] cpuidle: driver for xen

Is a CONFIG_XEN kernel supposed to use just HLT in idle?

xen_arch_setup() does this:

pm_idle = default_idle;
boot_option_idle_override = IDLE_HALT;

which has that effect. I guess this makes sense b/c the
CONFIG_XEN kernel is Dom0 and the real C-states are done
by the hypervisor?

Would the same CONFIG_XEN kernel binary ever not
run xen_arch_setup(), run on raw hardware, and want
to use idle states other than HLT?

thanks,
-Len Brown, Intel Open Source Technology Center

2011-03-24 12:06:14

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [RFC PATCH V4 4/5] cpuidle: driver for xen

On Thu, Mar 24, 2011 at 03:18:14AM -0400, Len Brown wrote:
> Is a CONFIG_XEN kernel supposed to use just HLT in idle?

For right now..
>
> xen_arch_setup() does this:
>
> pm_idle = default_idle;
> boot_option_idle_override = IDLE_HALT;
>
> which has that effect. I guess this makes sense b/c the
> CONFIG_XEN kernel is Dom0 and the real C-sates are done
> by the hypervisor?

Correct. There are some patches that make the C-states visible in
the Linux kernel, but they haven't been ported over yet.

>
> Would the same CONFIG_XEN kernel binary ever not
> run xen_arch_setup(), run on raw hardware, and want

Ever not? I am not sure of the question, so let me state: the
Linux kernel, if compiled with CONFIG_XEN and run on native
hardware, would _never_ run 'xen_arch_setup()'*. It would run the
normal, native type setup.

> to use idle states other than HLT?
>

*: It could if you really, really wanted. You would need to change
GRUB2 to inject some extra data in the 'sub_hardware' flag to mark
it as Xen-specific.

2011-03-24 14:13:52

by Trinabh Gupta

[permalink] [raw]
Subject: Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection



On 03/24/2011 02:21 AM, Len Brown wrote:
>> The goal of the patch series is to remove exported pm_idle function
>> pointer (see http://lkml.org/lkml/2009/8/28/43 and
>> http://lkml.org/lkml/2009/8/28/50 for problems related to pm_idle).
>> The first patch in the series removes pm_idle for x86 and we
>> now directly call cpuidle_idle_call as suggested by Arjan
>> (https://lkml.org/lkml/2010/10/19/453).
>
> So the problem statement with "pm_idle" is that it is visible to modules
> and thus potentially racey and unsafe?
>
> Any reason we can't delete his line today to address most of the concern?

Hi Len,

I think there are other problems too, related to saving and
restoring the pm_idle pointer. For example, cpuidle itself saves the
current value of pm_idle, flips it and then restores the saved
value. There is no guarantee that the saved function still exists.
APM does the exact same thing (though it may not be used these days).

The problem also is that a number of architectures have copied the
same pm_idle-based design; so it's spreading.

>
> EXPORT_SYMBOL(pm_idle);
>
>> But we also have to replace the functionality provided by pm_idle,
>> i.e. call default_idle for platforms where no better idle routine
>> exists, call mwait for pre-nehalem platforms, use intel_idle or
>> acpi_idle for nehalem architectures etc. To manage all this
>> we need a registration mechanism which is conveniently provided
>> by cpuidle.
>
> It isn't immediately clear to me that all of these options
> need to be preserved.

So what do you suggest can be removed?
>
> Are we suggesting that x86 must always build with cpuidle?
> I'm sure that somebody someplace will object to that.

Arjan argued that since almost everyone today runs cpuidle,
it may be best to include it in the kernel
(https://lkml.org/lkml/2010/10/20/243). But yes, we agreed
that we would have to make cpuidle lighter incrementally.
Making the ladder governor optional could be one way, for example.
>
> OTOH, if cpuidle is included, I'd like to see the
> non-cpuidle code excluded, since nobody will run it...
>
>> In theory I agree that we can maybe do without list based
>> registration i.e probe and pick the best for the platform, but things
>> may become less predictable and difficult to manage as
>> we have more and more platforms and drivers.
>> By directly calling into cpuidle, we already have arch default
>> other than intel_idle and acpi_idle. Then APM and xen (though
>> it uses default_idle) also have their own idle routines.
>> List based management and selection based on priority would provide
>
> Does anybody actually use the latest kernel in APM mode?
> I'm not even sure the last version of Windows that would talk to APM,
> it was whatever was before Windows-95, I think.
>
> But don't get me wrong, I agree that pm_idle should go.
> I agree that cpuidle should have a default other than
> the polling loop it currently uses.

Sure, thanks

-Trinabh

2011-03-24 14:28:36

by Trinabh Gupta

[permalink] [raw]
Subject: Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm



On 03/24/2011 02:02 AM, Len Brown wrote:
>>> Also wondering why you would ever have a different idle routine on
>>> different cpus?
>>
>> Yes, this is an ongoing debate. Apparently it is a possibility
>> because of ACPI bugs. CPU's can have asymmetric C-states
>> and overall different idle routines on different cpus. Please
>> refer to http://lkml.org/lkml/2009/9/24/132 and
>> https://lkml.org/lkml/2011/2/10/37 for a discussion around this.
>
> Althought the ACPI specification allows the BIOS to tell the OS
> about different C-states per-processor, I know of zero system
> in the field and zero systems in development that require that
> capability. That isn't a guarantee that capability will never
> be used, but I'm not holding my breath.
>
> If there are systems with broken tables that make them
> appear asymetric, then we should have a workaround that handles
> that case, rather than complicating the normal code for
> the broken case.
>
> So I recommend deleting the extra per-cpu registration stuff
> unless there is some other architecture that requires it
> and can't hadle the asymmetry in another way.

Yes, let's go forward with the removal of per-cpu registration
and handle the rare case of asymmetry in some other way.

Using the intersection or union of C-states across cpus may
be a solution. Using the intersection, i.e. the lowest common set of
C-states, has a corner case: packages/cores could start supporting a
new, lower C-state when a thermal limit is hit, and they would want
the OS to go to this state. Using the intersection would prevent
this.

Another option is to use the union of C-states, but I am not sure
what happens if a CPU enters a state that was not reported for it.

Maybe there is some other way to handle asymmetry?

>
>> I have posted a patch series that does global registration
>> i.e same idle routines for each cpu. Please check
>> http://lkml.org/lkml/2011/3/22/161 . That series applies on
>> top of this series. Global registration significantly
>> simplifies the design, but still we are not sure about the
>> direction to take.
>
> I'll review that.

Thanks; please review especially the data structure changes
https://lkml.org/lkml/2011/3/22/162

-Trinabh

2011-03-24 16:21:49

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm

* Trinabh Gupta <[email protected]> [2011-03-24 19:58:29]:

>
>
> On 03/24/2011 02:02 AM, Len Brown wrote:
> >>>Also wondering why you would ever have a different idle routine on
> >>>different cpus?
> >>
> >>Yes, this is an ongoing debate. Apparently it is a possibility
> >>because of ACPI bugs. CPU's can have asymmetric C-states
> >>and overall different idle routines on different cpus. Please
> >>refer to http://lkml.org/lkml/2009/9/24/132 and
> >>https://lkml.org/lkml/2011/2/10/37 for a discussion around this.
> >
> >Althought the ACPI specification allows the BIOS to tell the OS
> >about different C-states per-processor, I know of zero system
> >in the field and zero systems in development that require that
> >capability. That isn't a guarantee that capability will never
> >be used, but I'm not holding my breath.
> >
> >If there are systems with broken tables that make them
> >appear asymetric, then we should have a workaround that handles
> >that case, rather than complicating the normal code for
> >the broken case.
> >
> >So I recommend deleting the extra per-cpu registration stuff
> >unless there is some other architecture that requires it
> >and can't hadle the asymmetry in another way.

Hi Len,

The fear of breaking legacy hardware is what is holding us back.
Arjan also pointed out that there are systems that have asymmetric
C-states, either intentionally or due to a bug. I agree with you
that we should remove per-cpu registration at this point and move
forward with a single registration. We can work around the corner
cases.

> Yes, lets go forward with removal of per-cpu registration
> and handle rare case of asymmetry in some other may.

Yes. Let's discuss the design/problems on the other patch series.
(http://lkml.org/lkml/2011/3/22/161)

> Using intersection or union of C-states for each cpu may
> be a solution. Using intersection or lowest common C-state
> has the corner case that we could have packages/cores
> supporting a new lower C-state in case of thermal limit and
> they would want OS to go to this state. Using intersection
> or lowest common C-state may prevent this.
>
> Another option is to use union of C-states;
> but I am not sure what happens if a CPU uses a state that
> is not reported for it???
>
> Maybe there is some other way to handle asymmetry ??

The simplest method would be a union of all C-states. Basically,
let the CPU that reports the most C-states register or override the
current setup. During boot-up, allow the first CPU to do the
registration, but override it later if another CPU comes up with
more C-states.

This will work assuming that entering a new (deeper) C-state on CPUs
that did not report it will cause no harm. It should degenerate to
the closest supported C-state.

> >
> >>I have posted a patch series that does global registration
> >>i.e same idle routines for each cpu. Please check
> >>http://lkml.org/lkml/2011/3/22/161 . That series applies on
> >>top of this series. Global registration significantly
> >>simplifies the design, but still we are not sure about the
> >>direction to take.
> >
> >I'll review that.
> Thanks; please review especially the data structure changes
> https://lkml.org/lkml/2011/3/22/162

Single registration will allow us to combine struct cpuidle_device and
struct cpuidle_driver, and simplify the framework for large systems.
AFAIK other archs like powerpc or ARM would not need per-cpu
definitions.

--Vaidy

2011-03-24 16:32:32

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [RFC PATCH V4 3/5] cpuidle: default idle driver for x86

* Trinabh Gupta <[email protected]> [2011-03-23 15:01:14]:

>
> On 03/23/2011 08:43 AM, Len Brown wrote:
> >Why is this patch a step forward?
>
> Hi Len,
>
> I have basically moved the code for arch default and mwait
> idle from arch/x86/kernel/process.c to a driver. This was
> suggested by Venki (https://lkml.org/lkml/2010/10/19/460)
> as part of pm_idle cleanup and direct call of
> cpuidle_idle_call(). There is not much new code here.
>
> >
> >>+obj-$(CONFIG_X86) += default_driver.o
> >
> >BTW, that's a pretty generic name for an x86 specific idle driver...
> >
> >I think that on builds that support intel_idle and acpi_idle,
> >everything in this file will be unused, unless somebody uses some
> >debugging cmdline params that should have been deleted ages ago.
>
> Yes, I agree that the name has to be x86 specific. I think the
> routines would be used for pre-nehalem architectures that use
> arch default or mwait.

Mainly, the selection between default_idle (safe_halt), mwait_idle
and c1e_idle needs to be placed in a default driver. This is the
code that sat 'outside' the cpuidle framework and was invoked
directly through pm_idle. It is mostly unused and overridden by
intel_idle or acpi_idle, but it still cannot be discarded.

Maybe keep this as a module and probe/load it only if both
intel_idle and acpi_idle fail to load or are excluded by the command
line or otherwise.

--Vaidy

2011-03-24 16:52:12

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection

* Trinabh Gupta <[email protected]> [2011-03-24 19:43:43]:


[snip]

> >>But we also have to replace the functionality provided by pm_idle,
> >>i.e. call default_idle for platforms where no better idle routine
> >>exists, call mwait for pre-nehalem platforms, use intel_idle or
> >>acpi_idle for nehalem architectures etc. To manage all this
> >>we need a registration mechanism which is conveniently provided
> >>by cpuidle.
> >
> >It isn't immediately clear to me that all of these options
> >need to be preserved.
>
> So what do you suggest can be removed?

Can we use safe_halt() until intel_idle/acpi_idle take over? But
what if they do not take over? If safe_halt() is not much worse than
the variants like mwait_idle and c1e_idle, then we can remove the
old code and there is no need to move it to a default driver.

> >Are we suggesting that x86 must always build with cpuidle?
> >I'm sure that somebody someplace will object to that.
>
> Arjan argued that since almost everyone today runs cpuidle
> it may be best to include it in the kernel
> (https://lkml.org/lkml/2010/10/20/243). But yes, we agreed
> that we would have to make cpuidle lighter incrementally.
> Making ladder governor optional could be one way for example.
> >
> >OTOH, if cpuidle is included, I'd like to see the
> >non-cpuidle code excluded, since nobody will run it...

The non-cpuidle code would be select_idle_routine() and the related
functions that can move to a default_driver that registers with
cpuidle. We can load it on demand as a module if the better routines
fail to register. Maybe we don't need this at all, as discussed in
the point above?

--Vaidy

[snip]

2011-03-25 07:05:52

by Len Brown

[permalink] [raw]
Subject: Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection

> I think there are other problems too, related to saving and restoring
> of pm_idle pointer. For example, cpuidle itself saves current value
> of pm_idle, flips it and then restores the saved value. There is
> no guarantee that the saved function still exists. APM does exact
> same thing (though it may not be used these days).
>
> The problem also is that a number of architectures have copied the
> same design based on pm_idle; so its spreading.

pm_idle is a primitive design, yes, but I think the issue
with pm_idle is a theoretical one, at least on x86,
as there isn't any other code scribbling on pm_idle
in practice. So this is clean-up, rather than bug-fix, work...

> > It isn't immediately clear to me that all of these options
> > need to be preserved.
>
> So what do you suggest can be removed?

I sent a series of small patches yesterday to get the ball rolling...
https://lkml.org/lkml/2011/3/24/54

I think the xen thing can go away.

I proposed that APM be removed entirely,
but that will take a few releases to conclude....

> > Are we suggesting that x86 must always build with cpuidle?
> > I'm sure that somebody someplace will object to that.
>
> Arjan argued that since almost everyone today runs cpuidle
> it may be best to include it in the kernel
> (https://lkml.org/lkml/2010/10/20/243). But yes, we agreed
> that we would have to make cpuidle lighter incrementally.
> Making ladder governor optional could be one way for example.

ladder is already optional.

cheers,
-Len Brown, Intel Open Source Technology Center

2011-03-25 07:13:37

by Len Brown

[permalink] [raw]
Subject: Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection

> > So what do you suggest can be removed?
>
> Can we use safe_halt() until intel_idle/acpi_idle take over? But what
> if they do not take over? If safe_halt() is not very bad compared to
> the variants like mwait_idle and c1e_idle, then we can remove the old
> code and no need to move them to default driver.

One reason I'd like a default cpuidle driver is that today
there is a race: cpuidle registers, but until its driver
registers it will use polling. Go ahead and look:

grep . /sys/devices/system/cpu/cpu*/cpuidle/state0/usage

that should be 0, but it isn't...

> > >Are we suggesting that x86 must always build with cpuidle?
> > >I'm sure that somebody someplace will object to that.
> >
> > Arjan argued that since almost everyone today runs cpuidle
> > it may be best to include it in the kernel
> > (https://lkml.org/lkml/2010/10/20/243). But yes, we agreed
> > that we would have to make cpuidle lighter incrementally.
> > Making ladder governor optional could be one way for example.
> > >
> > >OTOH, if cpuidle is included, I'd like to see the
> > >non-cpuidle code excluded, since nobody will run it...
>
> The non-cpuidle code will be the select_idle_routine() and related
> function that cam move to default_driver that register to cpuidle.
> We can load on-demand as module if better routines fail to register.
> Maybe we don't need this at all as discussed in the above point?

Right, though I don't share your fascination with modules.

cheers,
Len Brown, Intel Open Source Technology Center

2011-03-25 07:19:48

by Len Brown

[permalink] [raw]
Subject: Re: [RFC PATCH V4 4/5] cpuidle: driver for xen

> On Thu, Mar 24, 2011 at 03:18:14AM -0400, Len Brown wrote:
> > Is a CONFIG_XEN kernel supposed to use just HLT in idle?
>
> For right now..
> >
> > xen_arch_setup() does this:
> >
> > pm_idle = default_idle;
> > boot_option_idle_override = IDLE_HALT;
> >
> > which has that effect. I guess this makes sense b/c the
> > CONFIG_XEN kernel is Dom0 and the real C-sates are done
> > by the hypervisor?
>
> Correct. There are some patches that make the C-states
> be visible in the Linux kernel, but that hasn't been ported
> over yet.

The Xen Dom0 kernel will trap into the hypervisor
whenever it does a HLT or an MWAIT, yes?

What is the benefit of having Dom0 decide between
C-states that it can't actually enter?

What is the mechanism by which those C-states are
made visible to Dom0, and how are those states
related to the states that are supported on
the bare iron?

> > Would the same CONFIG_XEN kernel binary ever not
> > run xen_arch_setup(), run on raw hardware, and want
> > to use idle states other than HLT?

> ever not? I am not sure of the question, so let me state:
> The Linux kernel if compiled with CONFIG_XEN, and if run on
> native hardware, would _never_ run 'xen_arch_setup()'*. It would
> run the normal, native type setup.

Thanks, that clarifies.
-Len Brown, Intel Open Source Technolgy Center

2011-03-25 07:24:38

by Len Brown

[permalink] [raw]
Subject: Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm


> Maybe there is some other way to handle asymmetry ??

When somebody invents a system that needs asymmetry on purpose,
that is the time to get fancy. Until that day, simply print a
FW_BUG WARNING and fall back to default_idle().

cheers,
Len Brown, Intel Open Source Technology Center

2011-03-25 14:39:14

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [RFC PATCH V4 4/5] cpuidle: driver for xen

On 03/24/2011 12:05 PM, Konrad Rzeszutek Wilk wrote:
> On Thu, Mar 24, 2011 at 03:18:14AM -0400, Len Brown wrote:
>> Is a CONFIG_XEN kernel supposed to use just HLT in idle?
> For right now..

For always, I should think.

>> xen_arch_setup() does this:
>>
>> pm_idle = default_idle;
>> boot_option_idle_override = IDLE_HALT;
>>
>> which has that effect. I guess this makes sense b/c the
>> CONFIG_XEN kernel is Dom0 and the real C-sates are done
>> by the hypervisor?
> Correct. There are some patches that make the C-states
> be visible in the Linux kernel, but that hasn't been ported
> over yet.

All we need is for the idle CPU to block in the hypervisor; a plain
"hlt" is always going to be sufficient (which is overridden as a pvop
into a sched_idle hypercall).
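
(Roughly, that override looks like the following in
arch/x86/xen/irq.c -- quoting from memory, so treat it as a sketch:

static void xen_safe_halt(void)
{
	/* Blocking includes an implicit local_irq_enable(). */
	if (HYPERVISOR_sched_op(SCHEDOP_block, NULL) != 0)
		BUG();
}

with .safe_halt = xen_safe_halt wired up in xen's pv_irq_ops, so
default_idle()'s safe_halt() ends up blocking in the hypervisor.)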

Xen will choose an appropriate power state for the physical cpus
depending on the overall busyness of the system (which any individual
virtual machine can't determine).

J

2011-03-25 14:43:23

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [RFC PATCH V4 4/5] cpuidle: driver for xen

On 03/25/2011 07:19 AM, Len Brown wrote:
> The Xen Dom0 kernel will trap into the hypervisor
> whenever it does a HLT or an MWAIT, yes?

Yes, on hlt.

> What is the benefit of having Dom0 decided between
> C-states that it can't actually enter?

There might be a slight benefit to allow a domain to tell Xen what its
overall utilisation is (ie, "I'd like this VCPU to run, but it isn't
very important so you can take that into account when choosing
scheduling priority and/or PCPU performance"). But there's nothing like
that at present.

> What is the mechanism by which those C-states are
> made visible to Dom0, and how are those states
> related to the states that are supported on
> the bare iron?

Because dom0 is the official ACPI owner (ie, it has the AML
interpreter), we need dom0 to handle complex interactions with ACPI (the
hypervisor can do simple table parsing). At present the mechanism for
power states is that dom0 gets them out of ACPI, and then passes them to
Xen which actually uses them. But no guest kernel has any runtime use
of power states.

J

2011-03-25 15:36:11

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection

On Fri, Mar 25, 2011 at 03:05:36AM -0400, Len Brown wrote:
> > I think there are other problems too, related to saving and restoring
> > of pm_idle pointer. For example, cpuidle itself saves current value
> > of pm_idle, flips it and then restores the saved value. There is
> > no guarantee that the saved function still exists. APM does exact
> > same thing (though it may not be used these days).
> >
> > The problem also is that a number of architectures have copied the
> > same design based on pm_idle; so its spreading.
>
> pm_idle is a primitive design yes, but I think the issue
> with pm_idle is a theoretical one, at least on x86;
> as there isn't any other code scribbling on pm_idle
> in practice. So this is clean-up, rather than bug-fix work...
>
> > > It isn't immediately clear to me that all of these options
> > > need to be preserved.
> >
> > So what do you suggest can be removed?
>
> I sent a series of small patches yesterday to get the ball rolling...
> https://lkml.org/lkml/2011/3/24/54
>
> I think the xen thing can go away.

The xen thing being the setting of cpuidle to halt or the proposed
patch?

2011-03-25 18:02:12

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm

* Len Brown <[email protected]> [2011-03-25 03:24:24]:

>
> > Maybe there is some other way to handle asymmetry ??
>
> When somebody invents a system that needs asymmetry on purpose,
> that is the time to get fancy. Until that day, simply print a
> FW_BUG WARNING and fall back to default_idle().

We may print a warning, but we should not switch to default_idle().
There could be laptops that get an extra C-state when run on
battery. So when the ACPI _PPC event happens for the C-state change,
we need to handle the situation such that the states are refreshed
for all CPUs in one step.

Basically, while we are updating to the new C-state, the framework
should not detect this as asymmetric and switch to default_idle().
Please feel free to suggest a more elegant solution for handling the
runtime C-state change event, which is nevertheless symmetric across
all CPUs.

--Vaidy

2011-03-31 02:03:32

by Len Brown

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [RFC PATCH V4 4/5] cpuidle: driver for xen

> >> Is a CONFIG_XEN kernel supposed to use just HLT in idle?
> > For right now..
>
> For always, I should think.

Yay!

> >> xen_arch_setup() does this:
> >>
> >> pm_idle = default_idle;
> >> boot_option_idle_override = IDLE_HALT;
> >>
> >> which has that effect. I guess this makes sense b/c the
> >> CONFIG_XEN kernel is Dom0 and the real C-sates are done
> >> by the hypervisor?
> > Correct. There are some patches that make the C-states
> > be visible in the Linux kernel, but that hasn't been ported
> > over yet.
>
> All we need is for the idle CPU to block in the hypervisor; a plain
> "hlt" is always going to be sufficient (which is overridden as a pvop
> into a sched_idle hypercall).
>
> Xen will choose an appropriate power state for the physical cpus
> depending on the overall busyness of the system (which any individual
> virtual machine can't determine).

Okay, knowing that the Dom0 kernel

1. can boot in non-xen mode on bare hardware and run cpuidle
2. needs just HLT when booted in xen mode

will help us keep things simple.

thanks,
Len Brown, Intel Open Source Technology Center

2011-03-31 02:17:55

by Len Brown

[permalink] [raw]
Subject: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm)

> > > Maybe there is some other way to handle asymmetry ??

I mis-spoke on asymmetry.

Moorestown is already an example of an asymmetric system,
since its deepest c-state is available on cpu0, but not on cpu1.
So it needs different tables for each cpu.

I think what would work is a default c-state table for the system,
plus the ability to provide a per-cpu override table. I think that
would gracefully handle the case of many identical cpus, and also
systems with different tables per cpu.
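
Something like this, perhaps (names are hypothetical, just to
illustrate the idea):

/* one shared default table, with an optional per-cpu override */
struct cpuidle_state_table {
	int			state_count;
	struct cpuidle_state	states[CPUIDLE_STATE_MAX];
};

static struct cpuidle_state_table default_table;
static DEFINE_PER_CPU(struct cpuidle_state_table *, override_table);

static struct cpuidle_state_table *cpu_state_table(int cpu)
{
	struct cpuidle_state_table *t = per_cpu(override_table, cpu);

	return t ? t : &default_table;	/* fall back to the shared table */
}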

The same goes for write-access to the tables.
In the typical case, a single table can be shared for the entire system
and nobody will be writing to it. However, with the governor changes
to call dev->prepare and sift through all the states to find the
legal one with the lowest power_usage... There is software today
out of tree that updates that power_usage entry from prepare().

As I mentioned, I'm not fond of that mechanism - it looks racy
to me. I'd rather see the capability for a driver's idle handler
to demote to another handler in the driver, with the accounting
not getting messed up when that happens. I think the way to do that
is to let the driver do the accounting rather than doing it in
the cpuidle caller.

cheers,
-Len Brown, Intel Open Source Technology Center

2011-03-31 02:26:03

by Len Brown

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [RFC PATCH V4 2/5] cpuidle: list based cpuidle driver registration and selection




On Fri, 25 Mar 2011, Konrad Rzeszutek Wilk wrote:

> On Fri, Mar 25, 2011 at 03:05:36AM -0400, Len Brown wrote:
> > > I think there are other problems too, related to saving and restoring
> > > of pm_idle pointer. For example, cpuidle itself saves current value
> > > of pm_idle, flips it and then restores the saved value. There is
> > > no guarantee that the saved function still exists. APM does exact
> > > same thing (though it may not be used these days).
> > >
> > > The problem also is that a number of architectures have copied the
> > > same design based on pm_idle; so its spreading.
> >
> > pm_idle is a primitive design yes, but I think the issue
> > with pm_idle is a theoretical one, at least on x86;
> > as there isn't any other code scribbling on pm_idle
> > in practice. So this is clean-up, rather than bug-fix work...
> >
> > > > It isn't immediately clear to me that all of these options
> > > > need to be preserved.
> > >
> > > So what do you suggest can be removed?
> >
> > I sent a series of small patches yesterday to get the ball rolling...
> > https://lkml.org/lkml/2011/3/24/54
> >
> > I think the xen thing can go away.
>
> The xen thing being the setting of cpuidle to halt or the proposed
> patch?

I don't think Xen needs a cpuidle driver.
Xen just needs to tell the kernel to call HALT,
and I think we'll keep that around for the non-cpuidle case,
and for the idle periods before cpuidle initializes.

cheers,
Len Brown, Intel Open Source Technology Center

2011-03-31 13:20:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm)

On Wed, 2011-03-30 at 22:17 -0400, Len Brown wrote:
>
> Moorestown is already an example of an asymmetric system,
> since its deepest c-state is available on cpu0, but not on cpu1.
> So it needs different tables for each cpu.

wtf are these hardware guys smoking and how the heck are we supposed to
schedule on such a machine? Prefer to keep cpu1 busy while idling cpu0?

2011-03-31 21:27:09

by Len Brown

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [RFC PATCH V4 4/5] cpuidle: driver for xen

> > >> xen_arch_setup() does this:
> > >>
> > >> pm_idle = default_idle;
> > >> boot_option_idle_override = IDLE_HALT;

What happens on a Xen kernel if these lines are not there?
Does Xen export the C-state tables to the Dom0 kernel, and does the
Dom0 kernel have an acpi processor driver, so that it would try to
use all the C-states?

If yes, must Xen show those tables to Dom0?
If it did not, then the lines above would not be necessary,
as in the absence of any C-states, the kernel should
use halt by default.

thanks,
-Len Brown, Intel Open Source Technology Center

2011-03-31 22:37:19

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [RFC PATCH V4 4/5] cpuidle: driver for xen

On 03/31/2011 02:26 PM, Len Brown wrote:
>>>>> xen_arch_setup() does this:
>>>>>
>>>>> pm_idle = default_idle;
>>>>> boot_option_idle_override = IDLE_HALT;
> What happens on a Xen kernel if these lines are not there?
> Does Xen export the C-states tables to Dom0 kernel, and the Dom0
> kernel has an acpi processor driver, and thus it would try to
> use all the C-states?

If they're not there, it tries to use the Intel cpuidle driver,
which fails (it just hangs forever in idle, I think).

> If yes, must Xen show those tables to Dom?
> If it did not, then the lines above would not be necessary,
> as in the absence of any C-states, the kernel should
> use halt by default.

The dom0 kernel gets all the ACPI state so it can get all the juicy
goodness from it. It does extract the C state info, but passes it back
to Xen rather than use it itself. We don't generally try to filter the
ACPI state before letting dom0 see it (DMAR being the exception, since
dom0 really has no business knowing about that).

(We have this basic problem that neither Xen nor dom0 are the ideal
owners of ACPI. In principle Xen should own ACPI as the most privileged
"OS", but it really only cares about things like power states, interrupt
routing, system topology, busses, etc. But dom0 cares about lid
switches, magic keyboard keys, volume controls, video output switching,
etc, etc. At the moment it seems to work best if dom0 does all the
ACPI processing and then passes Xen the parts it needs, which are
generally fixed-at-startup config info items.)

J

2011-04-01 03:04:11

by Len Brown

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [RFC PATCH V4 4/5] cpuidle: driver for xen

> >>>>> xen_arch_setup() does this:
> >>>>>
> >>>>> pm_idle = default_idle;
> >>>>> boot_option_idle_override = IDLE_HALT;
> > What happens on a Xen kernel if these lines are not there?
> > Does Xen export the C-states tables to Dom0 kernel, and the Dom0
> > kernel has an acpi processor driver, and thus it would try to
> > use all the C-states?
>
> If they're no there it tries to use the Intel cpuidle driver, which
> fails (just hangs forever in idle, I think).
>
> > If yes, must Xen show those tables to Dom?
> > If it did not, then the lines above would not be necessary,
> > as in the absence of any C-states, the kernel should
> > use halt by default.
>
> The dom0 kernel gets all the ACPI state so it can get all the juicy
> goodness from it. It does extract the C state info, but passes it back
> to Xen rather than use it itself. We don't generally try to filter the
> ACPI state before letting dom0 see it (DMAR being the exception, since
> dom0 really has no business knowing about that).
>
> (We have this basic problem that neither Xen nor dom0 are the ideal
> owners of ACPI. In principle Xen should own ACPI as the most privileged
> "OS", but it really only cares about things like power states, interrupt
> routing, system topology, busses, etc. But dom0 cares about lid
> switches, magic keyboard keys, volume controls, video output switching,
> etc, etc. At the moment it seems to work best if dom0 do all ACPI
> processing then pass Xen the parts it needs, which are generally
> fixed-at-startup config info items.)

Okay.
Since a Xen kernel can also boot on bare iron, and thus
includes cpuidle, acpi_idle, intel_idle; and when
in Dom0 mode it simply wants to run HLT in idle...

I think what we want to do is simply disable cpuidle
when booted in Dom0 mode. That will allow it to
fall back to default_idle w/o xen_setup needing to
tinker with installing an idle driver.

I'll send a patch in the next series.
If you can try it out, that would be great.

thanks,
-Len

2011-04-01 04:09:39

by Len Brown

[permalink] [raw]
Subject: Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm)

> > Moorestown is already an example of an asymmetric system,
> > since its deepest c-state is available on cpu0, but not on cpu1.
> > So it needs different tables for each cpu.
>
> wtf are these hardware guys smoking and how the heck are we supposed to
> schedule on such a machine? Prefer to keep cpu1 busy while idling cpu0?

they are smoking micro-amps:-)

S0i3 on cpu0 can be entered only after cpu1 is already off-line,
among other system hardware dependencies...

So it makes no sense to export S0i3 as a c-state on cpu1.

When cpu1 is online, the scheduler treats it as a normal SMP.

cheers,
-Len Brown, Intel Open Source Technology Center

2011-04-01 07:02:13

by Trinabh Gupta

[permalink] [raw]
Subject: Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm)



On 03/31/2011 07:47 AM, Len Brown wrote:
>>>> Maybe there is some other way to handle asymmetry ??
>
> I mis-spoke on asymmetry.
>
> Moorestown is already an example of an asymmetric system,
> since its deepest c-state is available on cpu0, but not on cpu1.
> So it needs different tables for each cpu.
>
> I think what would work is a default c-state table for the system,
> and the ability of a per-cpu override table. I think that would
> gracefully handle the case of many identical cpus, and also systems
> with different tables per cpu.

Hi Len,

What would happen if a cpu enters a state which wasn't
reported for it? I am wondering whether an approach like taking the
union of the states of each cpu would work.

Can we handle asymmetry through checking and demotion inside the
idle routine itself, just like you are proposing as the alternative
to dev->prepare? But I guess this may not be efficient if it happens
often.

I am not sure that having a per-cpu override table would be very
tidy (ideas?), or much better than the current per-cpu registration.
So I just want to check what the best way forward would be.

>
> The same goes for write-access to the tables.
> In the typical case, a single table can be shared for the entire system
> and nobody will be writing to it. However, with the governor changes
> to call dev->prepare and sift through all the states to find the
> legal one with the lowest power_usage... There is software today
> out of tree that updates that power_usage entry from prepare().
>
> As I mentioned, I'm not fond of that mechanism - it looks racey
> to me. I'd rather see the capability of a drivers idle handler
> to demote to another handler in the driver and for the accounting
> to not get messed up when that happens. I think the way to do that
> is to let the driver do the accounting rather than doing it in
> the cpuidle caller.

I agree with this; we should move the statistics update into the
driver routines (the enter routines). They already take the device
and state structures as parameters.
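
For instance, the enter routine could do something like this
(sketch using the usage/time fields of the current struct
cpuidle_state; my_idle_enter and its safe_halt() body are just
placeholders):

static int my_idle_enter(struct cpuidle_device *dev,
			 struct cpuidle_state *state)
{
	ktime_t t0, t1;
	s64 us;

	t0 = ktime_get();
	safe_halt();			/* the actual low-power entry */
	t1 = ktime_get();

	us = ktime_to_us(ktime_sub(t1, t0));
	state->usage++;			/* accounting done by the driver */
	state->time += us;

	return (int)us;
}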

Thanks,
-Trinabh

2011-04-01 08:15:33

by Dipankar Sarma

[permalink] [raw]
Subject: Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm)

On Fri, Apr 01, 2011 at 12:09:25AM -0400, Len Brown wrote:
> > > Moorestown is already an example of an asymmetric system,
> > > since its deepest c-state is available on cpu0, but not on cpu1.
> > > So it needs different tables for each cpu.
> >
> > wtf are these hardware guys smoking and how the heck are we supposed to
> > schedule on such a machine? Prefer to keep cpu1 busy while idling cpu0?
>
> they are smoking micro-amps:-)
>
> S0i3 on cpu0 can be entered only after cpu1 is already off-line,
> among other system hardware dependencies...
>
> So it makes no sense to export S0i3 as a c-state on cpu1.
>
> When cpu1 is online, the scheduler treats it as a normal SMP.

Isn't S0i3 a "system" state, as opposed to a cpu state? Perhaps
we can treat it as such and still keep the C-states symmetric.
The system can transition to S0i3 from any cpu as long as the others
are offlined, no? In that sense, really all the cpus would
be in S0i3, which would make it symmetric. If this isn't how
mrst cpuidle works, then cpuidle accounting may be broken
in principle because of this.

Thanks
Dipankar

2011-04-01 14:03:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm)

On Fri, 2011-04-01 at 00:09 -0400, Len Brown wrote:
> > > Moorestown is already an example of an asymmetric system,
> > > since its deepest c-state is available on cpu0, but not on cpu1.
> > > So it needs different tables for each cpu.
> >
> > wtf are these hardware guys smoking and how the heck are we supposed to
> > schedule on such a machine? Prefer to keep cpu1 busy while idling cpu0?
>
> they are smoking micro-amps:-)

Has anybody told them that pushing lots of logic into software generally
burns more amps because it keeps the thing running longer?

> S0i3 on cpu0 can be entered only after cpu1 is already off-line,
> among other system hardware dependencies...
>
> So it makes no sense to export S0i3 as a c-state on cpu1.
>
> When cpu1 is online, the scheduler treats it as a normal SMP.

Dipankar's reply seems to address this issue well.

2011-04-01 14:38:52

by Arjan van de Ven

[permalink] [raw]
Subject: Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm)

On 4/1/2011 1:15 AM, Dipankar Sarma wrote:
> On Fri, Apr 01, 2011 at 12:09:25AM -0400, Len Brown wrote:
>>>> Moorestown is already an example of an asymmetric system,
>>>> since its deepest c-state is available on cpu0, but not on cpu1.
>>>> So it needs different tables for each cpu.
>>> wtf are these hardware guys smoking and how the heck are we supposed to
>>> schedule on such a machine? Prefer to keep cpu1 busy while idling cpu0?
>> they are smoking micro-amps:-)
>>
>> S0i3 on cpu0 can be entered only after cpu1 is already off-line,
>> among other system hardware dependencies...
>>
>> So it makes no sense to export S0i3 as a c-state on cpu1.
>>
>> When cpu1 is online, the scheduler treats it as a normal SMP.
> Isn't S0i3 a "system" state, as opposed to cpu state ?

it's misnamed. it's a C state to the OS.

2011-04-03 16:19:08

by Dipankar Sarma

[permalink] [raw]
Subject: Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm)

On Fri, Apr 01, 2011 at 07:38:23AM -0700, Arjan van de Ven wrote:
> On 4/1/2011 1:15 AM, Dipankar Sarma wrote:
> >On Fri, Apr 01, 2011 at 12:09:25AM -0400, Len Brown wrote:
> >>>>Moorestown is already an example of an asymmetric system,
> >>>>since its deepest c-state is available on cpu0, but not on cpu1.
> >>>>So it needs different tables for each cpu.
> >>>wtf are these hardware guys smoking and how the heck are we supposed to
> >>>schedule on such a machine? Prefer to keep cpu1 busy while idling cpu0?
> >>they are smoking micro-amps:-)
> >>
> >>S0i3 on cpu0 can be entered only after cpu1 is already off-line,
> >>among other system hardware dependencies...
> >>
> >>So it makes no sense to export S0i3 as a c-state on cpu1.
> >>
> >>When cpu1 is online, the scheduler treats it as a normal SMP.
> >Isn't S0i3 a "system" state, as opposed to cpu state ?
>
> it's misnamed. it's a C state to the OS.

I understand that it has been implemented as a C-state from this -
http://build.meego.com/package/view_file?file=linux-2.6.37-mrst-s0i3.patch&package=kernel-adaptation-mrst&project=home%3Adliu9&srcmd5=a0929a2863150f5c8454507d6cd8f09d

The key question is this -

+int mrst_check_state_availability(struct cpuidle_device *dev)
+{
+ int cpu = smp_processor_id();
+
+ /*
+ * If there is another CPU running, the GPU is active,
+ * the PMU is uninitialized, or there is a still-unprocessed
+ * PMU command, we cannot enter S0i3.
+ */
+ if (!pmu_reg || !cpumask_equal(cpu_online_mask, cpumask_of(cpu)) ||
+ s0i3_pmu_command_pending)
+ dev->states[5].flags |= CPUIDLE_FLAG_IGNORE;
+ else
+ dev->states[5].flags &= ~CPUIDLE_FLAG_IGNORE;

Is this really asymmetric? Or is it that if the other
cpu(s) are in C6, the chip can enter S0i3? If that is the case,
then isn't this equivalent to all the cpus in the chip
entering S0i3, and thus symmetrical? Also, if going to S0i3
relies on the other cpus being offlined, then we don't need to
worry about asymmetry from the scheduler/cpuidle point of
view, no?

Thanks
Dipankar

2011-04-04 14:33:13

by Dipankar Sarma

[permalink] [raw]
Subject: Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm)

On Fri, Apr 01, 2011 at 04:02:36PM +0200, Peter Zijlstra wrote:
>
> > S0i3 on cpu0 can be entered only after cpu1 is already off-line,
> > among other system hardware dependencies...
> >
> > So it makes no sense to export S0i3 as a c-state on cpu1.
> >
> > When cpu1 is online, the scheduler treats it as a normal SMP.
>
> Dipankar's reply seems to address this issue well.

I can't find any Moorestown documentation at the Intel site, but
thinking about Len's inputs a bit more, it seems there may
still be an asymmetry problem from the scheduler perspective.

If either cpu0 or cpu1 can be offlined, there is no
asymmetry. If only cpu1 can be offlined, it would mean that
one cpu may be more efficient depending on how we do
cpu offlining for power savings. It gets a bit messy.

Len, what exactly is the significance of offlining here?
Apart from going to C6, what else is needed on cpu1 for
the chip to go to S0i3? Why is idle C6 not enough?

Thanks
Dipankar

2011-04-05 15:01:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm)

On Mon, 2011-04-04 at 20:02 +0530, Dipankar Sarma wrote:
> On Fri, Apr 01, 2011 at 04:02:36PM +0200, Peter Zijlstra wrote:
> >
> > > S0i3 on cpu0 can be entered only after cpu1 is already off-line,
> > > among other system hardware dependencies...
> > >
> > > So it makes no sense to export S0i3 as a c-state on cpu1.
> > >
> > > When cpu1 is online, the scheduler treats it as a normal SMP.
> >
> > Dipankar's reply seems to address this issue well.
>
> I can't find any Moorestown documentation at the Intel site, but
> thinking about Len's inputs a bit more, it seems there may
> be still a problem asymetry from the scheduler perspective.
>
> If cpu0 or cpu1 either of them can be offlined, there is no
> asymetry. If only cpu1 can be offlined, it would mean that
> one cpu may be more efficient depending on how we do
> cpu offlining for power savings. It gets a bit messy.
>
> Len, what exacty is the significance of offlining here ?
> Apart from going to C6, what else is needed in cpu1 for
> the chip to go to S0i3 ? Why is idle C6 not enough ?

I don't think offlining is relevant, anybody using that for power
management is doing it wrong, _very_ wrong.

2011-04-05 15:48:32

by Dipankar Sarma

[permalink] [raw]
Subject: Re: cpuidle asymmetry (was Re: [RFC PATCH V4 5/5] cpuidle: cpuidle driver for apm)

On Tue, Apr 05, 2011 at 05:01:32PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-04-04 at 20:02 +0530, Dipankar Sarma wrote:
> > On Fri, Apr 01, 2011 at 04:02:36PM +0200, Peter Zijlstra wrote:
> > I can't find any Moorestown documentation at the Intel site, but
> > thinking about Len's inputs a bit more, it seems there may
> > be still a problem asymetry from the scheduler perspective.
> >
> > If cpu0 or cpu1 either of them can be offlined, there is no
> > asymetry. If only cpu1 can be offlined, it would mean that
> > one cpu may be more efficient depending on how we do
> > cpu offlining for power savings. It gets a bit messy.
> >
> > Len, what exacty is the significance of offlining here ?
> > Apart from going to C6, what else is needed in cpu1 for
> > the chip to go to S0i3 ? Why is idle C6 not enough ?
>
> I don't think offlining is relevant, anybody using that for power
> management is doing it wrong, _very_ wrong.

I am suggesting that it depends on the offlining logic. If cpu1
is being used as an added co-processor for some specific apps
and is mostly offline otherwise, it may not be an issue. If
offlining is being used as a meta-scheduler over the kernel
scheduler (for power savings or whatever logic), then it
will cause asymmetry problems dependent on the offline logic - e.g.
it may be more advantageous for the kernel scheduler to schedule
on cpu0, keeping cpu1 free and leading to S0i3 more often. Not
advocating it when we are trying to run away from it on powerpc :)

For now, it seems we are OK handling S0i3 through cpuidle; it would
just be nice to understand the overall offline logic and why it is
needed. Similar questions have come up in my discussions with ARM
guys in recent times as well.

Thanks
Dipankar