2006-08-29 20:50:08

by Adam Belay

Subject: [RFC][PATCH 1/2] ACPI: Idle Processor PM Improvements

Hi All,

This patch improves the ACPI c-state selection algorithm. It also
includes a major cleanup and simplification of the processor idle code.

The new implementation considers the full menu of available c-states.
Just as in the previous implementation, decisions are primarily based on
the residency time of the last c-state entry. This is generally an
effective metric because it allows for detection of interrupt activity.
However, the new algorithm differs in that it does not promote or demote
through the c-states in succession. Rather, it immediately jumps to
whatever c-state has the best expected power consumption advantage for
the predicted residency time (i.e. the previously measured residency).
If the residency time is too short during a deep c-state entry, then the
cost of entering the state outweighs any power consumption advantage.
Similarly, if a shallow c-state is entered and resident for an
excessively long duration, then a potential opportunity to save more
power is missed.
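
As a rough illustration of the selection described above, the following
standalone C sketch picks the deepest state whose target residency does
not exceed the predicted residency. It is a simplification and not the
loop from the patch: `struct cstate`, `set_targets()` and `pick_state()`
are hypothetical names, and only `RESIDENCY_TO_LATENCY_RATIO` (5, so that
latency is about 20% of residency) is taken from the patch itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct acpi_processor_cx. */
struct cstate {
	int valid;              /* state is usable on this CPU */
	uint32_t latency_ticks; /* entry/exit latency in PM timer ticks */
	uint32_t target_ticks;  /* minimum residency that justifies entry */
};

/* Same ratio as the patch: latency should be about 20% of residency. */
#define RESIDENCY_TO_LATENCY_RATIO 5

/* Fill in target_ticks for every valid state (index 0 is unused). */
void set_targets(struct cstate *states, size_t n)
{
	assert(states != NULL);
	for (size_t i = 1; i < n; i++)
		if (states[i].valid)
			states[i].target_ticks =
				states[i].latency_ticks * RESIDENCY_TO_LATENCY_RATIO;
}

/*
 * Return the index of the deepest valid state whose target residency
 * does not exceed the predicted residency. Falls back to the shallowest
 * valid state when none qualifies, and returns 0 if no state is valid.
 */
size_t pick_state(const struct cstate *states, size_t n,
		  uint32_t predicted_ticks)
{
	size_t best = 0;

	assert(states != NULL);
	for (size_t i = 1; i < n; i++) {
		if (!states[i].valid)
			continue;
		if (best == 0 || states[i].target_ticks <= predicted_ticks)
			best = i;
	}
	return best;
}
```

Because the choice depends only on the predicted residency, a short
sleep can move the selection from the deepest state straight to the
shallowest one in a single step, without walking through the states in
between.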

The changes in this patch allow the ACPI idle processor mechanism to
react more quickly to sudden bursts of activity because it can jump
directly to whatever c-state is appropriate. However, because of the
"menu" nature of c-state selection, the code works best when ACPI
implementations expose all of the c-states supported by hardware.

The bus master activity mechanism has undergone similar improvements.
During capability detection, the deepest c-state that allows bus master
activity is determined. BM_STS is then polled each time the ACPI code
prepares to enter a c-state. If bus master activity is detected, then
the previously mentioned bus master capable c-state becomes the deepest
c-state allowed for that quantum. In contrast, the old implementation
would permit bus master activity to cause a demotion from one C3-type
state to the next shallower C3-type state, imposing unnecessary latency.
As a further optimization, BM_STS is cleared each time
acpi_processor_idle() is entered. This prevents any stale bus master
status from affecting c-state policy, as it may have occurred long ago
during scheduled work.
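
The bus master handling can be sketched in the same spirit. The code
below is an assumption-laden illustration, not the patch: `struct
bm_cstate`, `find_bm_veto_state()` and `apply_bm_veto()` are hypothetical
names. It only mirrors the two ideas stated above: find the deepest
valid state that is not C3-type once, then use it as a ceiling whenever
bus master activity is seen.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state types; the values mirror C1, C2 and C3. */
enum { TYPE_C1 = 1, TYPE_C2 = 2, TYPE_C3 = 3 };

/* Hypothetical, simplified per-state record (index 0 is unused). */
struct bm_cstate {
	int valid; /* state is usable on this CPU */
	int type;  /* TYPE_C1, TYPE_C2 or TYPE_C3 */
};

/*
 * Scan from the deepest state towards the shallowest and return the
 * first valid state that is not C3-type. Returns 0 if none exists.
 */
size_t find_bm_veto_state(const struct bm_cstate *states, size_t n)
{
	assert(states != NULL);
	for (size_t i = n; i-- > 1; ) {
		if (!states[i].valid)
			continue;
		if (states[i].type == TYPE_C3)
			continue;
		return i;
	}
	return 0;
}

/*
 * Cap the chosen state at the veto state for this quantum when bus
 * master activity was detected; otherwise leave the choice alone.
 */
size_t apply_bm_veto(size_t state_idx, size_t veto_idx, int bm_active)
{
	if (bm_active && state_idx > veto_idx)
		return veto_idx;
	return state_idx;
}
```

With two C3-type states above a C2 state, activity detected while
heading for the deeper C3-type state lands directly on C2 instead of on
the shallower C3-type state.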

Finally, it's worth mentioning that the bulk of c-state policy
calculations have been moved to take place before c-states are entered.
This should further reduce exit latency when returning from a c-state.

This algorithm has not yet been carefully benchmarked (e.g. bltk or
power meters). However, I can say with some confidence that it saves
slightly more power during an idle workload, and more still during
typical user-input oriented workloads such as word processing.

I would really appreciate any comments, suggestions, or testing.

Cheers,
Adam

P.S.: It would be great if we had an accurate way to determine the ticks
spent in the C1 state. Currently, I work around the issue by setting
"sleep_ticks" such that it promotes to the next deeper state during the
next quantum.

Patch is against 2.6.18-rc4.
Signed-off-by: Adam Belay <[email protected]>

---
drivers/acpi/processor_idle.c | 502 +++++++++++++++---------------------------
include/acpi/processor.h | 18 -
2 files changed, 184 insertions(+), 336 deletions(-)

--- a/drivers/acpi/processor_idle.c 2006-08-28 17:14:40.000000000 -0400
+++ b/drivers/acpi/processor_idle.c 2006-08-28 17:13:56.000000000 -0400
@@ -8,6 +8,8 @@
* - Added processor hotplug support
* Copyright (C) 2005 Venkatesh Pallipadi <[email protected]>
* - Added support for C3 on SMP
+ * Copyright (C) 2006 Adam Belay <[email protected]>
+ * - New policy algorithm, several cleanups
*
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*
@@ -52,8 +54,6 @@
ACPI_MODULE_NAME("acpi_processor")
#define ACPI_PROCESSOR_FILE_POWER "power"
#define US_TO_PM_TIMER_TICKS(t) ((t * (PM_TIMER_FREQUENCY/1000)) / 1000)
-#define C2_OVERHEAD 4 /* 1us (3.579 ticks per us) */
-#define C3_OVERHEAD 4 /* 1us (3.579 ticks per us) */
static void (*pm_idle_save) (void) __read_mostly;
module_param(max_cstate, uint, 0644);

@@ -61,15 +61,10 @@
module_param(nocst, uint, 0000);

/*
- * bm_history -- bit-mask with a bit per jiffy of bus-master activity
- * 1000 HZ: 0xFFFFFFFF: 32 jiffies = 32ms
- * 800 HZ: 0xFFFFFFFF: 32 jiffies = 40ms
- * 100 HZ: 0x0000000F: 4 jiffies = 40ms
- * reduce history for more aggressive entry into C3
+ * Currently, we aim for the entry/exit latency to be 20% of measured residency.
*/
-static unsigned int bm_history __read_mostly =
- (HZ >= 800 ? 0xFFFFFFFF : ((1U << (HZ / 25)) - 1));
-module_param(bm_history, uint, 0644);
+#define RESIDENCY_TO_LATENCY_RATIO 5
+
/* --------------------------------------------------------------------------
Power Management
-------------------------------------------------------------------------- */
@@ -165,6 +160,13 @@
return ((0xFFFFFFFF - t1) + t2);
}

+static atomic_t c3_cpu_count;
+
+/**
+ * acpi_processor_power_activate - prepares for the next power state
+ * @power: power data
+ * @new: the target power state
+ */
static void
acpi_processor_power_activate(struct acpi_processor *pr,
struct acpi_processor_cx *new)
@@ -176,10 +178,6 @@

old = pr->power.state;

- if (old)
- old->promotion.count = 0;
- new->demotion.count = 0;
-
/* Cleanup from old state. */
if (old) {
switch (old->type) {
@@ -207,330 +205,216 @@
return;
}

-static void acpi_safe_halt(void)
+
+/**
+ * acpi_check_bm_status - determines if there is BM activity
+ *
+ * Returns: a non-zero value to indicate BM activity
+ */
+static inline int acpi_check_bm_status(void)
{
- current_thread_info()->status &= ~TS_POLLING;
- smp_mb__after_clear_bit();
- if (!need_resched())
- safe_halt();
- current_thread_info()->status |= TS_POLLING;
-}
+ u32 bm_status;

-static atomic_t c3_cpu_count;
+ acpi_get_register(ACPI_BITREG_BUS_MASTER_STATUS,
+ &bm_status, ACPI_MTX_DO_NOT_LOCK);
+ if (bm_status) {
+ acpi_set_register(ACPI_BITREG_BUS_MASTER_STATUS,
+ 1, ACPI_MTX_DO_NOT_LOCK);
+ return 1;
+ }
+ /*
+ * PIIX4 Erratum #18: Note that BM_STS doesn't always reflect
+ * the true state of bus mastering activity; forcing us to
+ * manually check the BMIDEA bit of each IDE channel.
+ */
+ else if (errata.piix4.bmisx) {
+ if ((inb_p(errata.piix4.bmisx + 0x02) & 0x01)
+ || (inb_p(errata.piix4.bmisx + 0x0A) & 0x01))
+ return 1;
+ }
+
+ return 0;
+}

+/**
+ * acpi_processor_idle - the main ACPI idle loop
+ *
+ * This function determines and enters the most appropriate ACPI c-state based
+ * on current system conditions.
+ */
static void acpi_processor_idle(void)
{
struct acpi_processor *pr = NULL;
struct acpi_processor_cx *cx = NULL;
- struct acpi_processor_cx *next_state = NULL;
- int sleep_ticks = 0;
- u32 t1, t2 = 0;
+ u32 sleep_ticks, state_idx, t1, t2, i;

pr = processors[smp_processor_id()];
if (!pr)
return;

/*
- * Interrupts must be disabled during bus mastering calculations and
- * for C2/C3 transitions.
- */
- local_irq_disable();
-
- /*
- * Check whether we truly need to go idle, or should
- * reschedule:
- */
- if (unlikely(need_resched())) {
- local_irq_enable();
- return;
- }
-
- cx = pr->power.state;
- if (!cx) {
- if (pm_idle_save)
- pm_idle_save();
- else
- acpi_safe_halt();
- return;
- }
-
- /*
- * Check BM Activity
- * -----------------
- * Check for bus mastering activity (if required), record, and check
- * for demotion.
- */
- if (pr->flags.bm_check) {
- u32 bm_status = 0;
- unsigned long diff = jiffies - pr->power.bm_check_timestamp;
-
- if (diff > 31)
- diff = 31;
-
- pr->power.bm_activity <<= diff;
-
- acpi_get_register(ACPI_BITREG_BUS_MASTER_STATUS,
- &bm_status, ACPI_MTX_DO_NOT_LOCK);
- if (bm_status) {
- pr->power.bm_activity |= 0x1;
- acpi_set_register(ACPI_BITREG_BUS_MASTER_STATUS,
- 1, ACPI_MTX_DO_NOT_LOCK);
+ * We assume there's a good chance the idle conditions will be similar
+ * to those before we scheduled work. Therefore, the next state is
+ * determined by the idle ticks of the last sleep state entered.
+ */
+ sleep_ticks = pr->power.last_ticks;
+ state_idx = pr->power.count;
+
+ /*
+ * We also clear BM_STS, as it may have been a while since we last
+ * checked it.
+ */
+ acpi_set_register(ACPI_BITREG_BUS_MASTER_STATUS,
+ 1, ACPI_MTX_DO_NOT_LOCK);
+
+ while (!need_resched()) {
+ int count = min(pr->power.count, (int) max_cstate);
+ cx = &pr->power.states[state_idx];
+
+ if (cx->target_ticks < sleep_ticks) { /* promotion */
+ for (i = state_idx + 1; i <= count; i++) {
+ cx = &pr->power.states[i];
+ if (!cx->valid)
+ continue;
+ state_idx = i;
+ if (cx->target_ticks >= sleep_ticks)
+ break;
+ }
+ } else { /* demotion */
+ for (i = state_idx - 1; i > 0; i--) {
+ cx = &pr->power.states[i];
+ if (!cx->valid)
+ continue;
+ state_idx = i;
+ if (cx->target_ticks < sleep_ticks)
+ break;
+ }
}
+
/*
- * PIIX4 Erratum #18: Note that BM_STS doesn't always reflect
- * the true state of bus mastering activity; forcing us to
- * manually check the BMIDEA bit of each IDE channel.
+ * Interrupts must be disabled during bus mastering
+ * calculations and for C-state transitions.
*/
- else if (errata.piix4.bmisx) {
- if ((inb_p(errata.piix4.bmisx + 0x02) & 0x01)
- || (inb_p(errata.piix4.bmisx + 0x0A) & 0x01))
- pr->power.bm_activity |= 0x1;
- }
+ local_irq_disable();

- pr->power.bm_check_timestamp = jiffies;
+ if (unlikely(need_resched())) {
+ local_irq_enable();
+ return;
+ }

/*
- * If bus mastering is or was active this jiffy, demote
- * to avoid a faulty transition. Note that the processor
- * won't enter a low-power state during this call (to this
- * function) but should upon the next.
- *
- * TBD: A better policy might be to fallback to the demotion
- * state (use it for this quantum only) istead of
- * demoting -- and rely on duration as our sole demotion
- * qualification. This may, however, introduce DMA
- * issues (e.g. floppy DMA transfer overrun/underrun).
+ * Check bus master status, if active ensure we enter a state
+ * that allows bus master transactions.
*/
- if ((pr->power.bm_activity & 0x1) &&
- cx->demotion.threshold.bm) {
- local_irq_enable();
- next_state = cx->demotion.state;
- goto end;
+ if (pr->flags.bm_check && acpi_check_bm_status()) {
+ pr->power.bm_activity++;
+ state_idx = min(state_idx, pr->power.bm_veto_state);
}
- }

#ifdef CONFIG_HOTPLUG_CPU
- /*
- * Check for P_LVL2_UP flag before entering C2 and above on
- * an SMP system. We do it here instead of doing it at _CST/P_LVL
- * detection phase, to work cleanly with logical CPU hotplug.
- */
- if ((cx->type != ACPI_STATE_C1) && (num_online_cpus() > 1) &&
- !pr->flags.has_cst && !acpi_fadt.plvl2_up)
- cx = &pr->power.states[ACPI_STATE_C1];
+ /*
+ * Check for P_LVL2_UP flag before entering C2 and above on
+ * an SMP system. We do it here instead of doing it at _CST/P_LVL
+ * detection phase, to work cleanly with logical CPU hotplug.
+ */
+ if ((cx->type != ACPI_STATE_C1) && (num_online_cpus() > 1) &&
+ !pr->flags.has_cst && !acpi_fadt.plvl2_up)
+ state_idx = ACPI_STATE_C1;
#endif

- /*
- * Sleep:
- * ------
- * Invoke the current Cx state to put the processor to sleep.
- */
- if (cx->type == ACPI_STATE_C2 || cx->type == ACPI_STATE_C3) {
+ cx = &pr->power.states[state_idx];
+
+ acpi_processor_power_activate(pr, cx);
+
current_thread_info()->status &= ~TS_POLLING;
smp_mb__after_clear_bit();
+
if (need_resched()) {
current_thread_info()->status |= TS_POLLING;
local_irq_enable();
return;
}
- }
-
- switch (cx->type) {

- case ACPI_STATE_C1:
- /*
- * Invoke C1.
- * Use the appropriate idle routine, the one that would
- * be used without acpi C-states.
- */
- if (pm_idle_save)
- pm_idle_save();
- else
- acpi_safe_halt();
-
- /*
- * TBD: Can't get time duration while in C1, as resumes
- * go to an ISR rather than here. Need to instrument
- * base interrupt handler.
- */
- sleep_ticks = 0xFFFFFFFF;
- break;
+ if (cx->type == ACPI_STATE_C1) { /* enter C1 */
+ safe_halt();
+ /*
+ * TBD: Can't get time duration while in C1, as resumes
+ * go to an ISR rather than here. Need to instrument
+ * base interrupt handler.
+ */
+ sleep_ticks = cx->target_ticks + 1;
+ } else { /* enter C2 or C3 */
+ if (cx->type == ACPI_STATE_C3) {
+ if (pr->flags.bm_check) {
+ if (atomic_inc_return(&c3_cpu_count) ==
+ num_online_cpus()) {
+ /*
+ * All CPUs are trying to go to C3
+ * Disable bus master arbitration
+ */
+ acpi_set_register(ACPI_BITREG_ARB_DISABLE, 1,
+ ACPI_MTX_DO_NOT_LOCK);
+ }
+ } else {
+ /* SMP with no shared cache... Invalidate cache */
+ ACPI_FLUSH_CPU_CACHE();
+ }
+ }

- case ACPI_STATE_C2:
- /* Get start time (ticks) */
- t1 = inl(acpi_fadt.xpm_tmr_blk.address);
- /* Invoke C2 */
- inb(cx->address);
- /* Dummy wait op - must do something useless after P_LVL2 read
- because chipsets cannot guarantee that STPCLK# signal
- gets asserted in time to freeze execution properly. */
- t2 = inl(acpi_fadt.xpm_tmr_blk.address);
- /* Get end time (ticks) */
- t2 = inl(acpi_fadt.xpm_tmr_blk.address);
+ /* Get start time (ticks) */
+ t1= inl(acpi_fadt.xpm_tmr_blk.address);
+ /* invoke the target C-state */
+ inb(cx->address);
+ /* Dummy wait op - must do something useless after P_LVL2/3 read
+ because chipsets cannot guarantee that STPCLK# signal gets
+ asserted in time to freeze execution properly. */
+ t2 = inl(acpi_fadt.xpm_tmr_blk.address);
+ /* Get end time (ticks) */
+ t2 = inl(acpi_fadt.xpm_tmr_blk.address);
+
+ if (cx->type == ACPI_STATE_C3 && pr->flags.bm_check) {
+ /* Enable bus master arbitration */
+ atomic_dec(&c3_cpu_count);
+ acpi_set_register(ACPI_BITREG_ARB_DISABLE, 0,
+ ACPI_MTX_DO_NOT_LOCK);
+ }

#ifdef CONFIG_GENERIC_TIME
- /* TSC halts in C2, so notify users */
- mark_tsc_unstable();
+ /* TSC halts, so notify users */
+ mark_tsc_unstable();
#endif
- /* Re-enable interrupts */
- local_irq_enable();
- current_thread_info()->status |= TS_POLLING;
- /* Compute time (ticks) that we were actually asleep */
- sleep_ticks =
- ticks_elapsed(t1, t2) - cx->latency_ticks - C2_OVERHEAD;
- break;

- case ACPI_STATE_C3:
-
- if (pr->flags.bm_check) {
- if (atomic_inc_return(&c3_cpu_count) ==
- num_online_cpus()) {
- /*
- * All CPUs are trying to go to C3
- * Disable bus master arbitration
- */
- acpi_set_register(ACPI_BITREG_ARB_DISABLE, 1,
- ACPI_MTX_DO_NOT_LOCK);
- }
- } else {
- /* SMP with no shared cache... Invalidate cache */
- ACPI_FLUSH_CPU_CACHE();
+ /* Compute time (ticks) that we were actually asleep */
+ sleep_ticks = ticks_elapsed(t1, t2);
}

- /* Get start time (ticks) */
- t1 = inl(acpi_fadt.xpm_tmr_blk.address);
- /* Invoke C3 */
- inb(cx->address);
- /* Dummy wait op (see above) */
- t2 = inl(acpi_fadt.xpm_tmr_blk.address);
- /* Get end time (ticks) */
- t2 = inl(acpi_fadt.xpm_tmr_blk.address);
- if (pr->flags.bm_check) {
- /* Enable bus master arbitration */
- atomic_dec(&c3_cpu_count);
- acpi_set_register(ACPI_BITREG_ARB_DISABLE, 0,
- ACPI_MTX_DO_NOT_LOCK);
- }
-
-#ifdef CONFIG_GENERIC_TIME
- /* TSC halts in C3, so notify users */
- mark_tsc_unstable();
-#endif
- /* Re-enable interrupts */
local_irq_enable();
current_thread_info()->status |= TS_POLLING;
- /* Compute time (ticks) that we were actually asleep */
- sleep_ticks =
- ticks_elapsed(t1, t2) - cx->latency_ticks - C3_OVERHEAD;
- break;

- default:
- local_irq_enable();
- return;
- }
- cx->usage++;
- if ((cx->type != ACPI_STATE_C1) && (sleep_ticks > 0))
+ cx->usage++;
cx->time += sleep_ticks;
-
- next_state = pr->power.state;
-
-#ifdef CONFIG_HOTPLUG_CPU
- /* Don't do promotion/demotion */
- if ((cx->type == ACPI_STATE_C1) && (num_online_cpus() > 1) &&
- !pr->flags.has_cst && !acpi_fadt.plvl2_up) {
- next_state = cx;
- goto end;
- }
-#endif
-
- /*
- * Promotion?
- * ----------
- * Track the number of longs (time asleep is greater than threshold)
- * and promote when the count threshold is reached. Note that bus
- * mastering activity may prevent promotions.
- * Do not promote above max_cstate.
- */
- if (cx->promotion.state &&
- ((cx->promotion.state - pr->power.states) <= max_cstate)) {
- if (sleep_ticks > cx->promotion.threshold.ticks) {
- cx->promotion.count++;
- cx->demotion.count = 0;
- if (cx->promotion.count >=
- cx->promotion.threshold.count) {
- if (pr->flags.bm_check) {
- if (!
- (pr->power.bm_activity & cx->
- promotion.threshold.bm)) {
- next_state =
- cx->promotion.state;
- goto end;
- }
- } else {
- next_state = cx->promotion.state;
- goto end;
- }
- }
- }
- }
-
- /*
- * Demotion?
- * ---------
- * Track the number of shorts (time asleep is less than time threshold)
- * and demote when the usage threshold is reached.
- */
- if (cx->demotion.state) {
- if (sleep_ticks < cx->demotion.threshold.ticks) {
- cx->demotion.count++;
- cx->promotion.count = 0;
- if (cx->demotion.count >= cx->demotion.threshold.count) {
- next_state = cx->demotion.state;
- goto end;
- }
- }
}

- end:
- /*
- * Demote if current state exceeds max_cstate
- */
- if ((pr->power.state - pr->power.states) > max_cstate) {
- if (cx->demotion.state)
- next_state = cx->demotion.state;
- }
-
- /*
- * New Cx State?
- * -------------
- * If we're going to start using a new Cx state we must clean up
- * from the previous and prepare to use the new.
- */
- if (next_state != pr->power.state)
- acpi_processor_power_activate(pr, next_state);
+ pr->power.last_ticks = sleep_ticks;
}

+/**
+ * acpi_processor_set_power_policy - sets the default idle policy
+ * @pr: the processor
+ *
+ * This function sets the default Cx state policy (OS idle handler).
+ * Note that the Cx state policy is completely customizable and can
+ * be altered dynamically.
+ */
static int acpi_processor_set_power_policy(struct acpi_processor *pr)
{
unsigned int i;
unsigned int state_is_set = 0;
- struct acpi_processor_cx *lower = NULL;
- struct acpi_processor_cx *higher = NULL;
struct acpi_processor_cx *cx;

-
if (!pr)
return -EINVAL;

- /*
- * This function sets the default Cx state policy (OS idle handler).
- * Our scheme is to promote quickly to C2 but more conservatively
- * to C3. We're favoring C2 for its characteristics of low latency
- * (quick response), good power savings, and ability to allow bus
- * mastering activity. Note that the Cx state policy is completely
- * customizable and can be altered dynamically.
- */
-
/* startup state */
for (i = 1; i < ACPI_PROCESSOR_MAX_POWER; i++) {
cx = &pr->power.states[i];
@@ -546,41 +430,31 @@
if (!state_is_set)
return -ENODEV;

- /* demotion */
- for (i = 1; i < ACPI_PROCESSOR_MAX_POWER; i++) {
+ state_is_set = 0;
+
+ /* find deepest bus master compatible state */
+ for (i = (ACPI_PROCESSOR_MAX_POWER - 1); i > 0; i--) {
cx = &pr->power.states[i];
if (!cx->valid)
continue;
+ if (cx->type == ACPI_STATE_C3)
+ continue;

- if (lower) {
- cx->demotion.state = lower;
- cx->demotion.threshold.ticks = cx->latency_ticks;
- cx->demotion.threshold.count = 1;
- if (cx->type == ACPI_STATE_C3)
- cx->demotion.threshold.bm = bm_history;
- }
-
- lower = cx;
+ pr->power.bm_veto_state = i;
+ state_is_set = 1;
+ break;
}

- /* promotion */
- for (i = (ACPI_PROCESSOR_MAX_POWER - 1); i > 0; i--) {
+ if (!state_is_set)
+ return -ENODEV;
+
+ /* determine target sleep ticks */
+ for (i = 1; i < ACPI_PROCESSOR_MAX_POWER; i++) {
cx = &pr->power.states[i];
if (!cx->valid)
continue;

- if (higher) {
- cx->promotion.state = higher;
- cx->promotion.threshold.ticks = cx->latency_ticks;
- if (cx->type >= ACPI_STATE_C2)
- cx->promotion.threshold.count = 4;
- else
- cx->promotion.threshold.count = 10;
- if (higher->type == ACPI_STATE_C3)
- cx->promotion.threshold.bm = bm_history;
- }
-
- higher = cx;
+ cx->target_ticks = cx->latency_ticks * RESIDENCY_TO_LATENCY_RATIO;
}

return 0;
@@ -1009,7 +883,7 @@

seq_printf(seq, "active state: C%zd\n"
"max_cstate: C%d\n"
- "bus master activity: %08x\n",
+ "bus master activity: %d\n",
pr->power.state ? pr->power.state - pr->power.states : 0,
max_cstate, (unsigned)pr->power.bm_activity);

@@ -1040,20 +914,6 @@
break;
}

- if (pr->power.states[i].promotion.state)
- seq_printf(seq, "promotion[C%zd] ",
- (pr->power.states[i].promotion.state -
- pr->power.states));
- else
- seq_puts(seq, "promotion[--] ");
-
- if (pr->power.states[i].demotion.state)
- seq_printf(seq, "demotion[C%zd] ",
- (pr->power.states[i].demotion.state -
- pr->power.states));
- else
- seq_puts(seq, "demotion[--] ");
-
seq_printf(seq, "latency[%03d] usage[%08d] duration[%020llu]\n",
pr->power.states[i].latency,
pr->power.states[i].usage,
--- a/include/acpi/processor.h 2006-08-28 17:14:40.000000000 -0400
+++ b/include/acpi/processor.h 2006-08-28 16:37:35.000000000 -0400
@@ -43,17 +43,6 @@
u64 address;
} __attribute__ ((packed));

-struct acpi_processor_cx_policy {
- u32 count;
- struct acpi_processor_cx *state;
- struct {
- u32 time;
- u32 ticks;
- u32 count;
- u32 bm;
- } threshold;
-};
-
struct acpi_processor_cx {
u8 valid;
u8 type;
@@ -63,15 +52,14 @@
u32 power;
u32 usage;
u64 time;
- struct acpi_processor_cx_policy promotion;
- struct acpi_processor_cx_policy demotion;
+ u32 target_ticks;
};

struct acpi_processor_power {
struct acpi_processor_cx *state;
- unsigned long bm_check_timestamp;
- u32 default_state;
u32 bm_activity;
+ u32 bm_veto_state;
+ u32 last_ticks;
int count;
struct acpi_processor_cx states[ACPI_PROCESSOR_MAX_POWER];
};



2006-08-30 18:48:46

by Pallipadi, Venkatesh

Subject: RE: [RFC][PATCH 1/2] ACPI: Idle Processor PM Improvements



>-----Original Message-----
>From: [email protected]
>[mailto:[email protected]] On Behalf Of Adam Belay
>Sent: Tuesday, August 29, 2006 1:51 PM
>To: Brown, Len
>Cc: ACPI ML; Linux Kernel ML; Dominik Brodowski; Arjan van de Ven
>Subject: [RFC][PATCH 1/2] ACPI: Idle Processor PM Improvements
>
>Hi All,
>
>This patch improves the ACPI c-state selection algorithm. It also
>includes a major cleanup and simplification of the processor idle code.
>
>The new implementation considers the full menu of available c-states.
>Just as in the previous implementation, decisions are primarily based on
>the residency time of the last c-state entry. This is generally an
>effective metric because it allows for detection of interrupt activity.
>However, the new algorithm differs in that it does not promote
>or demote
>through the c-states in succession. Rather, it immediately jumps to
>whatever c-state has the best expected power consumption advantage for
>the predicted residency time (i.e. the previously measured residency).
>If the residency time is too short during a deep c-state
>entry, then the
>cost of entering the state outweighs any power consumption advantage.
>Similarly, if a shallow c-state is entered and resident for an
>excessively long duration, then a potential opportunity to save more
>power is missed.
>
>The changes in this patch allow the ACPI idle processor mechanism to
>react more quickly to sudden bursts of activity because it can jump
>directly to whatever c-state is appropriate. However, because of the
>"menu" nature of c-state selection, the code works best when ACPI
>implementations expose all of the c-states supported by hardware.
>
>The bus master activity mechanism has undergone similar improvements.
>During capability detection, the deepest c-state that allows bus master
>activity is determined. BM_STS is then polled each time the ACPI code
>prepares to enter a c-state. If bus master activity is detected, then
>the previously mentioned bus master capable c-state becomes the deepest
>c-state allowed for that quantum. In contrast, the old implementation
>would permit bus master activity to cause a demotion from one C3-type
>state to the next shallower C3-type state, imposing
>unnecessary latency.
>As a further optimization, BM_STS is cleared each time
>acpi_processor_idle() is entered. This prevents any stale bus master
>status from affecting c-state policy, as it may have occurred long ago
>during scheduled work.
>
>Finally, it's worth mentioning that the bulk of c-state policy
>calculations have been moved to take place before c-states are entered.
>This should further reduce exit latency when returning from a c-state.
>
>This algorithm has not yet been carefully benchmarked (e.g. bltk or
>power meters). However, I can say with some confidence that it saves a
>small amount more power during an idle workload and a larger
>amount more
>power during typical user-input oriented workloads such as word
>processing.
>
>I would really appreciate any comments, suggestions, or testing.
>

Nice changes. Will test and let you know how it goes.

While we are cleaning up the code, I think it would be much better to
move the C-state policy out of this ACPI code altogether. We should have
just a generic interface, where any low level driver (acpi) can
register/unregister an idle routine with latency, power and other
characteristics (BM_STS). That way the policy can be generic and
out of ACPI code. We had a patch earlier that does something like this
here:
http://www.mail-archive.com/[email protected]/msg00129.html
http://www.mail-archive.com/[email protected]/msg00130.html
But that did not go anywhere at the time. Perhaps we can do some
cleanup like that, along with this patch.
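
To make the proposal concrete, here is a minimal userspace sketch of
what such a registration interface might look like. Everything in it is
hypothetical: `struct idle_state`, `idle_register_state()`,
`idle_select()`, `IDLE_FLAG_BM_SAFE` and `idle_noop_enter()` are invented
for illustration and are not taken from the patches linked above or from
any kernel API. Unregistration is omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define IDLE_MAX_STATES 8
/* Hypothetical flag: the state tolerates bus master activity. */
#define IDLE_FLAG_BM_SAFE 0x1

/* What a low level driver would describe about one idle state. */
struct idle_state {
	const char *name;
	unsigned int exit_latency_us; /* worst-case wakeup latency */
	unsigned int power_mw;        /* power drawn while resident */
	unsigned int flags;
	void (*enter)(void);          /* hardware-specific entry routine */
};

static struct idle_state idle_table[IDLE_MAX_STATES];
static size_t idle_count;

/* Placeholder entry routine standing in for real hardware code. */
void idle_noop_enter(void)
{
}

/* Register a state; returns its index, or -1 if it cannot be added. */
int idle_register_state(const struct idle_state *st)
{
	assert(st != NULL);
	if (idle_count >= IDLE_MAX_STATES || st->enter == NULL)
		return -1;
	idle_table[idle_count] = *st;
	return (int)idle_count++;
}

/*
 * Generic policy: among the registered states, pick the lowest-power
 * one whose exit latency fits the budget and, when bus master activity
 * is present, that is marked bus master safe. Returns NULL if none fits.
 */
const struct idle_state *idle_select(unsigned int max_latency_us,
				     int bm_active)
{
	const struct idle_state *best = NULL;

	for (size_t i = 0; i < idle_count; i++) {
		const struct idle_state *st = &idle_table[i];

		if (st->exit_latency_us > max_latency_us)
			continue;
		if (bm_active && !(st->flags & IDLE_FLAG_BM_SAFE))
			continue;
		if (best == NULL || st->power_mw < best->power_mw)
			best = st;
	}
	return best;
}
```

The point of the split is that `idle_select()` knows nothing about ACPI:
a platform driver for hardware without ACPI could register its own
states and reuse the same policy.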

Thanks,
Venki

2006-08-30 19:44:08

by Matthew Garrett

Subject: Re: [RFC][PATCH 1/2] ACPI: Idle Processor PM Improvements

On Wed, Aug 30, 2006 at 11:40:16AM -0700, Pallipadi, Venkatesh wrote:

(Added [email protected] to the Cc:)

> While we are cleaning up the code, I think it would be much better to
> move the C-state policy out of this ACPI code altogether. We should have

That would be helpful. For the One Laptop Per Child project (or whatever
it's called today), it would be advantageous to run without acpi. At the
moment that would cost us deeper C states, so an interface to allow a
platform driver to register and provide the same functionality without
code duplication would be helpful.

--
Matthew Garrett | [email protected]

2006-08-31 23:12:50

by Bjorn Helgaas

Subject: Re: [RFC][PATCH 1/2] ACPI: Idle Processor PM Improvements

On Wednesday 30 August 2006 13:43, Matthew Garrett wrote:
> That would be helpful. For the One Laptop Per Child project (or whatever
> it's called today), it would be advantageous to run without acpi.

Out of curiosity, what is the motivation for running without acpi?
It costs a lot to diverge from the mainstream in areas like that,
so there must be a big payoff. But maybe if OLPC depends on acpi
being smarter about power or code size or whatever, those improvements
could be made and everybody would benefit.

2006-09-01 00:30:34

by Jim Gettys

Subject: Re: [OLPC-devel] Re: [RFC][PATCH 1/2] ACPI: Idle Processor PM Improvements

On Thu, 2006-08-31 at 17:13 -0600, Bjorn Helgaas wrote:
> On Wednesday 30 August 2006 13:43, Matthew Garrett wrote:
> > That would be helpful. For the One Laptop Per Child project (or whatever
> > it's called today), it would be advantageous to run without acpi.
>
> Out of curiosity, what is the motivation for running without acpi?
> It costs a lot to diverge from the mainstream in areas like that,
> so there must be a big payoff. But maybe if OLPC depends on acpi
> being smarter about power or code size or whatever, those improvements
> could be made and everybody would benefit.

Good question; I see Matthew beat me to part of the explanation, but
here is more detail:

Our screen consumes on the order of 1/10th the power of a conventional flat
panel, and can consume a half watt or so (yes, we now have working
screens; this is not mythological hardware; I got my own personal first
hand look at a prototype display running this afternoon :-); I always do
like new toys...).

Even though the base machine may take only a couple watts of power
(Geode GX + the rest of the base logic), 2-3 watts is too much power to
use; a small child can generate only 7-10 watts. So if we want a decent
"learn" to "generate" ratio, we have to do better than the 2-4 to 1
ratio we might get conventionally. In January, we saw this staring us
in the face, and knew we had to do better, or we'd have just told a good
fraction of the kids in the world they can't have the advantages of a
computer. Our goal has always been a 10 to 1 ratio, for at least the
most important use cases (e.g. reading).
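
The arithmetic behind these ratios can be written out as a small C
sketch. It assumes that the "learn" to "generate" ratio means generated
power divided by consumed power, which is an interpretation of the text
rather than something stated in it; the function names are hypothetical.

```c
#include <assert.h>

/*
 * Minutes of use obtained per minute of human power generation,
 * assuming the ratio is simply generated power over consumed power.
 */
double use_to_generate_ratio(double generated_w, double consumed_w)
{
	assert(consumed_w > 0.0);
	return generated_w / consumed_w;
}

/* Largest power draw (in watts) that still meets a target ratio. */
double max_consumption_w(double generated_w, double target_ratio)
{
	assert(target_ratio > 0.0);
	return generated_w / target_ratio;
}
```

Under that reading, 10 watts generated against 2.5 watts consumed gives
a ratio of 4 to 1, and a 10 to 1 target with 7 to 10 watts of generation
requires the machine to draw roughly 0.7 to 1 watt.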

OK, what to do? We built a chip that lets us suspend the processor and
keep the screen alive, and chose a wireless chip that will let us keep
the mesh network alive, and we intend to suspend/resume the processor
to/from RAM at the drop of a hat. This gets our idle consumption from
about 2.5-3 watts (with screen and wireless on), to under one watt.
We'd need resume to be as close to imperceptible as possible; touch a
key or the touchpad, and the machine resumes so fast that you don't notice.

In short, we have novel hardware: we can have our screen on, and suspend
the processor to RAM, and use a half a watt. We can have our wireless
forwarding packets in our mesh networks, with the processor suspended,
consuming under 400 mW (we hope 300 mW by the time we ship). Both on, and
we're still under one watt.

For keyboard activity, human perception is in the 100-200 millisecond
range; for some other stuff, it is even less than that. So that's
the necessity; now the invention.

I've done a straw poll among kernel gurus at OLS and elsewhere on how
fast Linux might be able to resume. I've gotten answers of typically
"one second".

But, on other platforms (see attached), I have data I've measured myself
showing Linux going from resume from RAM to *scheduling user level
processes* 100 times faster than that, on a wimpy 200 MHz ARM processor.
Yes, Matilda, Linux can, on non-braindead hardware, resume all the way
to scheduling user processes in 10 milliseconds on a 200 MHz processor.

This will, for most use cases (you are reading, or your machine is
sitting there between bursts of activity), likely double / triple /
quadruple our battery life depending on what you are doing. Note that
on a conventional machine, with a conventional display, you'd not see
this large an improvement. Worst case, of course, it will make no
difference at all (e.g. watching a video).

Clearly we can't do any better than what our hardware allows
(stabilization of power supplies, PLL's, etc). I should have data on
that very shortly, now that I can measure it on LinuxBIOS pretty
directly. For those of you building chips and systems: please make the
hardware restart time as fast as possible: it matters. The CPU doesn't
have to go full speed instantly; just get it going at some speed as
quickly as you can.

Conventional PC's with conventional BIOS's using ACPI don't do anything
like as well. So, guess what? We don't plan to use a conventional
commercial BIOS, (we're using LinuxBIOS and Linux as Bootloader) and
will do whatever it takes (including ignoring however much of ACPI turns
out to be necessary) to get our resume down to what we know is possible.
ACPI is mostly an x86 aberration; on most architectures it does not
exist. So it does not require contorting Linux to not use ACPI, to the
extent we find it necessary. Most of *real* power management is done by
Linux, and not by ACPI.

Boy, human powered machines really *do* focus the mind on power
management ;-).
Regards,
- Jim Gettys

--
Jim Gettys
One Laptop Per Child


Attachments:
(No filename) (5.27 kB)
Attached message - Linux resume time on iPAQ (Linux resume can be *really* fast).

2006-09-01 03:52:09

by Brown, Len

Subject: Re: [OLPC-devel] Re: [RFC][PATCH 1/2] ACPI: Idle Processor PM Improvements

On Thursday 31 August 2006 20:30, Jim Gettys wrote:
> On Thu, 2006-08-31 at 17:13 -0600, Bjorn Helgaas wrote:
> > On Wednesday 30 August 2006 13:43, Matthew Garrett wrote:
> > > That would be helpful. For the One Laptop Per Child project (or whatever
> > > it's called today), it would be advantageous to run without acpi.
> >
> > Out of curiosity, what is the motivation for running without acpi?
> > It costs a lot to diverge from the mainstream in areas like that,
> > so there must be a big payoff. But maybe if OLPC depends on acpi
> > being smarter about power or code size or whatever, those improvements
> > could be made and everybody would benefit.
>
> Good question; I see Matthew beat me to part of the explanation, but
> here is more detail:

I recommended that the OLPC guys not use ACPI.

I do not think it would benefit their system. Although it uses the i386
instruction set, their system is more like an embedded device than
like a traditional laptop.

The Geode doesn't support any C-states -- so ACPI wouldn't help them there anyway.

As Jim wrote, OLPC plans to suspend-to-ram from idle, and to keep video running,
so ACPI wouldn't help them on that either.

Re: optimizing suspend/resume speed
I expect suspend/resume speed has more to do with devices than with ACPI.
But frankly, with gaping functionality holes in Linux suspend/resume support such as
IDE and SATA, I think that optimizing for suspend/resume speed on a mainstream laptop
is somewhat "forward looking".

-Len

2006-09-01 04:12:36

by Matthew Garrett

Subject: Re: [OLPC-devel] Re: [RFC][PATCH 1/2] ACPI: Idle Processor PM Improvements

On Thu, Aug 31, 2006 at 11:53:04PM -0400, Len Brown wrote:

> The Geode doesn't support any C-states -- so ACPI wouldn't help them there anyway.

Are you sure of that? The docs I have here suggest C1 and C2, but it's
possible that that's just the companion chip and they aren't implemented
in the CPU.

--
Matthew Garrett

2006-09-01 13:15:07

by Carl-Daniel Hailfinger

Subject: Re: [OLPC-devel] Re: [RFC][PATCH 1/2] ACPI: Idle Processor PM Improvements

Len Brown wrote:
>
> But frankly, with gaping functionality holes in Linux suspend/resume support such as
> IDE and SATA, I think that optimizing for suspend/resume speed on a mainstream laptop
> is somewhat "forward looking".

OLPC has no IDE/SATA devices, just 512 MB of onboard NAND flash.

Regards,
Carl-Daniel
--
http://www.hailfinger.org/

2006-09-01 15:51:18

by Jordan Crouse

Subject: Re: ACPI: Idle Processor PM Improvements

On 01/09/06 05:12 +0100, Matthew Garrett wrote:
> On Thu, Aug 31, 2006 at 11:53:04PM -0400, Len Brown wrote:
>
> > The Geode doesn't support any C-states -- so ACPI wouldn't help them there anyway.
>
> Are you sure of that? The docs I have here suggest C1 and C2, but it's
> possible that that's just the companion chip and they aren't implemented
> in the CPU.

C1 is essentially suspend on hlt. We have something called Automatic Hardware
Clock Gating that kicks in when the blocks go unused, so that saves a bit
more power (especially in the south bridge) than we would with just a simple
hlt. In any event, this already happens without the assistance of ACPI.

The 5536 has support for a C2 state as well, but I don't know if that
has any effect on the GX or not.

Jordan

--
Jordan Crouse
Senior Linux Engineer
Advanced Micro Devices, Inc.
<http://www.amd.com/embeddedprocessors>


2006-09-01 21:52:45

by Andi Kleen

Subject: Re: [OLPC-devel] Re: [RFC][PATCH 1/2] ACPI: Idle Processor PM Improvements


Len Brown writes:
>
> Re: optimizing suspend/resume speed
> I expect suspend/resume speed has more to do with devices than with ACPI.
> But frankly, with gaping functionality holes in Linux suspend/resume support such as
> IDE and SATA, I think that optimizing for suspend/resume speed on a mainstream laptop
> is somewhat "forward looking".

What are these gaping holes? SATA seems to work at least on many
drivers with an out-of-tree patch (that will hopefully be merged soon).
And IDE mostly works too except for HPA on thinkpads (which can be
disabled in the BIOS). While certainly not perfect it doesn't seem
that bad to me.

-Andi

2006-09-01 22:36:42

by Alan

Subject: Re: [OLPC-devel] Re: [RFC][PATCH 1/2] ACPI: Idle Processor PM Improvements

Ar Gwe, 2006-09-01 am 23:52 +0200, ysgrifennodd Andi Kleen:
> What are these gaping holes? SATA seems to work at least on many
> drivers with an out-of-tree patch (that will hopefully be merged soon).

SATA ought to be pretty good now.

> And IDE mostly works too except for HPA on thinkpads (which can be
> disabled in the BIOS). While certainly not perfect it doesn't seem
> that bad to me.

IDE also fails for various chipsets where PLLs need a recalibration or
setup needs redoing, and some users report things like floating IRQ 14
hangs on suspend or resume.

HPA now has a -mm proposed patch.

Alan

2006-09-04 13:00:38

by Pavel Machek

Subject: Re: [RFC][PATCH 1/2] ACPI: Idle Processor PM Improvements

Hi!

> This patch improves the ACPI c-state selection algorithm. It also
> includes a major cleanup and simplification of the processor idle code.

Nice!

> @@ -1009,7 +883,7 @@
>
> seq_printf(seq, "active state: C%zd\n"
> "max_cstate: C%d\n"
> - "bus master activity: %08x\n",
> + "bus master activity: %d\n",
> pr->power.state ? pr->power.state - pr->power.states : 0,
> max_cstate, (unsigned)pr->power.bm_activity);
>

This changes kernel - user interface. You should change the field
description, or keep it in hex...

BTW, will you be at September's labs conference?

Pavel
--
Thanks for all the (sleeping) penguins.

2006-09-04 13:09:53

by Pavel Machek

Subject: Re: [OLPC-devel] Re: [RFC][PATCH 1/2] ACPI: Idle Processor PM Improvements

Hi!

> In short, we have novel hardware: we can have our screen on, and suspend
> the processor to RAM, and use a half a watt. We can have our wireless
> forwarding packets in our mesh networks, with the processor suspended,
> consuming under 400mw (we hope 300mw by the time we ship). Both on, and
> we're still under one watt.
>
> For keyboard activity, human perception is in the 100-200 millisecond
> range; for some other stuff, it is even less than that. So that's
> the necessity; now the invention.
>
> I've done a straw poll among kernel gurus at OLS and elsewhere on how
> fast Linux might be able to resume. I've gotten answers of typically
> "one second".
>
> But, on other platforms (see attached), I have data I've measured myself
> showing Linux going from resume from RAM to *scheduling user level
> processes* 100 times faster than that, on a wimpy 200mhz ARM processor.
> Yes, Matilda, Linux can, on non-braindead hardware, resume all the way
> to scheduling user processes in 10 milliseconds on a 200mhz processor.

2.4 and 2.6 are *very* different here. You'll probably need to optimize freezer
in 2.6 a bit...
Pavel
--
Thanks for all the (sleeping) penguins.

2006-09-04 13:13:48

by Pavel Machek

Subject: Re: [OLPC-devel] Re: [RFC][PATCH 1/2] ACPI: Idle Processor PM Improvements

On Thu 31-08-06 23:53:04, Len Brown wrote:
> On Thursday 31 August 2006 20:30, Jim Gettys wrote:
> > On Thu, 2006-08-31 at 17:13 -0600, Bjorn Helgaas wrote:
> > > On Wednesday 30 August 2006 13:43, Matthew Garrett wrote:
> > > > That would be helpful. For the One Laptop Per Child project (or whatever
> > > > it's called today), it would be advantageous to run without acpi.
> > >
> > > Out of curiosity, what is the motivation for running without acpi?
> > > It costs a lot to diverge from the mainstream in areas like that,
> > > so there must be a big payoff. But maybe if OLPC depends on acpi
> > > being smarter about power or code size or whatever, those improvements
> > > could be made and everybody would benefit.
> >
> > Good question; I see Matthew beat me to part of the explanation, but
> > here is more detail:
>
> I recommended that the OLPC guys not use ACPI.
>
> I do not think it would benefit their system. Although it is an i386
> instruction set, their system is more like an embedded device than
> like a traditional laptop.
>
> The Geode doesn't support any C-states -- so ACPI wouldn't help them there anyway.
>
> As Jim wrote, OLPC plans to suspend-to-ram from idle, and to keep video running,
> so ACPI wouldn't help them on that either.
>
> Re: optimizing suspend/resume speed
> I expect suspend/resume speed has more to do with devices than with ACPI.
> But frankly, with gaping functionality holes in Linux suspend/resume support such as
> IDE and SATA, I think that optimizing for suspend/resume speed on a mainstream laptop
> is somewhat "forward looking".

Well, list of hardware where s2ram works okay is long and growing...
of course, help is always wanted. And yes, it would be nice if someone
optimized suspend/resume speed. There is some low-hanging fruit
there.

--
Thanks for all the (sleeping) penguins.

2006-09-05 02:17:25

by Adam Belay

Subject: Re: [RFC][PATCH 1/2] ACPI: Idle Processor PM Improvements

Hi Pavel,

On Mon, 2006-09-04 at 12:59 +0000, Pavel Machek wrote:
> Hi!
>
> > This patch improves the ACPI c-state selection algorithm. It also
> > includes a major cleanup and simplification of the processor idle code.
>
> Nice!
>
> > @@ -1009,7 +883,7 @@
> >
> > seq_printf(seq, "active state: C%zd\n"
> > "max_cstate: C%d\n"
> > - "bus master activity: %08x\n",
> > + "bus master activity: %d\n",
> > pr->power.state ? pr->power.state - pr->power.states : 0,
> > max_cstate, (unsigned)pr->power.bm_activity);
> >
>
> This changes kernel - user interface. You should change the field
> description, or keep it in hex...

Good catch! Essentially the field now counts the number of times bus
master activity was detected, rather than bitshifting. I'll change its
name in the next iteration.
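
To illustrate the two meanings of the field, here is a minimal sketch that
is not taken from the patch. It assumes the old field was a shift-register
history (one bit per poll, newest poll in bit 0) and the new one is a plain
counter. All struct and function names below (bm_history, bm_counter, and
so on) are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins; these are not the patch's actual data structures. */
struct bm_history { uint32_t bits; };      /* assumed old style: one bit per poll */
struct bm_counter { unsigned int hits; };  /* assumed new style: running count   */

/* Old style: shift the history left and record the newest poll in bit 0. */
static void bm_history_poll(struct bm_history *h, int bm_sts)
{
    h->bits <<= 1;
    if (bm_sts)
        h->bits |= 0x1;
}

/* New style: count the polls in which bus master activity was seen. */
static void bm_counter_poll(struct bm_counter *c, int bm_sts)
{
    if (bm_sts)
        c->hits++;
}

/* Format the history the way a "%08x" line would show it. */
static int format_history(char *buf, size_t len, const struct bm_history *h)
{
    return snprintf(buf, len, "bus master activity: %08x", (unsigned)h->bits);
}

/* Format the counter the way a "%d" line would show it. */
static int format_counter(char *buf, size_t len, const struct bm_counter *c)
{
    return snprintf(buf, len, "bus master activity: %d", (int)c->hits);
}
```

Feeding both the poll sequence 1, 0, 1, 1 gives "bus master activity:
0000000b" for the history and "bus master activity: 3" for the counter, so
a reader that still expects a hex bitmask would misinterpret the new value
unless the field is renamed.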

>
> BTW, will you be at September's labs conference?

It's not currently in my plans, but I'd love to attend one at some
point.

>
> Pavel

Thanks for the comments.

Regards,
Adam


2006-09-05 14:32:15

by Jim Gettys

Subject: Re: [OLPC-devel] Re: [RFC][PATCH 1/2] ACPI: Idle Processor PM Improvements

On Mon, 2006-09-04 at 13:09 +0000, Pavel Machek wrote:

>
> 2.4 and 2.6 are *very* different here. You'll probably need to optimize freezer
> in 2.6 a bit...
>

Among other problems: e.g. 2.4 did not automatically do a VT switch; 2.6
does; we'll have to have a way to signal "we're a sane display driver;
don't switch away from me on suspend".
- Jim

--
Jim Gettys
One Laptop Per Child


2006-09-06 10:37:23

by Pavel Machek

Subject: Re: [OLPC-devel] Re: [RFC][PATCH 1/2] ACPI: Idle Processor PM Improvements

Hi!

> > 2.4 and 2.6 are *very* different here. You'll probably need to optimize freezer
> > in 2.6 a bit...
> >
>
> Among other problems: e.g. 2.4 did not automatically do a VT switch; 2.6
> does; we'll have to have a way to signal "we're a sane display driver;
> don't switch away from me on suspend".

Not like that, please.

You are using X running over framebuffer, right? So that kernel is
controlling the graphics hardware. In such case it is safe to avoid VT
switch.
Pavel
--
Thanks, Sharp!

2006-09-06 14:57:27

by Jordan Crouse

Subject: Re: ACPI: Idle Processor PM Improvements

On 06/09/06 12:37 +0200, Pavel Machek wrote:
> Hi!
>
> > > 2.4 and 2.6 are *very* different here. You'll probably need to optimize freezer
> > > in 2.6 a bit...
> > >
> >
> > Among other problems: e.g. 2.4 did not automatically do a VT switch; 2.6
> > does; we'll have to have a way to signal "we're a sane display driver;
> > don't switch away from me on suspend".
>
> Not like that, please.
>
> You are using X running over framebuffer, right? So that kernel is
> controlling the graphics hardware. In such case it is safe to avoid VT
> switch.

Actually not - the Geode GX has full 2D hardware acceleration with a complete
X driver to match. No Xfbdev here.

Jordan

--
Jordan Crouse
Senior Linux Engineer
Advanced Micro Devices, Inc.
<http://www.amd.com/embeddedprocessors>


2006-09-06 15:19:35

by Jim Gettys

Subject: Re: [OLPC-devel] Re: [RFC][PATCH 1/2] ACPI: Idle Processor PM Improvements

On Wed, 2006-09-06 at 12:37 +0200, Pavel Machek wrote:
> Hi!
>
> > > 2.4 and 2.6 are *very* different here. You'll probably need to optimize freezer
> > > in 2.6 a bit...
> > >
> >
> > Among other problems: e.g. 2.4 did not automatically do a VT switch; 2.6
> > does; we'll have to have a way to signal "we're a sane display driver;
> > don't switch away from me on suspend".
>
> Not like that, please.
>
> You are using X running over framebuffer, right? So that kernel is
> controlling the graphics hardware. In such case it is safe to avoid VT
> switch.

It should be perfectly safe.

The Geode has significantly more than dumb frame buffer support, even
though it can't support 3D in hardware (we do get blit and alpha
blending, and YUV->RGB support in hardware).

We have an fbdev driver for the hardware (in fact, we finally have to have
a decent driver in general, as the transfer to and from the DCON-controlled
display has to happen at interrupt time). We won't be doing anything evil
in X behind the operating system's back the way most XF86 drivers do,
but will work very much the way display drivers supported X before the
strange notion of completely OS-independent drivers without any kernel
support twisted the way XF86 drivers usually work. Ah, back to the future
(past)....
- Jim


--
Jim Gettys
One Laptop Per Child


2006-09-12 09:20:58

by Pavel Machek

Subject: Re: ACPI: Idle Processor PM Improvements

On Wed 2006-09-06 08:58:49, Jordan Crouse wrote:
> On 06/09/06 12:37 +0200, Pavel Machek wrote:
> > Hi!
> >
> > > > 2.4 and 2.6 are *very* different here. You'll probably need to optimize freezer
> > > > in 2.6 a bit...
> > > >
> > >
> > > Among other problems: e.g. 2.4 did not automatically do a VT switch; 2.6
> > > does; we'll have to have a way to signal "we're a sane display driver;
> > > don't switch away from me on suspend".
> >
> > Not like that, please.
> >
> > You are using X running over framebuffer, right? So that kernel is
> > controlling the graphics hardware. In such case it is safe to avoid VT
> > switch.
>
> Actually not - the Geode GX has full 2D hardware acceleration with a complete
> X driver to match. No Xfbdev here.

Ok, so what is needed is message to X "we are suspending", and X needs
to respond "okay, I'm ready, no need for console switch".

Alternatively, hack kernel to take control from X without actually
switching consoles. That should be possible even with current
interface.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-09-12 09:21:46

by Pavel Machek

Subject: Re: [OLPC-devel] Re: [RFC][PATCH 1/2] ACPI: Idle Processor PM Improvements

On Wed 2006-09-06 11:19:09, Jim Gettys wrote:
> On Wed, 2006-09-06 at 12:37 +0200, Pavel Machek wrote:
> > Hi!
> >
> > > > 2.4 and 2.6 are *very* different here. You'll probably need to optimize freezer
> > > > in 2.6 a bit...
> > > >
> > >
> > > Among other problems: e.g. 2.4 did not automatically do a VT switch; 2.6
> > > does; we'll have to have a way to signal "we're a sane display driver;
> > > don't switch away from me on suspend".
> >
> > Not like that, please.
> >
> > You are using X running over framebuffer, right? So that kernel is
> > controlling the graphics hardware. In such case it is safe to avoid VT
> > switch.
>
> It should be perfectly safe.

Okay, but per-driver flag is wrong way to go (see the other mail).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-09-12 18:17:15

by Jim Gettys

Subject: Re: ACPI: Idle Processor PM Improvements

On Tue, 2006-09-12 at 11:21 +0200, Pavel Machek wrote:

> Ok, so what is needed is message to X "we are suspending", and X needs
> to respond "okay, I'm ready, no need for console switch".

This presumes an external agent to X controlling the fast
suspend/resume, with messages having to flow to and from X, and to and
from the kernel, with the kernel in the middle.

Another simpler option is X itself just telling the kernel to suspend
without console switch, as the handoff of the display to the DCON chip
has to be done with X and with an interrupt signaling completion of the
handoff. This would be triggered by an inactivity timeout in the X
server.

I'm not sure which is best right now: generality vs. simplicity. We
just got samples of hardware to do some prototyping on in the last two
weeks. (see wiki.laptop.org for photographs of our screen and the DCON
in action).

>
> Alternatively, hack kernel to take control from X without actually
> switching consoles. That should be possible even with current
> interface.

This would require saving/restoring all graphics state in the kernel
(and X already has that state internally). Feasible, but seems like
duplication of effort. I haven't checked if there are any write-only
registers in the Geode (though, thankfully, this kind of brain damage is
rarer than it once was). This then begs interesting kernel/X
synchronization issues, of course.
- Jim


> Pavel
--
Jim Gettys
One Laptop Per Child


2006-09-12 18:27:32

by Mitch Bradley

Subject: Re: ACPI: Idle Processor PM Improvements

Jim Gettys wrote:
>
> I haven't checked if there are any write-only
> registers in the Geode (though, thankfully, this kind of brain damage is
> rarer than it once was).
I've been going through the Geode and 5536 specs with a fine-toothed
comb, and so far haven't seen any write-only registers apart from the
ones in the ISA legacy devices.

2006-09-12 20:14:54

by Jordan Crouse

Subject: Re: ACPI: Idle Processor PM Improvements

On 12/09/06 14:14 -0400, Jim Gettys wrote:
> > Alternatively, hack kernel to take control from X without actually
> > switching consoles. That should be possible even with current
> > interface.
>
> This would require saving/restoring all graphics state in the kernel
> (and X already has that state internally). Feasible, but seems like
> duplication of effort. I haven't checked if there are any write-only
> registers in the Geode (though, thankfully, this kind of brain damage is
> rarer than it once was). This then begs interesting kernel/X
> synchronization issues, of course.

We don't need any kernel output during suspend or resume. Thus, if the VT
doesn't change, then the kernel doesn't need to worry about saving or restoring
the graphics state, and that's the way it should be, IMHO.
Whoever owns the current VT should be in charge of saving and restoring
the registers.

So, we would need some way of indicating the "ownership" of the VT. And
in reality, we really only need to know if the framebuffer console owns it or
not, so a boolean would suffice. In the past, I've used KD_TEXT and
KD_GRAPHICS for this purpose. As an example, on the Geode LX, I assume
that if the vc_mode is KD_GRAPHICS, then we don't own it, and we don't
do 2D accelerations. If the mode is KD_TEXT then we are free to use the
2D engine. All I needed to add was a notifier chain to let the framebuffer
know when the mode switched, and I was happy. I'm not sure if that's the
smartest way to handle it permanently, but it works in a pinch.
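
The ownership idea described above can be sketched in plain user-space C.
This is only an illustration under stated assumptions, not the Geode
driver: the enum mirrors the KD_TEXT/KD_GRAPHICS console modes (the real
constants live in <linux/kd.h>), and the notifier chain, fb_state, and all
function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local mirror of the two console modes, for illustration only. */
enum vt_mode { VT_MODE_TEXT = 0, VT_MODE_GRAPHICS = 1 };

/* Hypothetical callback type: "the console mode has changed". */
typedef void (*mode_notifier_fn)(enum vt_mode new_mode, void *ctx);

#define MAX_NOTIFIERS 4

/* A tiny fixed-size notifier chain (hypothetical, not the kernel's). */
struct mode_notifier_chain {
    mode_notifier_fn fn[MAX_NOTIFIERS];
    void *ctx[MAX_NOTIFIERS];
    size_t count;
};

/* Add a listener; returns false if the chain is full. */
static bool chain_register(struct mode_notifier_chain *c,
                           mode_notifier_fn fn, void *ctx)
{
    if (c->count == MAX_NOTIFIERS)
        return false;
    c->fn[c->count] = fn;
    c->ctx[c->count] = ctx;
    c->count++;
    return true;
}

/* Tell every listener about the new mode. */
static void chain_notify(const struct mode_notifier_chain *c, enum vt_mode m)
{
    for (size_t i = 0; i < c->count; i++)
        c->fn[i](m, c->ctx[i]);
}

/* Hypothetical framebuffer-side state: a single ownership boolean. */
struct fb_state { bool owns_display; };

/* Text mode means the framebuffer console owns the hardware. */
static void fb_mode_changed(enum vt_mode m, void *ctx)
{
    struct fb_state *fb = ctx;
    fb->owns_display = (m == VT_MODE_TEXT);
}

/* The 2D engine is only touched while the console owns the display. */
static bool fb_may_use_2d(const struct fb_state *fb)
{
    return fb->owns_display;
}

/* Whoever owns the VT saves and restores the registers across suspend. */
static bool fb_must_save_state(const struct fb_state *fb)
{
    return fb->owns_display;
}
```

After a switch to graphics mode both fb_may_use_2d() and
fb_must_save_state() return false, leaving register save and restore to
whoever put the console into graphics mode (X, in this discussion).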

Jordan

--
Jordan Crouse
Senior Linux Engineer
Advanced Micro Devices, Inc.
<http://www.amd.com/embeddedprocessors>


2006-09-14 09:18:40

by Pavel Machek

Subject: Re: ACPI: Idle Processor PM Improvements

On Tue 2006-09-12 14:14:30, Jim Gettys wrote:
> On Tue, 2006-09-12 at 11:21 +0200, Pavel Machek wrote:
>
> > Ok, so what is needed is message to X "we are suspending", and X needs
> > to respond "okay, I'm ready, no need for console switch".
>
> This presumes an external agent to X controlling the fast
> suspend/resume, with messages having to flow to and from X, and to and
> from the kernel, with the kernel in the middle.
>
> Another simpler option is X itself just telling the kernel to suspend
> without console switch, as the handoff of the display to the DCON chip
> has to be done with X and with an interrupt signaling completion of the
> handoff. This would be triggered by an inactivity timeout in the X
> server.

Whoa... that's a hack.. but yes, you can probably do that, and I think
kernel even has necessary interfaces already. (They were needed for
uswsusp).

> > Alternatively, hack kernel to take control from X without actually
> > switching consoles. That should be possible even with current
> > interface.
>
> This would require saving/restoring all graphics state in the kernel
> (and X already has that state internally). Feasible, but seems like

Hmm, save/restore graphics state from the kernel would of course be
clean solution, but you should have that anyway... what if someone
suspends without X running?

And of course you can just cheat, and not do kernel save-state on your
system.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-09-14 09:20:43

by Pavel Machek

Subject: Re: ACPI: Idle Processor PM Improvements

On Tue 2006-09-12 14:18:05, Jordan Crouse wrote:
> On 12/09/06 14:14 -0400, Jim Gettys wrote:
> > > Alternatively, hack kernel to take control from X without actually
> > > switching consoles. That should be possible even with current
> > > interface.
> >
> > This would require saving/restoring all graphics state in the kernel
> > (and X already has that state internally). Feasible, but seems like
> > duplication of effort. I haven't checked if there are any write-only
> > registers in the Geode (though, thankfully, this kind of brain damage is
> > rarer than it once was). This then begs interesting kernel/X
> > synchronization issues, of course.
>
> We don't need any kernel output during suspend or resume. Thus, if the VT
> doesn't change, then the kernel doesn't need to worry about saving or restoring
> the graphics state, and that's the way it should be, IMHO.
> Whoever owns the current VT should be in charge of saving and restoring
> the registers.
>
> So, we would need some way of indicating the "ownership" of the VT. And
> in reality, we really only need to know if the framebuffer console owns it or
> not, so a boolean would suffice. In the past, I've used KD_TEXT and
> KD_GRAPHICS for this purpose. As an example, on the Geode LX, I assume
> that if the vc_mode is KD_GRAPHICS, then we don't own it, and we don't
> do 2D accelerations. If the mode is KD_TEXT then we are free to use the
> 2D engine. All I needed to add was a notifier chain to let the framebuffer
> know when the mode switched, and I was happy. I'm not sure if that's the
> smartest way to handle it permanently, but it works in a pinch.

KD_TEXT vs. KD_GRAPHICS looks like the way to go. Just tell X you want
console back, but then don't actually redraw/switch consoles. We
probably want that on normal PCs, too... console switch for
suspend-to-RAM looks ugly.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-09-14 11:30:12

by Jim Gettys

Subject: Re: ACPI: Idle Processor PM Improvements

On Thu, 2006-09-14 at 11:18 +0200, Pavel Machek wrote:
> On Tue, 2006-09-12 at 11:21 +0200, Pavel Machek wrote:
> >
> > > Ok, so what is needed is message to X "we are suspending", and X needs
> > > to respond "okay, I'm ready, no need for console switch".
> >
> > This presumes an external agent to X controlling the fast
> > suspend/resume, with messages having to flow to and from X, and to and
> > from the kernel, with the kernel in the middle.
> >
> > Another simpler option is X itself just telling the kernel to suspend
> > without console switch, as the handoff of the display to the DCON chip
> > has to be done with X and with an interrupt signaling completion of the
> > handoff. This would be triggered by an inactivity timeout in the X
> > server.
>
> Whoa... that's a hack.. but yes, you can probably do that, and I think
> kernel even has necessary interfaces already. (They were needed for
> uswsusp).

Glad you like it ;-). Dunno which way we'll go yet, though it will get
to the top of the pile to implement this fall. I suspect we may go this
route to get going, but explore the more general solution as we get more
sophisticated power management policies and standards in place.

>
> > > Alternatively, hack kernel to take control from X without actually
> > > switching consoles. That should be possible even with current
> > > interface.
> >
> > This would require saving/restoring all graphics state in the kernel
> > (and X already has that state internally). Feasible, but seems like
>
> Hmm, save/restore graphics state from the kernel would of course be
> clean solution, but you should have that anyway... what if someone
> suspends without X running?

X knows its graphics state; it has to remember it all to know when it
has to be changed; on resume, resume can reinit the graphics state to
what the console wants/needs.

If you VT switch back to X, X can restore the graphics state to what it
remembers.
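
A minimal simulation of that division of labour, with the hardware modelled
as a small register array. Everything here is an assumption made for
illustration: the register count, the console default values, and the names
gfx_hw, x_server, kernel_resume, and x_vt_enter do not come from any real
driver.

```c
#include <stddef.h>
#include <string.h>

#define NUM_REGS 4  /* hypothetical register count, for illustration only */

struct gfx_hw   { unsigned int regs[NUM_REGS]; };    /* simulated hardware  */
struct x_server { unsigned int shadow[NUM_REGS]; };  /* X's remembered copy */

/* Hypothetical register values the text console wants after resume. */
static const unsigned int console_defaults[NUM_REGS] = { 0x1, 0x0, 0x0, 0x0 };

/* X records every value it programs, so it always knows the state. */
static void x_write_reg(struct x_server *x, struct gfx_hw *hw,
                        size_t idx, unsigned int value)
{
    x->shadow[idx] = value;
    hw->regs[idx] = value;
}

/* Suspend-to-RAM: the graphics block loses its register contents. */
static void hw_power_loss(struct gfx_hw *hw)
{
    memset(hw->regs, 0, sizeof hw->regs);
}

/* Resume: the kernel only reinitializes what the console needs. */
static void kernel_resume(struct gfx_hw *hw)
{
    memcpy(hw->regs, console_defaults, sizeof hw->regs);
}

/* VT switch back to X: X reprograms the hardware from its own copy. */
static void x_vt_enter(const struct x_server *x, struct gfx_hw *hw)
{
    memcpy(hw->regs, x->shadow, sizeof hw->regs);
}
```

In this model the kernel never stores X's state: after hw_power_loss() and
kernel_resume() the registers hold the console defaults, and a later
x_vt_enter() brings back whatever X last wrote.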

>
> And of course you can just cheat, and not do kernel save-state on your
> system.

Yup, though it isn't clear to me I'd call it cheating. In some ways,
what I just described to handle suspends when X is not running is really
robust and simple. And you don't have divided responsibility for
remembering the state. Simple == good in my book.

- Jim

--
Jim Gettys
One Laptop Per Child