On some SoC chips, HW resources may be in use during any particular idle
period. As a consequence, the cpuidle states that the SoC is safe to
enter can change from idle period to idle period. In addition, the
latency and threshold of each cpuidle state can vary, depending on the
operating condition when the CPU becomes idle, e.g. the current cpu
frequency, the current state of the HW blocks, etc.
cpuidle core and the menu governor, in the current form, are geared
towards cpuidle states that are static, i.e. the availabiltiy of the
states, their latencies, their thresholds are non-changing during run
time. cpuidle does not provide any hook that cpuidle drivers can use
to adjust those values on the fly for the current idle period before the
menu governor selects the target cpuidle state.
This patch extends cpuidle core and the menu governor to handle states
that are dynamic. There are three additions in the patch and the patch
maintains backwards-compatibility with existing cpuidle drivers.
1) add prepare() to struct cpuidle_device. A cpuidle driver can hook
into the callback and the menu governor will call prepare() in
menu_select(). The callback gives the cpuidle driver a chance to update
the dynamic information of the cpuidle states for the current idle
period, e.g. state availability, latencies, thresholds, power values,
etc.
2) add CPUIDLE_FLAG_IGNORE as one of the state flags. In the prepare()
function, a cpuidle driver can set/clear the flag to indicate to the
menu governor whether a cpuidle state should be ignored, i.e. not
available, during the current idle period.
3) add compare_power bit to struct cpuidle_device. The menu governor
currently assumes that the cpuidle states are arranged in the order of
increasing latency, threshold, and power savings. This is true or can
be made true for static states. Once the state parameters are dynamic,
the latencies, thresholds, and power savings for the cpuidle states can
increase or decrease by different amounts from idle period to idle
period. So the assumption of increasing latency, threshold, and power
savings from Cn to C(n+1) can no longer be guaranteed.
It can be straight forward to calculate the power consumption of each
available state for the predicted idle period. The menu governor then
selects the state that has the lowest power consumption and that still
satisfies all other critieria. When the compare_power bit is true, the
menu governor uses the power_usage fields to find the lowest power
state instead of relying on the above assumption.
Signed-off-by: Ai Li <[email protected]>
---
drivers/cpuidle/governors/menu.c | 59 +++++++++++++++++++++++++++++--------
include/linux/cpuidle.h | 4 ++
2 files changed, 50 insertions(+), 13 deletions(-)
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 1b12870..b3854cc 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -271,6 +271,9 @@ static int menu_select(struct cpuidle_device *dev)
detect_repeating_patterns(data);
+ if (dev->prepare)
+ dev->prepare(dev, data->predicted_us);
+
/*
* We want to default to C1 (hlt), not to busy polling
* unless the timer is happening really really soon.
@@ -278,19 +281,49 @@ static int menu_select(struct cpuidle_device *dev)
if (data->expected_us > 5)
data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
-
- /* find the deepest idle state that satisfies our constraints */
- for (i = CPUIDLE_DRIVER_STATE_START; i < dev->state_count; i++) {
- struct cpuidle_state *s = &dev->states[i];
-
- if (s->target_residency > data->predicted_us)
- break;
- if (s->exit_latency > latency_req)
- break;
- if (s->exit_latency * multiplier > data->predicted_us)
- break;
- data->exit_us = s->exit_latency;
- data->last_state_idx = i;
+ if (dev->compare_power) {
+ /* find the idle state with the lowest power while satisfying
+ * our constraints
+ */
+ unsigned int power_usage = (unsigned int) ~0UL;
+
+ for (i = CPUIDLE_DRIVER_STATE_START; i < dev->state_count; i++) {
+ struct cpuidle_state *s = &dev->states[i];
+
+ if (s->flags & CPUIDLE_FLAG_IGNORE)
+ continue;
+ if (s->target_residency > data->predicted_us)
+ continue;
+ if (s->exit_latency > latency_req)
+ continue;
+ if (s->exit_latency * multiplier > data->predicted_us)
+ continue;
+
+ if (s->power_usage < power_usage) {
+ power_usage = s->power_usage;
+ data->exit_us = s->exit_latency;
+ data->last_state_idx = i;
+ }
+ }
+ } else {
+ /* find the deepest idle state that satisfies our
+ * constraints
+ */
+ for (i = CPUIDLE_DRIVER_STATE_START; i < dev->state_count; i++) {
+ struct cpuidle_state *s = &dev->states[i];
+
+ if (s->flags & CPUIDLE_FLAG_IGNORE)
+ continue;
+
+ if (s->target_residency > data->predicted_us)
+ break;
+ if (s->exit_latency > latency_req)
+ break;
+ if (s->exit_latency * multiplier > data->predicted_us)
+ break;
+ data->exit_us = s->exit_latency;
+ data->last_state_idx = i;
+ }
}
return data->last_state_idx;
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 55215cc..4406670 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -52,6 +52,7 @@ struct cpuidle_state {
#define CPUIDLE_FLAG_SHALLOW (0x20) /* low latency, minimal savings */
#define CPUIDLE_FLAG_BALANCED (0x40) /* medium latency, moderate savings */
#define CPUIDLE_FLAG_DEEP (0x80) /* high latency, large savings */
+#define CPUIDLE_FLAG_IGNORE (0x100) /* ignore during this idle period */
#define CPUIDLE_DRIVER_FLAGS_MASK (0xFFFF0000)
@@ -84,6 +85,7 @@ struct cpuidle_state_kobj {
struct cpuidle_device {
unsigned int registered:1;
unsigned int enabled:1;
+ unsigned int compare_power:1;
unsigned int cpu;
int last_residency;
@@ -97,6 +99,8 @@ struct cpuidle_device {
struct completion kobj_unregister;
void *governor_data;
struct cpuidle_state *safe_state;
+
+ int (*prepare) (struct cpuidle_device *dev, int idle_us);
};
DECLARE_PER_CPU(struct cpuidle_device *, cpuidle_devices);
--
1.5.6.3
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum
On 7/15/2010 1:30 PM, Ai Li wrote:
I'm ok with the general idea, but have a few comments about the
implementation
> Signed-off-by: Ai Li<[email protected]>
> ---
> drivers/cpuidle/governors/menu.c | 59 +++++++++++++++++++++++++++++--------
> include/linux/cpuidle.h | 4 ++
> 2 files changed, 50 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index 1b12870..b3854cc 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -271,6 +271,9 @@ static int menu_select(struct cpuidle_device *dev)
>
> detect_repeating_patterns(data);
>
> + if (dev->prepare)
> + dev->prepare(dev, data->predicted_us);
>
I don't like the idea of passing predicted_us here.
the states and their updates should be independent of how long we think
we'll be idle;
it's up to the menu governor to then pick a good one, not for the
platform to muck with things
based on this.
Also I would like the cpuidle code, not the governor, to call this
prepare function.
The need to call ->prepare is governor independent....
+ if (dev->compare_power) {
I'm not a big fan of this as a flag; either we always do this, which I
can understand, or we sort things, which is also fine with me.
Doing this condition like this.... not a fan.
> > + if (dev->prepare)
> > + dev->prepare(dev, data->predicted_us);
> >
>
> I don't like the idea of passing predicted_us here.
> the states and their updates should be independent of how long we
> think we'll be idle;
The power_usage value, total or average, would depend on how long the
predicted idle period is. On our SoCs, a cpuidle state has three
stages: entry stage, low power stage, and exit stage. Entry and exit
stages consume more power than the low power stage but have fixed
durations, irrespective how long the idle period is. As the
predicted idle period changes, the entry and exit duration stay the
same but the low power duration changes, resulting in different total
or average power for the idle period.
> Also I would like the cpuidle code, not the governor, to call this
> prepare function.
> The need to call ->prepare is governor independent....
I agree that it would be cleaner to call ->prepare from the cpuidle
code. But if we need values calculated in the governor's select
function, I'm not sure what is the best way to do that from cpuidle
code.
> + if (dev->compare_power) {
>
> I'm not a big fan of this as a flag; either we always do this,
> which I
> can understand, or we sort things, which is also fine with me.
> Doing this condition like this.... not a fan.
One of the concerns I have is backwards compatibility. As far as I
know, none of the current cpuidle drivers use the power_usage field.
If we always do compare_power, those drivers would break until
someone with technical device knowledge update the drivers to specify
power... I could derive fake power_usage numbers by default, using
the cstate index position. That seems kind of hacky but it would
remove the need for the compare_power flag and retain the current
behavior when cpuidle drivers do not provide their own power numbers.
~Ai
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum
On 7/16/2010 10:25 AM, Ai Li wrote:
>>> + if (dev->prepare)
>>> + dev->prepare(dev, data->predicted_us);
>>>
>>>
>> I don't like the idea of passing predicted_us here.
>> the states and their updates should be independent of how long we
>> think we'll be idle;
>>
> The power_usage value, total or average, would depend on how long the
> predicted idle period is. On our SoCs, a cpuidle state has three
> stages: entry stage, low power stage, and exit stage. Entry and exit
> stages consume more power than the low power stage but have fixed
> durations, irrespective how long the idle period is. As the
> predicted idle period changes, the entry and exit duration stay the
> same but the low power duration changes, resulting in different total
> or average power for the idle period.
>
the power value in the structure should represent ONLY the power level
during the low power stage.
And this should be independent of total duration.
all other power is taken into account in terms of break even point/etc...
> One of the concerns I have is backwards compatibility. As far as I
> know, none of the current cpuidle drivers use the power_usage field.
> If we always do compare_power, those drivers would break until
> someone with technical device knowledge update the drivers to specify
> power... I could derive fake power_usage numbers by default, using
> the cstate index position. That seems kind of hacky but it would
> remove the need for the compare_power flag and retain the current
> behavior when cpuidle drivers do not provide their own power numbers.
>
I'm fine with this approach actually; if someone does not fill it in, we
fake data that makes it
valid... better than getting complex code.
> the power value in the structure should represent ONLY the power
> level during the low power stage.
> And this should be independent of total duration.
> all other power is taken into account in terms of break even
> point/etc...
With static cstates, determining the break even point is
straitforward, compare the power numbers of state Cn and Cn-1, since
the states are ordered in increasing order of latency and power.
With dynamic cstates, Cn-1 may not be a valid state to compare any
more, for example, because Cn-1's latency may have become too high.
It seems the driver would need to know which cstate the govenor would
compare Cn to, and that would break the design philosophy of driver +
govenor. The break even point does not seem to have a transistive
property, where the govenor can calculat Cn vs Cn-2 from some
arithmatic combination of Cn vs Cn-1 and Cn-1 vs Cn-2 values. On the
other hand, if the power_usage field also includes the entry and exit
stages, then the driver does not need to know whether it should
calculate break even point for Cn vs Cn-1, or Cn vs Cn-2, etc.
> > One of the concerns I have is backwards compatibility. As far as
> I
> > know, none of the current cpuidle drivers use the power_usage
> field.
> > If we always do compare_power, those drivers would break until
> > someone with technical device knowledge update the drivers to
> specify
> > power... I could derive fake power_usage numbers by default,
> using
> > the cstate index position. That seems kind of hacky but it would
> > remove the need for the compare_power flag and retain the current
> > behavior when cpuidle drivers do not provide their own power
> numbers.
> >
>
> I'm fine with this approach actually; if someone does not fill it
> in, we fake data that makes it
> valid... better than getting complex code.
OK.
~Ai
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum
On 7/16/2010 12:19 PM, Ai Li wrote:
>> the power value in the structure should represent ONLY the power
>> level during the low power stage.
>> And this should be independent of total duration.
>> all other power is taken into account in terms of break even
>> point/etc...
>>
> With static cstates, determining the break even point is
> straitforward, compare the power numbers of state Cn and Cn-1, since
> the states are ordered in increasing order of latency and power.
> With dynamic cstates, Cn-1 may not be a valid state to compare any
> more, for example, because Cn-1's latency may have become too high.
> It seems the driver would need to know which cstate the govenor would
> compare Cn to, and that would break the design philosophy of driver +
> govenor. The break even point does not seem to have a transistive
> property, where the govenor can calculat Cn vs Cn-2 from some
> arithmatic combination of Cn vs Cn-1 and Cn-1 vs Cn-2 values. On the
> other hand, if the power_usage field also includes the entry and exit
> stages, then the driver does not need to know whether it should
> calculate break even point for Cn vs Cn-1, or Cn vs Cn-2, etc.
>
that's nice in theory.
in practice though, this is all noise compared to some of the accuracy
in the predictions.
break even generally is done against C1 only (since C1 is assumed to
always be there)....
yes it'd be nice to also have it against Cx in a matrix form, but that
is a level of complexity that
hasn't been worth it.
Note that the prediction is.... a prediction. I can show you data on how
well it does (now that it's
much better in 2.6.35-rc), but it's still "50% of the time we're within
a factor of two of actual".
not "we're 90% of the time within 10%".
> that's nice in theory.
> in practice though, this is all noise compared to some of the
> accuracy in the predictions.
>
> break even generally is done against C1 only (since C1 is assumed
> to always be there)....
> yes it'd be nice to also have it against Cx in a matrix form, but
> that is a level of complexity that
> hasn't been worth it.
>
> Note that the prediction is.... a prediction. I can show you data
> on how well it does (now that it's
> much better in 2.6.35-rc), but it's still "50% of the time we're
> within a factor of two of actual".
> not "we're 90% of the time within 10%".
OK. I guess we can always add something like predicted_us later,
especially when the predication is more accurate. For this patch, I
will change to
int (*prepare) (struct cpuidle_device *dev), and call it in cpuidle
instead of the govenor.
~Ai
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum
> -----Original Message-----
> From: Arjan van de Ven [mailto:[email protected]]
> Sent: Friday, July 16, 2010 1:33 PM
> To: Ai Li
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; linux-arm-
> [email protected]; [email protected]
> Subject: Re: [PATCH] cpuidle: extend cpuidle and menu governor to
> handle dynamic states
>
> On 7/16/2010 12:19 PM, Ai Li wrote:
> >> the power value in the structure should represent ONLY the power
> >> level during the low power stage.
> >> And this should be independent of total duration.
> >> all other power is taken into account in terms of break even
> >> point/etc...
> >>
> > With static cstates, determining the break even point is
> > straitforward, compare the power numbers of state Cn and Cn-1,
> since
> > the states are ordered in increasing order of latency and power.
> > With dynamic cstates, Cn-1 may not be a valid state to compare
> any
> > more, for example, because Cn-1's latency may have become too
> high.
> > It seems the driver would need to know which cstate the govenor
> would
> > compare Cn to, and that would break the design philosophy of
> driver +
> > govenor. The break even point does not seem to have a
> transistive
> > property, where the govenor can calculat Cn vs Cn-2 from some
> > arithmatic combination of Cn vs Cn-1 and Cn-1 vs Cn-2 values. On
> the
> > other hand, if the power_usage field also includes the entry and
> exit
> > stages, then the driver does not need to know whether it should
> > calculate break even point for Cn vs Cn-1, or Cn vs Cn-2, etc.
> >
>
> that's nice in theory.
> in practice though, this is all noise compared to some of the
> accuracy
> in the predictions.
>
> break even generally is done against C1 only (since C1 is assumed
> to
> always be there)....
> yes it'd be nice to also have it against Cx in a matrix form, but
> that
> is a level of complexity that
> hasn't been worth it.
>
> Note that the prediction is.... a prediction. I can show you data
> on how
> well it does (now that it's
> much better in 2.6.35-rc), but it's still "50% of the time we're
> within
> a factor of two of actual".
> not "we're 90% of the time within 10%".