Currently, the cpuidle governors (menu/ladder) determine what idle state a
idling CPU should enter into based on heuristics that depend on the idle
history on that CPU. Given that no predictive heuristic is perfect, there
are cases where the governor predicts a shallow idle state, hoping that
the CPU will be busy soon. However, if no new workload is scheduled on
that CPU in the near future, the CPU will end up in the shallow state.
Motivation
----------
In case of POWER, this is problematic, when the predicted state in the
aforementioned scenario is a lite stop state, as such lite states will
inhibit SMT folding, thereby depriving the other threads in the core from
using the core resources.
To address this, such lite states need to be autopromoted. The cpuidle-core
can queue timer to correspond with the residency value of the next
available state. Thus leading to auto-promotion to a deeper idle state as
soon as possible.
Experiment
----------
I performed experiments for three scenarios to collect some data.
case 1 :
Without this patch and without tick retained, i.e. in a upstream kernel,
It would spend more than even a second to get out of stop0_lite.
case 2 : With tick retained(as suggested) -
Generally, we have a sched tick at 4ms(CONF_HZ = 250). Ideally I expected
it to take 8 sched tick to get out of stop0_lite. Experimentally,
observation was
=========================================================
sample min max 99percentile
20 4ms 12ms 4ms
=========================================================
*ms = milliseconds
It would take atleast one sched tick to get out of stop0_lite.
case 2 : With this patch (not stopping tick, but explicitly queuing a
timer)
============================================================
sample min max 99percentile
============================================================
20 144us 192us 144us
============================================================
*us = microseconds
In this patch, we queue a timer just before entering into a stop0_lite
state. The timer fires at (residency of next available state + exit latency
of next available state * 2). Let's say if next state(stop0) is available
which has residency of 20us, it should get out in as low as (20+2*2)*8
[Based on the forumla (residency + 2xlatency)*history length] microseconds
= 192us. Ideally we would expect 8 iterations, it was observed to get out
in 6-7 iterations. Even if let's say stop2 is next available state(stop0
and stop1 both are unavailable), it would take (100+2*10)*8 = 960us to get
into stop2.
So, We are able to get out of stop0_lite generally in 150us(with this
patch) as compared to 4ms(with tick retained). As stated earlier, we do not
want to get stuck into stop0_lite as it inhibits SMT folding for other
sibling threads, depriving them of core resources. Current patch is using
auto-promotion only for stop0_lite, as it gives performance benefit(primary
reason) along with lowering down power consumption. We may extend this
model for other states in future.
Abhishek Goel (2):
cpuidle : auto-promotion for cpuidle states
cpuidle : Add auto-promotion flag to cpuidle flags
arch/powerpc/include/asm/opal-api.h | 1 +
drivers/cpuidle/Kconfig | 4 ++
drivers/cpuidle/cpuidle-powernv.c | 13 +++++-
drivers/cpuidle/cpuidle.c | 68 ++++++++++++++++++++++++++++-
drivers/cpuidle/governors/ladder.c | 3 +-
drivers/cpuidle/governors/menu.c | 22 +++++++++-
include/linux/cpuidle.h | 10 ++++-
7 files changed, 115 insertions(+), 6 deletions(-)
--
2.17.1
Currently, the cpuidle governors (menu /ladder) determine what idle state
an idling CPU should enter into based on heuristics that depend on the
idle history on that CPU. Given that no predictive heuristic is perfect,
there are cases where the governor predicts a shallow idle state, hoping
that the CPU will be busy soon. However, if no new workload is scheduled
on that CPU in the near future, the CPU will end up in the shallow state.
In case of POWER, this is problematic, when the predicted state in the
aforementioned scenario is a lite stop state, as such lite states will
inhibit SMT folding, thereby depriving the other threads in the core from
using the core resources.
To address this, such lite states need to be autopromoted. The cpuidle-
core can queue timer to correspond with the residency value of the next
available state. Thus leading to auto-promotion to a deeper idle state as
soon as possible.
Signed-off-by: Abhishek Goel <[email protected]>
---
v1->v2 : Removed timeout_needed and rebased to current upstream kernel
drivers/cpuidle/cpuidle.c | 68 +++++++++++++++++++++++++++++-
drivers/cpuidle/governors/ladder.c | 3 +-
drivers/cpuidle/governors/menu.c | 22 +++++++++-
include/linux/cpuidle.h | 10 ++++-
4 files changed, 99 insertions(+), 4 deletions(-)
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 7f108309e..11ce43f19 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -36,6 +36,11 @@ static int enabled_devices;
static int off __read_mostly;
static int initialized __read_mostly;
+struct auto_promotion {
+ struct hrtimer hrtimer;
+ unsigned long timeout_us;
+};
+
int cpuidle_disabled(void)
{
return off;
@@ -188,6 +193,54 @@ int cpuidle_enter_s2idle(struct cpuidle_driver *drv, struct cpuidle_device *dev)
}
#endif /* CONFIG_SUSPEND */
+enum hrtimer_restart auto_promotion_hrtimer_callback(struct hrtimer *hrtimer)
+{
+ return HRTIMER_NORESTART;
+}
+
+#ifdef CONFIG_CPU_IDLE_AUTO_PROMOTION
+DEFINE_PER_CPU(struct auto_promotion, ap);
+
+static void cpuidle_auto_promotion_start(int cpu, struct cpuidle_state *state)
+{
+ struct auto_promotion *this_ap = &per_cpu(ap, cpu);
+
+ if (state->flags & CPUIDLE_FLAG_AUTO_PROMOTION)
+ hrtimer_start(&this_ap->hrtimer, ns_to_ktime(this_ap->timeout_us
+ * 1000), HRTIMER_MODE_REL_PINNED);
+}
+
+static void cpuidle_auto_promotion_cancel(int cpu)
+{
+ struct hrtimer *hrtimer;
+
+ hrtimer = &per_cpu(ap, cpu).hrtimer;
+ if (hrtimer_is_queued(hrtimer))
+ hrtimer_cancel(hrtimer);
+}
+
+static void cpuidle_auto_promotion_update(int cpu, unsigned long timeout)
+{
+ per_cpu(ap, cpu).timeout_us = timeout;
+}
+
+static void cpuidle_auto_promotion_init(int cpu, struct cpuidle_driver *drv)
+{
+ struct auto_promotion *this_ap = &per_cpu(ap, cpu);
+
+ hrtimer_init(&this_ap->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ this_ap->hrtimer.function = auto_promotion_hrtimer_callback;
+}
+#else
+static inline void cpuidle_auto_promotion_start(int cpu, struct cpuidle_state
+ *state) { }
+static inline void cpuidle_auto_promotion_cancel(int cpu) { }
+static inline void cpuidle_auto_promotion_update(int cpu, unsigned long
+ timeout) { }
+static inline void cpuidle_auto_promotion_init(int cpu, struct cpuidle_driver
+ *drv) { }
+#endif
+
/**
* cpuidle_enter_state - enter the state and update stats
* @dev: cpuidle device for this cpu
@@ -225,12 +278,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
trace_cpu_idle_rcuidle(index, dev->cpu);
time_start = ns_to_ktime(local_clock());
+ cpuidle_auto_promotion_start(dev->cpu, target_state);
+
stop_critical_timings();
entered_state = target_state->enter(dev, drv, index);
start_critical_timings();
sched_clock_idle_wakeup_event();
time_end = ns_to_ktime(local_clock());
+
+ cpuidle_auto_promotion_cancel(dev->cpu);
+
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
/* The cpu is no longer idle or about to enter idle. */
@@ -312,7 +370,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
bool *stop_tick)
{
- return cpuidle_curr_governor->select(drv, dev, stop_tick);
+ unsigned long timeout_us, ret;
+
+ timeout_us = UINT_MAX;
+ ret = cpuidle_curr_governor->select(drv, dev, stop_tick, &timeout_us);
+ cpuidle_auto_promotion_update(dev->cpu, timeout_us);
+
+ return ret;
}
/**
@@ -658,6 +722,8 @@ int cpuidle_register(struct cpuidle_driver *drv,
device = &per_cpu(cpuidle_dev, cpu);
device->cpu = cpu;
+ cpuidle_auto_promotion_init(cpu, drv);
+
#ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
/*
* On multiplatform for ARM, the coupled idle states could be
diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c
index f0dddc66a..65b518dd7 100644
--- a/drivers/cpuidle/governors/ladder.c
+++ b/drivers/cpuidle/governors/ladder.c
@@ -64,7 +64,8 @@ static inline void ladder_do_selection(struct ladder_device *ldev,
* @dummy: not used
*/
static int ladder_select_state(struct cpuidle_driver *drv,
- struct cpuidle_device *dev, bool *dummy)
+ struct cpuidle_device *dev, bool *dummy,
+ unsigned long *unused)
{
struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
struct ladder_device_state *last_state;
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 5951604e7..835e337de 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -276,7 +276,7 @@ static unsigned int get_typical_interval(struct menu_device *data,
* @stop_tick: indication on whether or not to stop the tick
*/
static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
- bool *stop_tick)
+ bool *stop_tick, unsigned long *timeout)
{
struct menu_device *data = this_cpu_ptr(&menu_devices);
int latency_req = cpuidle_governor_latency_req(dev->cpu);
@@ -442,6 +442,26 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
}
}
+#ifdef CPUIDLE_FLAG_AUTO_PROMOTION
+ if (drv->states[idx].flags & CPUIDLE_FLAG_AUTO_PROMOTION) {
+ /*
+ * Timeout is intended to be defined as sum of target residency
+ * of next available state, entry latency and exit latency. If
+ * time interval equal to timeout is spent in current state,
+ * and if it is a shallow lite state, we may want to auto-
+ * promote from such state.
+ */
+ for (i = idx + 1; i < drv->state_count; i++) {
+ if (drv->states[i].disabled ||
+ dev->states_usage[i].disable)
+ continue;
+ *timeout = drv->states[i].target_residency +
+ 2 * drv->states[i].exit_latency;
+ break;
+ }
+ }
+#endif
+
return idx;
}
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 3b3947232..84d76d1ec 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -72,6 +72,13 @@ struct cpuidle_state {
#define CPUIDLE_FLAG_POLLING BIT(0) /* polling state */
#define CPUIDLE_FLAG_COUPLED BIT(1) /* state applies to multiple cpus */
#define CPUIDLE_FLAG_TIMER_STOP BIT(2) /* timer is stopped on this state */
+/*
+ * State with only and only fast state bit set don't even lose user context.
+ * But such states prevent other sibling threads from thread folding benefits.
+ * And hence we don't want to stay for too long in such states and want to
+ * auto-promote from it.
+ */
+#define CPUIDLE_FLAG_AUTO_PROMOTION BIT(3)
struct cpuidle_device_kobj;
struct cpuidle_state_kobj;
@@ -243,7 +250,8 @@ struct cpuidle_governor {
int (*select) (struct cpuidle_driver *drv,
struct cpuidle_device *dev,
- bool *stop_tick);
+ bool *stop_tick, unsigned long
+ *timeout);
void (*reflect) (struct cpuidle_device *dev, int index);
};
--
2.17.1
This patch sets up flags for the state which needs to be auto-promoted. On
POWERNV system, only lite states need to be autopromoted. We identify lite
states by those which do not lose user context. That information has been
used to set the flag for lite states.
Signed-off-by: Abhishek Goel <[email protected]>
---
arch/powerpc/include/asm/opal-api.h | 1 +
drivers/cpuidle/Kconfig | 4 ++++
drivers/cpuidle/cpuidle-powernv.c | 13 +++++++++++--
3 files changed, 16 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 870fb7b23..735dec731 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -226,6 +226,7 @@
*/
#define OPAL_PM_TIMEBASE_STOP 0x00000002
+#define OPAL_PM_LOSE_USER_CONTEXT 0x00001000
#define OPAL_PM_LOSE_HYP_CONTEXT 0x00002000
#define OPAL_PM_LOSE_FULL_CONTEXT 0x00004000
#define OPAL_PM_NAP_ENABLED 0x00010000
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index 8caccbbd7..9b8e9b96a 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -35,6 +35,10 @@ config CPU_IDLE_GOV_TEO
config DT_IDLE_STATES
bool
+config CPU_IDLE_AUTO_PROMOTION
+ bool
+ default y if PPC_POWERNV
+
menu "ARM CPU Idle Drivers"
depends on ARM || ARM64
source "drivers/cpuidle/Kconfig.arm"
diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 84b1ebe21..0dd767270 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -299,6 +299,7 @@ static int powernv_add_idle_states(void)
for (i = 0; i < dt_idle_states; i++) {
unsigned int exit_latency, target_residency;
bool stops_timebase = false;
+ bool lose_user_context = false;
struct pnv_idle_states_t *state = &pnv_idle_states[i];
/*
@@ -324,6 +325,9 @@ static int powernv_add_idle_states(void)
if (has_stop_states && !(state->valid))
continue;
+ if (state->flags & OPAL_PM_LOSE_USER_CONTEXT)
+ lose_user_context = true;
+
if (state->flags & OPAL_PM_TIMEBASE_STOP)
stops_timebase = true;
@@ -332,12 +336,17 @@ static int powernv_add_idle_states(void)
add_powernv_state(nr_idle_states, "Nap",
CPUIDLE_FLAG_NONE, nap_loop,
target_residency, exit_latency, 0, 0);
+ } else if (has_stop_states && !lose_user_context) {
+ add_powernv_state(nr_idle_states, state->name,
+ CPUIDLE_FLAG_AUTO_PROMOTION,
+ stop_loop, target_residency,
+ exit_latency, state->psscr_val,
+ state->psscr_mask);
} else if (has_stop_states && !stops_timebase) {
add_powernv_state(nr_idle_states, state->name,
CPUIDLE_FLAG_NONE, stop_loop,
target_residency, exit_latency,
- state->psscr_val,
- state->psscr_mask);
+ state->psscr_val, state->psscr_mask);
}
/*
--
2.17.1
Hi Abhishek,
> Currently, the cpuidle governors (menu /ladder) determine what idle state
> an idling CPU should enter into based on heuristics that depend on the
> idle history on that CPU. Given that no predictive heuristic is perfect,
> there are cases where the governor predicts a shallow idle state, hoping
> that the CPU will be busy soon. However, if no new workload is scheduled
> on that CPU in the near future, the CPU will end up in the shallow state.
>
> In case of POWER, this is problematic, when the predicted state in the
> aforementioned scenario is a lite stop state, as such lite states will
> inhibit SMT folding, thereby depriving the other threads in the core from
> using the core resources.
>
> To address this, such lite states need to be autopromoted. The cpuidle-
> core can queue timer to correspond with the residency value of the next
> available state. Thus leading to auto-promotion to a deeper idle state as
> soon as possible.
>
This sounds sensible to me, although I'm not really qualified to offer a
full power-management opinion on it. I have some general code questions
and comments, however, which are below:
> Signed-off-by: Abhishek Goel <[email protected]>
> ---
>
> v1->v2 : Removed timeout_needed and rebased to current upstream kernel
>
> drivers/cpuidle/cpuidle.c | 68 +++++++++++++++++++++++++++++-
> drivers/cpuidle/governors/ladder.c | 3 +-
> drivers/cpuidle/governors/menu.c | 22 +++++++++-
> include/linux/cpuidle.h | 10 ++++-
> 4 files changed, 99 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index 7f108309e..11ce43f19 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -36,6 +36,11 @@ static int enabled_devices;
> static int off __read_mostly;
> static int initialized __read_mostly;
>
> +struct auto_promotion {
> + struct hrtimer hrtimer;
> + unsigned long timeout_us;
> +};
> +
> int cpuidle_disabled(void)
> {
> return off;
> @@ -188,6 +193,54 @@ int cpuidle_enter_s2idle(struct cpuidle_driver *drv, struct cpuidle_device *dev)
> }
> #endif /* CONFIG_SUSPEND */
>
> +enum hrtimer_restart auto_promotion_hrtimer_callback(struct hrtimer *hrtimer)
> +{
> + return HRTIMER_NORESTART;
> +}
> +
> +#ifdef CONFIG_CPU_IDLE_AUTO_PROMOTION
As far as I can tell, this config flag isn't defined until the next
patch, making this dead code for now. Is this intentional?
> +DEFINE_PER_CPU(struct auto_promotion, ap);
A quick grep suggests that most per-cpu variable have more descriptive
names, perhaps this one should too.
> +
> +static void cpuidle_auto_promotion_start(int cpu, struct cpuidle_state *state)
> +{
> + struct auto_promotion *this_ap = &per_cpu(ap, cpu);
> +
> + if (state->flags & CPUIDLE_FLAG_AUTO_PROMOTION)
> + hrtimer_start(&this_ap->hrtimer, ns_to_ktime(this_ap->timeout_us
> + * 1000), HRTIMER_MODE_REL_PINNED);
Would it be clearer to have both sides of the multiplication on the same
line? i.e.
+ hrtimer_start(&this_ap->hrtimer,
+ ns_to_ktime(this_ap->timeout_us * 1000),
+ HRTIMER_MODE_REL_PINNED);
> +}
> +
> +static void cpuidle_auto_promotion_cancel(int cpu)
> +{
> + struct hrtimer *hrtimer;
> +
> + hrtimer = &per_cpu(ap, cpu).hrtimer;
> + if (hrtimer_is_queued(hrtimer))
> + hrtimer_cancel(hrtimer);
> +}
> +
> +static void cpuidle_auto_promotion_update(int cpu, unsigned long timeout)
> +{
> + per_cpu(ap, cpu).timeout_us = timeout;
> +}
> +
> +static void cpuidle_auto_promotion_init(int cpu, struct cpuidle_driver *drv)
> +{
> + struct auto_promotion *this_ap = &per_cpu(ap, cpu);
> +
> + hrtimer_init(&this_ap->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> + this_ap->hrtimer.function = auto_promotion_hrtimer_callback;
> +}
> +#else
> +static inline void cpuidle_auto_promotion_start(int cpu, struct cpuidle_state
> + *state) { }
> +static inline void cpuidle_auto_promotion_cancel(int cpu) { }
> +static inline void cpuidle_auto_promotion_update(int cpu, unsigned long
> + timeout) { }
> +static inline void cpuidle_auto_promotion_init(int cpu, struct cpuidle_driver
> + *drv) { }
Several of these have the type, then a line break, and then the name
(unsigned long\n timeout). This is a bit harder to read, they should
probably all be on the same line.
> +#endif
> +
> /**
> * cpuidle_enter_state - enter the state and update stats
> * @dev: cpuidle device for this cpu
> @@ -225,12 +278,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
> trace_cpu_idle_rcuidle(index, dev->cpu);
> time_start = ns_to_ktime(local_clock());
>
> + cpuidle_auto_promotion_start(dev->cpu, target_state);
> +
> stop_critical_timings();
> entered_state = target_state->enter(dev, drv, index);
> start_critical_timings();
>
> sched_clock_idle_wakeup_event();
> time_end = ns_to_ktime(local_clock());
> +
> + cpuidle_auto_promotion_cancel(dev->cpu);
> +
> trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
>
> /* The cpu is no longer idle or about to enter idle. */
> @@ -312,7 +370,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
> int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> bool *stop_tick)
> {
> - return cpuidle_curr_governor->select(drv, dev, stop_tick);
> + unsigned long timeout_us, ret;
> +
> + timeout_us = UINT_MAX;
> + ret = cpuidle_curr_governor->select(drv, dev, stop_tick, &timeout_us);
> + cpuidle_auto_promotion_update(dev->cpu, timeout_us);
> +
> + return ret;
> }
>
> /**
> @@ -658,6 +722,8 @@ int cpuidle_register(struct cpuidle_driver *drv,
> device = &per_cpu(cpuidle_dev, cpu);
> device->cpu = cpu;
>
> + cpuidle_auto_promotion_init(cpu, drv);
> +
> #ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
> /*
> * On multiplatform for ARM, the coupled idle states could be
> diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c
> index f0dddc66a..65b518dd7 100644
> --- a/drivers/cpuidle/governors/ladder.c
> +++ b/drivers/cpuidle/governors/ladder.c
> @@ -64,7 +64,8 @@ static inline void ladder_do_selection(struct ladder_device *ldev,
> * @dummy: not used
I think you need an addition to the docstring for your new variable.
> */
> static int ladder_select_state(struct cpuidle_driver *drv,
> - struct cpuidle_device *dev, bool *dummy)
> + struct cpuidle_device *dev, bool *dummy,
> + unsigned long *unused)
> {
> struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
> struct ladder_device_state *last_state;
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index 5951604e7..835e337de 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -276,7 +276,7 @@ static unsigned int get_typical_interval(struct menu_device *data,
> * @stop_tick: indication on whether or not to stop the tick
Likewise here.
> */
> static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> - bool *stop_tick)
> + bool *stop_tick, unsigned long *timeout)
> {
> struct menu_device *data = this_cpu_ptr(&menu_devices);
> int latency_req = cpuidle_governor_latency_req(dev->cpu);
> @@ -442,6 +442,26 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> }
> }
>
> +#ifdef CPUIDLE_FLAG_AUTO_PROMOTION
> + if (drv->states[idx].flags & CPUIDLE_FLAG_AUTO_PROMOTION) {
> + /*
> + * Timeout is intended to be defined as sum of target residency
> + * of next available state, entry latency and exit latency. If
> + * time interval equal to timeout is spent in current state,
> + * and if it is a shallow lite state, we may want to auto-
> + * promote from such state.
This comment makes sense if you already understand auto-promotion. That's
fair enough - you wrote it and you presumably understand what your code
does :) But for me it's a bit confusing! I think you want to start with
a sentence about what autopromotion is (preferably not using
power-specific terminology) and then explain the calculation of the
timeouts.
> + */
> + for (i = idx + 1; i < drv->state_count; i++) {
> + if (drv->states[i].disabled ||
> + dev->states_usage[i].disable)
> + continue;
> + *timeout = drv->states[i].target_residency +
> + 2 * drv->states[i].exit_latency;
> + break;
> + }
> + }
> +#endif
> +
> return idx;
> }
>
> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
> index 3b3947232..84d76d1ec 100644
> --- a/include/linux/cpuidle.h
> +++ b/include/linux/cpuidle.h
> @@ -72,6 +72,13 @@ struct cpuidle_state {
> #define CPUIDLE_FLAG_POLLING BIT(0) /* polling state */
> #define CPUIDLE_FLAG_COUPLED BIT(1) /* state applies to multiple cpus */
> #define CPUIDLE_FLAG_TIMER_STOP BIT(2) /* timer is stopped on this state */
> +/*
> + * State with only and only fast state bit set don't even lose user context.
"only and only"?
> + * But such states prevent other sibling threads from thread folding benefits.
> + * And hence we don't want to stay for too long in such states and want to
> + * auto-promote from it.
I think this comment mixes Power-specific and generic concepts. (But I'm
not a PM expert so tell me if I'm wrong here.) I think, if I've
understood correctly: in the generic code, the bit represents a state
that we do not want to linger in, which we want to definitely leave
after some time. On Power, we have a state that doesn't lose user
context but which prevents thread folding, so this is an example of a
state where we want to auto-promote.
> + */
> +#define CPUIDLE_FLAG_AUTO_PROMOTION BIT(3)
>
> struct cpuidle_device_kobj;
> struct cpuidle_state_kobj;
> @@ -243,7 +250,8 @@ struct cpuidle_governor {
>
> int (*select) (struct cpuidle_driver *drv,
> struct cpuidle_device *dev,
> - bool *stop_tick);
> + bool *stop_tick, unsigned long
> + *timeout);
> void (*reflect) (struct cpuidle_device *dev, int index);
> };
>
> --
> 2.17.1
Abhishek Goel <[email protected]> writes:
> This patch sets up flags for the state which needs to be auto-promoted. On
> POWERNV system, only lite states need to be autopromoted. We identify lite
> states by those which do not lose user context. That information has been
> used to set the flag for lite states.
>
> Signed-off-by: Abhishek Goel <[email protected]>
> ---
> arch/powerpc/include/asm/opal-api.h | 1 +
> drivers/cpuidle/Kconfig | 4 ++++
> drivers/cpuidle/cpuidle-powernv.c | 13 +++++++++++--
> 3 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
> index 870fb7b23..735dec731 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -226,6 +226,7 @@
> */
>
> #define OPAL_PM_TIMEBASE_STOP 0x00000002
> +#define OPAL_PM_LOSE_USER_CONTEXT 0x00001000
Is the important thing that you don't lose user context, or that the
state prevents thread folding? From your description, it seems from a
power managment point of view that the important thing is that the state
prevents thread folding, and it seems almost coincidental that it
preserves user context.
If this is mirrored from the way hardware or opal describes it (which
looking at the other flags it looks like it might be), it would be worth
adding a comment that explains why we want to leave such a state.
> #define OPAL_PM_LOSE_HYP_CONTEXT 0x00002000
> #define OPAL_PM_LOSE_FULL_CONTEXT 0x00004000
> #define OPAL_PM_NAP_ENABLED 0x00010000
> diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
> index 8caccbbd7..9b8e9b96a 100644
> --- a/drivers/cpuidle/Kconfig
> +++ b/drivers/cpuidle/Kconfig
> @@ -35,6 +35,10 @@ config CPU_IDLE_GOV_TEO
> config DT_IDLE_STATES
> bool
>
> +config CPU_IDLE_AUTO_PROMOTION
> + bool
> + default y if PPC_POWERNV
> +
As I mentioned in the previous patch, this is used in the previous
patch. It's also not used here.
Regards,
Daniel
> menu "ARM CPU Idle Drivers"
> depends on ARM || ARM64
> source "drivers/cpuidle/Kconfig.arm"
> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
> index 84b1ebe21..0dd767270 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -299,6 +299,7 @@ static int powernv_add_idle_states(void)
> for (i = 0; i < dt_idle_states; i++) {
> unsigned int exit_latency, target_residency;
> bool stops_timebase = false;
> + bool lose_user_context = false;
> struct pnv_idle_states_t *state = &pnv_idle_states[i];
>
> /*
> @@ -324,6 +325,9 @@ static int powernv_add_idle_states(void)
> if (has_stop_states && !(state->valid))
> continue;
>
> + if (state->flags & OPAL_PM_LOSE_USER_CONTEXT)
> + lose_user_context = true;
> +
> if (state->flags & OPAL_PM_TIMEBASE_STOP)
> stops_timebase = true;
>
> @@ -332,12 +336,17 @@ static int powernv_add_idle_states(void)
> add_powernv_state(nr_idle_states, "Nap",
> CPUIDLE_FLAG_NONE, nap_loop,
> target_residency, exit_latency, 0, 0);
> + } else if (has_stop_states && !lose_user_context) {
> + add_powernv_state(nr_idle_states, state->name,
> + CPUIDLE_FLAG_AUTO_PROMOTION,
> + stop_loop, target_residency,
> + exit_latency, state->psscr_val,
> + state->psscr_mask);
> } else if (has_stop_states && !stops_timebase) {
> add_powernv_state(nr_idle_states, state->name,
> CPUIDLE_FLAG_NONE, stop_loop,
> target_residency, exit_latency,
> - state->psscr_val,
> - state->psscr_mask);
> + state->psscr_val, state->psscr_mask);
> }
>
> /*
> --
> 2.17.1
Hi,
Sorry, just realised another thing I wanted to ask:
> @@ -442,6 +442,26 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> }
> }
>
>
> +#ifdef CPUIDLE_FLAG_AUTO_PROMOTION
Why is this based on CPUIDLE_FLAG_ rather than CONFIG_CPU_IDLE_? Won't
this always be true, given that the flag is defined regardless of the
config option in the header?
> + if (drv->states[idx].flags & CPUIDLE_FLAG_AUTO_PROMOTION) {
> + /*
> + * Timeout is intended to be defined as sum of target residency
> + * of next available state, entry latency and exit latency. If
> + * time interval equal to timeout is spent in current state,
> + * and if it is a shallow lite state, we may want to auto-
> + * promote from such state.
> + */
Regards,
Daniel
> + for (i = idx + 1; i < drv->state_count; i++) {
> + if (drv->states[i].disabled ||
> + dev->states_usage[i].disable)
> + continue;
> + *timeout = drv->states[i].target_residency +
> + 2 * drv->states[i].exit_latency;
> + break;
> + }
> + }
> +#endif
> +
> return idx;
> }
>
> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
> index 3b3947232..84d76d1ec 100644
> --- a/include/linux/cpuidle.h
> +++ b/include/linux/cpuidle.h
> @@ -72,6 +72,13 @@ struct cpuidle_state {
> #define CPUIDLE_FLAG_POLLING BIT(0) /* polling state */
> #define CPUIDLE_FLAG_COUPLED BIT(1) /* state applies to multiple cpus */
> #define CPUIDLE_FLAG_TIMER_STOP BIT(2) /* timer is stopped on this state */
> +/*
> + * State with only and only fast state bit set don't even lose user context.
> + * But such states prevent other sibling threads from thread folding benefits.
> + * And hence we don't want to stay for too long in such states and want to
> + * auto-promote from it.
> + */
> +#define CPUIDLE_FLAG_AUTO_PROMOTION BIT(3)
>
> struct cpuidle_device_kobj;
> struct cpuidle_state_kobj;
> @@ -243,7 +250,8 @@ struct cpuidle_governor {
>
> int (*select) (struct cpuidle_driver *drv,
> struct cpuidle_device *dev,
> - bool *stop_tick);
> + bool *stop_tick, unsigned long
> + *timeout);
> void (*reflect) (struct cpuidle_device *dev, int index);
> };
>
> --
> 2.17.1
Hi Daniel,
Thanks for such a descriptive review. I will include all the suggestions
made in my next iteration.
--Abhishek
On 04/08/2019 07:42 PM, Daniel Axtens wrote:
> Hi Abhishek,
>
>> Currently, the cpuidle governors (menu /ladder) determine what idle state
>> an idling CPU should enter into based on heuristics that depend on the
>> idle history on that CPU. Given that no predictive heuristic is perfect,
>> there are cases where the governor predicts a shallow idle state, hoping
>> that the CPU will be busy soon. However, if no new workload is scheduled
>> on that CPU in the near future, the CPU will end up in the shallow state.
>>
>> In case of POWER, this is problematic, when the predicted state in the
>> aforementioned scenario is a lite stop state, as such lite states will
>> inhibit SMT folding, thereby depriving the other threads in the core from
>> using the core resources.
>>
>> To address this, such lite states need to be autopromoted. The cpuidle-
>> core can queue timer to correspond with the residency value of the next
>> available state. Thus leading to auto-promotion to a deeper idle state as
>> soon as possible.
>>
> This sounds sensible to me, although I'm not really qualified to offer a
> full power-management opinion on it. I have some general code questions
> and comments, however, which are below:
>
>> Signed-off-by: Abhishek Goel <[email protected]>
>> ---
>>
>> v1->v2 : Removed timeout_needed and rebased to current upstream kernel
>>
>> drivers/cpuidle/cpuidle.c | 68 +++++++++++++++++++++++++++++-
>> drivers/cpuidle/governors/ladder.c | 3 +-
>> drivers/cpuidle/governors/menu.c | 22 +++++++++-
>> include/linux/cpuidle.h | 10 ++++-
>> 4 files changed, 99 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
>> index 7f108309e..11ce43f19 100644
>> --- a/drivers/cpuidle/cpuidle.c
>> +++ b/drivers/cpuidle/cpuidle.c
>> @@ -36,6 +36,11 @@ static int enabled_devices;
>> static int off __read_mostly;
>> static int initialized __read_mostly;
>>
>> +struct auto_promotion {
>> + struct hrtimer hrtimer;
>> + unsigned long timeout_us;
>> +};
>> +
>> int cpuidle_disabled(void)
>> {
>> return off;
>> @@ -188,6 +193,54 @@ int cpuidle_enter_s2idle(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>> }
>> #endif /* CONFIG_SUSPEND */
>>
>> +enum hrtimer_restart auto_promotion_hrtimer_callback(struct hrtimer *hrtimer)
>> +{
>> + return HRTIMER_NORESTART;
>> +}
>> +
>> +#ifdef CONFIG_CPU_IDLE_AUTO_PROMOTION
> As far as I can tell, this config flag isn't defined until the next
> patch, making this dead code for now. Is this intentional?
>
>> +DEFINE_PER_CPU(struct auto_promotion, ap);
> A quick grep suggests that most per-cpu variable have more descriptive
> names, perhaps this one should too.
>
>> +
>> +static void cpuidle_auto_promotion_start(int cpu, struct cpuidle_state *state)
>> +{
>> + struct auto_promotion *this_ap = &per_cpu(ap, cpu);
>> +
>> + if (state->flags & CPUIDLE_FLAG_AUTO_PROMOTION)
>> + hrtimer_start(&this_ap->hrtimer, ns_to_ktime(this_ap->timeout_us
>> + * 1000), HRTIMER_MODE_REL_PINNED);
> Would it be clearer to have both sides of the multiplication on the same
> line? i.e.
> + hrtimer_start(&this_ap->hrtimer,
> + ns_to_ktime(this_ap->timeout_us * 1000),
> + HRTIMER_MODE_REL_PINNED);
>
>> +}
>> +
>> +static void cpuidle_auto_promotion_cancel(int cpu)
>> +{
>> + struct hrtimer *hrtimer;
>> +
>> + hrtimer = &per_cpu(ap, cpu).hrtimer;
>> + if (hrtimer_is_queued(hrtimer))
>> + hrtimer_cancel(hrtimer);
>> +}
>> +
>> +static void cpuidle_auto_promotion_update(int cpu, unsigned long timeout)
>> +{
>> + per_cpu(ap, cpu).timeout_us = timeout;
>> +}
>> +
>> +static void cpuidle_auto_promotion_init(int cpu, struct cpuidle_driver *drv)
>> +{
>> + struct auto_promotion *this_ap = &per_cpu(ap, cpu);
>> +
>> + hrtimer_init(&this_ap->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> + this_ap->hrtimer.function = auto_promotion_hrtimer_callback;
>> +}
>> +#else
>> +static inline void cpuidle_auto_promotion_start(int cpu, struct cpuidle_state
>> + *state) { }
>> +static inline void cpuidle_auto_promotion_cancel(int cpu) { }
>> +static inline void cpuidle_auto_promotion_update(int cpu, unsigned long
>> + timeout) { }
>> +static inline void cpuidle_auto_promotion_init(int cpu, struct cpuidle_driver
>> + *drv) { }
> Several of these have the type, then a line break, and then the name
> (unsigned long\n timeout). This is a bit harder to read, they should
> probably all be on the same line.
>
>> +#endif
>> +
>> /**
>> * cpuidle_enter_state - enter the state and update stats
>> * @dev: cpuidle device for this cpu
>> @@ -225,12 +278,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
>> trace_cpu_idle_rcuidle(index, dev->cpu);
>> time_start = ns_to_ktime(local_clock());
>>
>> + cpuidle_auto_promotion_start(dev->cpu, target_state);
>> +
>> stop_critical_timings();
>> entered_state = target_state->enter(dev, drv, index);
>> start_critical_timings();
>>
>> sched_clock_idle_wakeup_event();
>> time_end = ns_to_ktime(local_clock());
>> +
>> + cpuidle_auto_promotion_cancel(dev->cpu);
>> +
>> trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
>>
>> /* The cpu is no longer idle or about to enter idle. */
>> @@ -312,7 +370,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
>> int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>> bool *stop_tick)
>> {
>> - return cpuidle_curr_governor->select(drv, dev, stop_tick);
>> + unsigned long timeout_us, ret;
>> +
>> + timeout_us = UINT_MAX;
>> + ret = cpuidle_curr_governor->select(drv, dev, stop_tick, &timeout_us);
>> + cpuidle_auto_promotion_update(dev->cpu, timeout_us);
>> +
>> + return ret;
>> }
>>
>> /**
>> @@ -658,6 +722,8 @@ int cpuidle_register(struct cpuidle_driver *drv,
>> device = &per_cpu(cpuidle_dev, cpu);
>> device->cpu = cpu;
>>
>> + cpuidle_auto_promotion_init(cpu, drv);
>> +
>> #ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
>> /*
>> * On multiplatform for ARM, the coupled idle states could be
>> diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c
>> index f0dddc66a..65b518dd7 100644
>> --- a/drivers/cpuidle/governors/ladder.c
>> +++ b/drivers/cpuidle/governors/ladder.c
>> @@ -64,7 +64,8 @@ static inline void ladder_do_selection(struct ladder_device *ldev,
>> * @dummy: not used
> I think you need an addition to the docstring for your new variable.
>
>> */
>> static int ladder_select_state(struct cpuidle_driver *drv,
>> - struct cpuidle_device *dev, bool *dummy)
>> + struct cpuidle_device *dev, bool *dummy,
>> + unsigned long *unused)
>> {
>> struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
>> struct ladder_device_state *last_state;
>> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
>> index 5951604e7..835e337de 100644
>> --- a/drivers/cpuidle/governors/menu.c
>> +++ b/drivers/cpuidle/governors/menu.c
>> @@ -276,7 +276,7 @@ static unsigned int get_typical_interval(struct menu_device *data,
>> * @stop_tick: indication on whether or not to stop the tick
> Likewise here.
>
>> */
>> static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>> - bool *stop_tick)
>> + bool *stop_tick, unsigned long *timeout)
>> {
>> struct menu_device *data = this_cpu_ptr(&menu_devices);
>> int latency_req = cpuidle_governor_latency_req(dev->cpu);
>> @@ -442,6 +442,26 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>> }
>> }
>>
>> +#ifdef CPUIDLE_FLAG_AUTO_PROMOTION
>> + if (drv->states[idx].flags & CPUIDLE_FLAG_AUTO_PROMOTION) {
>> + /*
>> + * Timeout is intended to be defined as sum of target residency
>> + * of next available state, entry latency and exit latency. If
>> + * time interval equal to timeout is spent in current state,
>> + * and if it is a shallow lite state, we may want to auto-
>> + * promote from such state.
> This comment makes sense if you already understand auto-promotion. That's
> fair enough - you wrote it and you presumably understand what your code
> does :) But for me it's a bit confusing! I think you want to start with
> a sentence about what autopromotion is (preferably not using
> power-specific terminology) and then explain the calculation of the
> timeouts.
>
>> + */
>> + for (i = idx + 1; i < drv->state_count; i++) {
>> + if (drv->states[i].disabled ||
>> + dev->states_usage[i].disable)
>> + continue;
>> + *timeout = drv->states[i].target_residency +
>> + 2 * drv->states[i].exit_latency;
>> + break;
>> + }
>> + }
>> +#endif
>> +
>> return idx;
>> }
>>
>> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
>> index 3b3947232..84d76d1ec 100644
>> --- a/include/linux/cpuidle.h
>> +++ b/include/linux/cpuidle.h
>> @@ -72,6 +72,13 @@ struct cpuidle_state {
>> #define CPUIDLE_FLAG_POLLING BIT(0) /* polling state */
>> #define CPUIDLE_FLAG_COUPLED BIT(1) /* state applies to multiple cpus */
>> #define CPUIDLE_FLAG_TIMER_STOP BIT(2) /* timer is stopped on this state */
>> +/*
>> + * State with only and only fast state bit set don't even lose user context.
> "only and only"?
>> + * But such states prevent other sibling threads from thread folding benefits.
>> + * And hence we don't want to stay for too long in such states and want to
>> + * auto-promote from it.
> I think this comment mixes Power-specific and generic concepts. (But I'm
> not a PM expert so tell me if I'm wrong here.) I think, if I've
> understood correctly: in the generic code, the bit represents a state
> that we do not want to linger in, which we want to definitely leave
> after some time. On Power, we have a state that doesn't lose user
> context but which prevents thread folding, so this is an example of a
> state where we want to auto-promote.
>
>> + */
>> +#define CPUIDLE_FLAG_AUTO_PROMOTION BIT(3)
>>
>> struct cpuidle_device_kobj;
>> struct cpuidle_state_kobj;
>> @@ -243,7 +250,8 @@ struct cpuidle_governor {
>>
>> int (*select) (struct cpuidle_driver *drv,
>> struct cpuidle_device *dev,
>> - bool *stop_tick);
>> + bool *stop_tick, unsigned long
>> + *timeout);
>> void (*reflect) (struct cpuidle_device *dev, int index);
>> };
>>
>> --
>> 2.17.1
On Tue, Apr 9, 2019 at 11:29 AM Abhishek <[email protected]> wrote:
>
> Hi Daniel,
>
> Thanks for such a descriptive review. I will include all the suggestions
> made in my next iteration.
Please give me some time to send comments before that.
Hi Daniel,
On 04/08/2019 07:55 PM, Daniel Axtens wrote:
> Hi,
>
> Sorry, just realised another thing I wanted to ask:
>
>> @@ -442,6 +442,26 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>> }
>> }
>>
>>
>> +#ifdef CPUIDLE_FLAG_AUTO_PROMOTION
> Why is this based on CPUIDLE_FLAG_ rather than CONFIG_CPU_IDLE_? Won't
> this always be true, given that the flag is defined regardless of the
> config option in the header?
Yeah, You are right. This should have been CONFIG_CPU_IDLE_AUTO_PROMOTION.
--Abhishek
On 04/09/2019 03:00 PM, Rafael J. Wysocki wrote:
> On Tue, Apr 9, 2019 at 11:29 AM Abhishek <[email protected]> wrote:
>> Hi Daniel,
>>
>> Thanks for such a descriptive review. I will include all the suggestions
>> made in my next iteration.
> Please give me some time to send comments before that.
Sure, I will wait for your review.
On Fri, Apr 5, 2019 at 11:17 AM Abhishek Goel
<[email protected]> wrote:
>
> Currently, the cpuidle governors (menu /ladder) determine what idle state
There are three governors in 5.1-rc.
> an idling CPU should enter into based on heuristics that depend on the
> idle history on that CPU. Given that no predictive heuristic is perfect,
> there are cases where the governor predicts a shallow idle state, hoping
> that the CPU will be busy soon. However, if no new workload is scheduled
> on that CPU in the near future, the CPU will end up in the shallow state.
>
> In case of POWER, this is problematic, when the predicted state in the
> aforementioned scenario is a lite stop state, as such lite states will
> inhibit SMT folding, thereby depriving the other threads in the core from
> using the core resources.
>
> To address this, such lite states need to be autopromoted.
I don't quite agree with this statement and it doesn't even match what
the patch does AFAICS. "Autopromotion" would be going from the given
state to a deeper one without running state selection in between, but
that's not what's going on here.
> The cpuidle-core can queue timer to correspond with the residency value of the next
> available state. Thus leading to auto-promotion to a deeper idle state as
> soon as possible.
No, it doesn't automatically cause a deeper state to be used next
time. It simply kicks the CPU out of the idle state and one more
iteration of the idle loop runs on it. Whether or not a deeper state
will be selected in that iteration depends on the governor
computations carried out in it.
Now, this appears to be almost analogous to the "polling" state used
on x86 which uses the next idle state's target residency as a timeout.
While generally I'm not a big fan of setting up timers in the idle
loop (it sort of feels like pulling your own hair in order to get
yourself out of a swamp), if idle states like these are there in your
platform, setting up a timer to get out of them in the driver's
->enter() routine might not be particularly objectionable. Doing that
in the core is a whole different story, though.
Generally, this adds quite a bit of complexity (on the "ugly" side of
things IMO) to the core to cover a corner case present in one
platform, while IMO it can be covered in the driver for that platform
directly.
> Signed-off-by: Abhishek Goel <[email protected]>
> ---
>
> v1->v2 : Removed timeout_needed and rebased to current upstream kernel
>
> drivers/cpuidle/cpuidle.c | 68 +++++++++++++++++++++++++++++-
> drivers/cpuidle/governors/ladder.c | 3 +-
> drivers/cpuidle/governors/menu.c | 22 +++++++++-
> include/linux/cpuidle.h | 10 ++++-
> 4 files changed, 99 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index 7f108309e..11ce43f19 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -36,6 +36,11 @@ static int enabled_devices;
> static int off __read_mostly;
> static int initialized __read_mostly;
>
> +struct auto_promotion {
> + struct hrtimer hrtimer;
> + unsigned long timeout_us;
> +};
> +
> int cpuidle_disabled(void)
> {
> return off;
> @@ -188,6 +193,54 @@ int cpuidle_enter_s2idle(struct cpuidle_driver *drv, struct cpuidle_device *dev)
> }
> #endif /* CONFIG_SUSPEND */
>
> +enum hrtimer_restart auto_promotion_hrtimer_callback(struct hrtimer *hrtimer)
> +{
> + return HRTIMER_NORESTART;
> +}
> +
> +#ifdef CONFIG_CPU_IDLE_AUTO_PROMOTION
> +DEFINE_PER_CPU(struct auto_promotion, ap);
> +
> +static void cpuidle_auto_promotion_start(int cpu, struct cpuidle_state *state)
> +{
> + struct auto_promotion *this_ap = &per_cpu(ap, cpu);
> +
> + if (state->flags & CPUIDLE_FLAG_AUTO_PROMOTION)
> + hrtimer_start(&this_ap->hrtimer, ns_to_ktime(this_ap->timeout_us
> + * 1000), HRTIMER_MODE_REL_PINNED);
> +}
> +
> +static void cpuidle_auto_promotion_cancel(int cpu)
> +{
> + struct hrtimer *hrtimer;
> +
> + hrtimer = &per_cpu(ap, cpu).hrtimer;
> + if (hrtimer_is_queued(hrtimer))
> + hrtimer_cancel(hrtimer);
> +}
> +
> +static void cpuidle_auto_promotion_update(int cpu, unsigned long timeout)
> +{
> + per_cpu(ap, cpu).timeout_us = timeout;
> +}
> +
> +static void cpuidle_auto_promotion_init(int cpu, struct cpuidle_driver *drv)
> +{
> + struct auto_promotion *this_ap = &per_cpu(ap, cpu);
> +
> + hrtimer_init(&this_ap->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> + this_ap->hrtimer.function = auto_promotion_hrtimer_callback;
> +}
> +#else
> +static inline void cpuidle_auto_promotion_start(int cpu, struct cpuidle_state
> + *state) { }
> +static inline void cpuidle_auto_promotion_cancel(int cpu) { }
> +static inline void cpuidle_auto_promotion_update(int cpu, unsigned long
> + timeout) { }
> +static inline void cpuidle_auto_promotion_init(int cpu, struct cpuidle_driver
> + *drv) { }
> +#endif
> +
> /**
> * cpuidle_enter_state - enter the state and update stats
> * @dev: cpuidle device for this cpu
> @@ -225,12 +278,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
> trace_cpu_idle_rcuidle(index, dev->cpu);
> time_start = ns_to_ktime(local_clock());
>
> + cpuidle_auto_promotion_start(dev->cpu, target_state);
First off, I wouldn't call it "auto-promotion", because it just adds a
timeout to trigger if the CPU spends too much time in the target
state.
Second, and more important, I don't see why this cannot be done in
target_state->enter() just for the state in which it is needed (in
analogy with the "polling" state).
> +
> stop_critical_timings();
> entered_state = target_state->enter(dev, drv, index);
> start_critical_timings();
>
> sched_clock_idle_wakeup_event();
> time_end = ns_to_ktime(local_clock());
> +
> + cpuidle_auto_promotion_cancel(dev->cpu);
> +
> trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
>
> /* The cpu is no longer idle or about to enter idle. */
> @@ -312,7 +370,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
> int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> bool *stop_tick)
> {
> - return cpuidle_curr_governor->select(drv, dev, stop_tick);
> + unsigned long timeout_us, ret;
> +
> + timeout_us = UINT_MAX;
> + ret = cpuidle_curr_governor->select(drv, dev, stop_tick, &timeout_us);
> + cpuidle_auto_promotion_update(dev->cpu, timeout_us);
> +
> + return ret;
> }
>
> /**
> @@ -658,6 +722,8 @@ int cpuidle_register(struct cpuidle_driver *drv,
> device = &per_cpu(cpuidle_dev, cpu);
> device->cpu = cpu;
>
> + cpuidle_auto_promotion_init(cpu, drv);
> +
> #ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
> /*
> * On multiplatform for ARM, the coupled idle states could be
> diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c
> index f0dddc66a..65b518dd7 100644
> --- a/drivers/cpuidle/governors/ladder.c
> +++ b/drivers/cpuidle/governors/ladder.c
> @@ -64,7 +64,8 @@ static inline void ladder_do_selection(struct ladder_device *ldev,
> * @dummy: not used
> */
> static int ladder_select_state(struct cpuidle_driver *drv,
> - struct cpuidle_device *dev, bool *dummy)
> + struct cpuidle_device *dev, bool *dummy,
> + unsigned long *unused)
> {
> struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
> struct ladder_device_state *last_state;
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index 5951604e7..835e337de 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -276,7 +276,7 @@ static unsigned int get_typical_interval(struct menu_device *data,
> * @stop_tick: indication on whether or not to stop the tick
> */
> static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> - bool *stop_tick)
> + bool *stop_tick, unsigned long *timeout)
> {
> struct menu_device *data = this_cpu_ptr(&menu_devices);
> int latency_req = cpuidle_governor_latency_req(dev->cpu);
> @@ -442,6 +442,26 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> }
> }
>
> +#ifdef CPUIDLE_FLAG_AUTO_PROMOTION
> + if (drv->states[idx].flags & CPUIDLE_FLAG_AUTO_PROMOTION) {
> + /*
> + * Timeout is intended to be defined as sum of target residency
> + * of next available state, entry latency and exit latency. If
> + * time interval equal to timeout is spent in current state,
> + * and if it is a shallow lite state, we may want to auto-
> + * promote from such state.
> + */
> + for (i = idx + 1; i < drv->state_count; i++) {
> + if (drv->states[i].disabled ||
> + dev->states_usage[i].disable)
> + continue;
> + *timeout = drv->states[i].target_residency +
> + 2 * drv->states[i].exit_latency;
> + break;
> + }
> + }
> +#endif
Why do you need to do the above in the governor at all?
The driver's ->enter() callback knows what state has been selected, it
can check what the next available state is and set up the timer
accordingly.
I don't see the need to pass the timeout from the governor to the core
anyway when the next thing the governor does is to return the selected
state index.
> +
> return idx;
> }
>
> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
> index 3b3947232..84d76d1ec 100644
> --- a/include/linux/cpuidle.h
> +++ b/include/linux/cpuidle.h
> @@ -72,6 +72,13 @@ struct cpuidle_state {
> #define CPUIDLE_FLAG_POLLING BIT(0) /* polling state */
> #define CPUIDLE_FLAG_COUPLED BIT(1) /* state applies to multiple cpus */
> #define CPUIDLE_FLAG_TIMER_STOP BIT(2) /* timer is stopped on this state */
> +/*
> + * State with only and only fast state bit set don't even lose user context.
> + * But such states prevent other sibling threads from thread folding benefits.
> + * And hence we don't want to stay for too long in such states and want to
> + * auto-promote from it.
> + */
> +#define CPUIDLE_FLAG_AUTO_PROMOTION BIT(3)
>
> struct cpuidle_device_kobj;
> struct cpuidle_state_kobj;
> @@ -243,7 +250,8 @@ struct cpuidle_governor {
>
> int (*select) (struct cpuidle_driver *drv,
> struct cpuidle_device *dev,
> - bool *stop_tick);
> + bool *stop_tick, unsigned long
> + *timeout);
> void (*reflect) (struct cpuidle_device *dev, int index);
> };
>
> --
> 2.17.1
>
Hi Rafael,
Thanks for the Review. Few inline replies below.
On 04/09/2019 03:31 PM, Rafael J. Wysocki wrote:
> On Fri, Apr 5, 2019 at 11:17 AM Abhishek Goel
> <[email protected]> wrote:
>> Currently, the cpuidle governors (menu /ladder) determine what idle state
> There are three governors in 5.1-rc.
>
>> an idling CPU should enter into based on heuristics that depend on the
>> idle history on that CPU. Given that no predictive heuristic is perfect,
>> there are cases where the governor predicts a shallow idle state, hoping
>> that the CPU will be busy soon. However, if no new workload is scheduled
>> on that CPU in the near future, the CPU will end up in the shallow state.
>>
>> In case of POWER, this is problematic, when the predicted state in the
>> aforementioned scenario is a lite stop state, as such lite states will
>> inhibit SMT folding, thereby depriving the other threads in the core from
>> using the core resources.
>>
>> To address this, such lite states need to be autopromoted.
> I don't quite agree with this statement and it doesn't even match what
> the patch does AFAICS. "Autopromotion" would be going from the given
> state to a deeper one without running state selection in between, but
> that's not what's going on here.
Thinking to call it "timed-exit". Is that good?
>> The cpuidle-core can queue timer to correspond with the residency value of the next
>> available state. Thus leading to auto-promotion to a deeper idle state as
>> soon as possible.
> No, it doesn't automatically cause a deeper state to be used next
> time. It simply kicks the CPU out of the idle state and one more
> iteration of the idle loop runs on it. Whether or not a deeper state
> will be selected in that iteration depends on the governor
> computations carried out in it.
I did not mean that next state is chosen automatically. I should have been
more descriptive here instead of just using "as soon as possible"
> Now, this appears to be almost analogous to the "polling" state used
> on x86 which uses the next idle state's target residency as a timeout.
>
> While generally I'm not a big fan of setting up timers in the idle
> loop (it sort of feels like pulling your own hair in order to get
> yourself out of a swamp), if idle states like these are there in your
> platform, setting up a timer to get out of them in the driver's
> ->enter() routine might not be particularly objectionable. Doing that
> in the core is a whole different story, though.
>
> Generally, this adds quite a bit of complexity (on the "ugly" side of
> things IMO) to the core to cover a corner case present in one
> platform, while IMO it can be covered in the driver for that platform
> directly.
As of now, since this code doesn't add any benefit to the other platform,
I will post a patch with this implementation covered in platform-specific
driver code.
You are right that all the information needed for this implementation
are also
available there in platform driver code, so we should be good to go.