2022-10-25 14:03:17

by Anna-Maria Behnsen

Subject: [PATCH v3 01/17] cpufreq: Prepare timer flags for hierarchical timer pull model

Note: This is a proposal only. I was waiting for input on how to change this
driver properly to use the already existing infrastructure. For details, see
the thread on the linux-pm mailing list:
https://lore.kernel.org/linux-pm/[email protected]/

The gpstates timer is the only timer that uses both the TIMER_PINNED and the
TIMER_DEFERRABLE flag. With the move to the hierarchical timer pull model,
pinned and deferrable timers are stored in separate bases.

To ensure that the gpstates timer always expires on the CPU it is pinned to,
keep only the TIMER_PINNED flag and drop the TIMER_DEFERRABLE flag.

While at it, rewrite the comment explaining the rule for the timer expiry of
the next interval and fix whitespace damage.

Signed-off-by: Anna-Maria Behnsen <[email protected]>
Cc: [email protected]
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Michael Ellerman <[email protected]>
---
drivers/cpufreq/powernv-cpufreq.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index fddbd1ea1635..08d6bd54539d 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -640,18 +640,18 @@ static inline int calc_global_pstate(unsigned int elapsed_time,
return highest_lpstate_idx + index_diff;
}

-static inline void queue_gpstate_timer(struct global_pstate_info *gpstates)
+static inline void queue_gpstate_timer(struct global_pstate_info *gpstates)
{
unsigned int timer_interval;

/*
- * Setting up timer to fire after GPSTATE_TIMER_INTERVAL ms, But
- * if it exceeds MAX_RAMP_DOWN_TIME ms for ramp down time.
- * Set timer such that it fires exactly at MAX_RAMP_DOWN_TIME
- * seconds of ramp down time.
+ * Timer should expire next time after GPSTATE_TIMER_INTERVAL. If
+ * the resulting interval (elapsed time + interval) between last
+ * and next timer expiry is greater than MAX_RAMP_DOWN_TIME, ensure
+ * it is maximum MAX_RAMP_DOWN_TIME when queueing the next timer.
*/
if ((gpstates->elapsed_time + GPSTATE_TIMER_INTERVAL)
- > MAX_RAMP_DOWN_TIME)
+ > MAX_RAMP_DOWN_TIME)
timer_interval = MAX_RAMP_DOWN_TIME - gpstates->elapsed_time;
else
timer_interval = GPSTATE_TIMER_INTERVAL;
@@ -865,8 +865,7 @@ static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)

/* initialize timer */
gpstates->policy = policy;
- timer_setup(&gpstates->timer, gpstate_timer_handler,
- TIMER_PINNED | TIMER_DEFERRABLE);
+ timer_setup(&gpstates->timer, gpstate_timer_handler, TIMER_PINNED);
gpstates->timer.expires = jiffies +
msecs_to_jiffies(GPSTATE_TIMER_INTERVAL);
spin_lock_init(&gpstates->gpstate_lock);
--
2.30.2
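
The clamping rule described in the rewritten comment can be illustrated with a
small standalone sketch. It is not the driver code: the helper name
next_timer_interval() is hypothetical, the millisecond values of the two
constants are assumptions for illustration, and the caller is assumed to pass
an elapsed time that does not exceed MAX_RAMP_DOWN_TIME.

```c
/* Assumed values in milliseconds, for illustration only. */
#define MAX_RAMP_DOWN_TIME      5120
#define GPSTATE_TIMER_INTERVAL  2000

/*
 * Hypothetical helper: return the delay until the next timer expiry.
 *
 * The timer normally fires after GPSTATE_TIMER_INTERVAL. If that would
 * push the total ramp down time past MAX_RAMP_DOWN_TIME, the delay is
 * shortened so that the timer fires at MAX_RAMP_DOWN_TIME instead.
 *
 * Assumes elapsed_time <= MAX_RAMP_DOWN_TIME; otherwise the unsigned
 * subtraction below would wrap around.
 */
unsigned int next_timer_interval(unsigned int elapsed_time)
{
	if (elapsed_time + GPSTATE_TIMER_INTERVAL > MAX_RAMP_DOWN_TIME)
		return MAX_RAMP_DOWN_TIME - elapsed_time;

	return GPSTATE_TIMER_INTERVAL;
}
```

With these assumed values, an elapsed time of 4000 ms leaves only 1120 ms
until the limit, so the next expiry is queued 1120 ms ahead rather than a full
interval.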



2022-10-26 14:11:15

by Frederic Weisbecker

Subject: Re: [PATCH v3 01/17] cpufreq: Prepare timer flags for hierarchical timer pull model

On Tue, Oct 25, 2022 at 03:58:34PM +0200, Anna-Maria Behnsen wrote:
> Note: This is a proposal only. I was waiting for input on how to change this
> driver properly to use the already existing infrastructure. For details, see
> the thread on the linux-pm mailing list:
> https://lore.kernel.org/linux-pm/[email protected]/
>
> The gpstates timer is the only timer that uses both the TIMER_PINNED and the
> TIMER_DEFERRABLE flag. With the move to the hierarchical timer pull model,
> pinned and deferrable timers are stored in separate bases.
>
> To ensure that the gpstates timer always expires on the CPU it is pinned to,
> keep only the TIMER_PINNED flag and drop the TIMER_DEFERRABLE flag.

OTOH there are deferrable timers out there that expect to run on a
specific CPU, because they are always queued with add_timer_on().

For example, work items declared with DECLARE_DEFERRABLE_WORK() that are
queued with queue_delayed_work_on(), such as vmstat.

Those are not explicitly pinned because they don't rely on __mod_timer(),
but they expect CPU affinity.

Thanks.

>
> While at it, rewrite the comment explaining the rule for the timer expiry of
> the next interval and fix whitespace damage.
>
> Signed-off-by: Anna-Maria Behnsen <[email protected]>
> Cc: [email protected]
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> ---
> drivers/cpufreq/powernv-cpufreq.c | 15 +++++++--------
> 1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index fddbd1ea1635..08d6bd54539d 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -640,18 +640,18 @@ static inline int calc_global_pstate(unsigned int elapsed_time,
> return highest_lpstate_idx + index_diff;
> }
>
> -static inline void queue_gpstate_timer(struct global_pstate_info *gpstates)
> +static inline void queue_gpstate_timer(struct global_pstate_info *gpstates)
> {
> unsigned int timer_interval;
>
> /*
> - * Setting up timer to fire after GPSTATE_TIMER_INTERVAL ms, But
> - * if it exceeds MAX_RAMP_DOWN_TIME ms for ramp down time.
> - * Set timer such that it fires exactly at MAX_RAMP_DOWN_TIME
> - * seconds of ramp down time.
> + * Timer should expire next time after GPSTATE_TIMER_INTERVAL. If
> + * the resulting interval (elapsed time + interval) between last
> + * and next timer expiry is greater than MAX_RAMP_DOWN_TIME, ensure
> + * it is maximum MAX_RAMP_DOWN_TIME when queueing the next timer.
> */
> if ((gpstates->elapsed_time + GPSTATE_TIMER_INTERVAL)
> - > MAX_RAMP_DOWN_TIME)
> + > MAX_RAMP_DOWN_TIME)
> timer_interval = MAX_RAMP_DOWN_TIME - gpstates->elapsed_time;
> else
> timer_interval = GPSTATE_TIMER_INTERVAL;
> @@ -865,8 +865,7 @@ static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
>
> /* initialize timer */
> gpstates->policy = policy;
> - timer_setup(&gpstates->timer, gpstate_timer_handler,
> - TIMER_PINNED | TIMER_DEFERRABLE);
> + timer_setup(&gpstates->timer, gpstate_timer_handler, TIMER_PINNED);
> gpstates->timer.expires = jiffies +
> msecs_to_jiffies(GPSTATE_TIMER_INTERVAL);
> spin_lock_init(&gpstates->gpstate_lock);
> --
> 2.30.2
>

2022-10-31 15:39:17

by Anna-Maria Behnsen

Subject: Re: [PATCH v3 01/17] cpufreq: Prepare timer flags for hierarchical timer pull model

On Wed, 26 Oct 2022, Frederic Weisbecker wrote:

> On Tue, Oct 25, 2022 at 03:58:34PM +0200, Anna-Maria Behnsen wrote:
> > Note: This is a proposal only. I was waiting for input on how to change this
> > driver properly to use the already existing infrastructure. For details, see
> > the thread on the linux-pm mailing list:
> > https://lore.kernel.org/linux-pm/[email protected]/
> >
> > The gpstates timer is the only timer that uses both the TIMER_PINNED and the
> > TIMER_DEFERRABLE flag. With the move to the hierarchical timer pull model,
> > pinned and deferrable timers are stored in separate bases.
> >
> > To ensure that the gpstates timer always expires on the CPU it is pinned to,
> > keep only the TIMER_PINNED flag and drop the TIMER_DEFERRABLE flag.
>
> OTOH there are deferrable timers out there that expect to run on a
> specific CPU, because they are always queued with add_timer_on().
>
> For example, work items declared with DECLARE_DEFERRABLE_WORK() that are
> queued with queue_delayed_work_on(), such as vmstat.
>
> Those are not explicitly pinned because they don't rely on __mod_timer(),
> but they expect CPU affinity.
>

You are right. In contrast to the original plan, I'm not able (yet) to
remove the deferrable timers completely. But all timers using the
add_timer_on() path need the TIMER_PINNED flag. Then three timer bases per
CPU will be available:

- global base (TIMER_PINNED is not set)
- local base (TIMER_PINNED is set but not TIMER_DEFERRABLE)
- deferrable pinned base (TIMER_PINNED and TIMER_DEFERRABLE are set)

The logic stays the same as already implemented in the patch queue: Timers in
the global base do not prevent the CPU from going idle. When the CPU has the
migrator duty, timers in the hierarchy are taken into account. Timers in the
local base force the CPU to wake up. Timers in the deferrable pinned base are
not taken into account when going idle.

With this, the rework of the cpufreq driver is no longer required: the timer
will end up in the deferrable pinned base, the same as the vmstat timer.

Thanks,

Anna-Maria
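
The mapping from timer flags to the three per-CPU bases listed above can be
sketched as a standalone C fragment. This is only an illustration of the
rules as stated in the mail, not the kernel implementation: the flag values
are stand-ins, and the enum and the function names select_timer_base() and
base_forces_wakeup() are hypothetical. A timer with only TIMER_DEFERRABLE set
is mapped to the global base here because the list defines the global base
simply as "TIMER_PINNED is not set"; that reading is an assumption.

```c
#include <stdbool.h>

/* Stand-in flag bits for this sketch; not taken from the kernel headers. */
#define TIMER_DEFERRABLE  0x1u
#define TIMER_PINNED      0x2u

/* Hypothetical names for the three per-CPU bases described above. */
enum timer_base_kind {
	BASE_GLOBAL,            /* TIMER_PINNED is not set */
	BASE_LOCAL,             /* TIMER_PINNED set, TIMER_DEFERRABLE not set */
	BASE_DEFERRABLE_PINNED, /* TIMER_PINNED and TIMER_DEFERRABLE set */
};

/* Pick the base for a timer purely from its flags. */
enum timer_base_kind select_timer_base(unsigned int flags)
{
	if (!(flags & TIMER_PINNED))
		return BASE_GLOBAL;

	if (flags & TIMER_DEFERRABLE)
		return BASE_DEFERRABLE_PINNED;

	return BASE_LOCAL;
}

/*
 * Only timers in the local base force the CPU itself to wake up.
 * Global timers are handled through the hierarchy (by the migrator),
 * and deferrable pinned timers are ignored when going idle.
 */
bool base_forces_wakeup(enum timer_base_kind base)
{
	return base == BASE_LOCAL;
}
```

Under this sketch, the gpstates timer with TIMER_PINNED | TIMER_DEFERRABLE
lands in the deferrable pinned base, which matches the conclusion that the
cpufreq driver does not need to change.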