LinuxLists.cc - [PATCH] sched: idle: Reenable sched tick for cpuidle request

2018-08-09 05:49:16

Subject: [PATCH] sched: idle: Reenable sched tick for cpuidle request

The idle loop stops tick by respecting the decision from cpuidle
framework, if the condition 'need_resched()' is false without any task
scheduling, the CPU keeps running in the loop in do_idle() and it has no
chance call tick_nohz_idle_exit() to enable the tick. This results in
the idle loop cannot reenable sched tick afterwards, if the idle
governor selects a shallow state, thus the powernightmares issue can
occur again.

This issue can be easily reproduce with the case on Arm Hikey board: use
CPU0 to send IPI to CPU7, CPU7 receives the IPI and in the callback
function it start a hrtimer with 4ms, so the 4ms timer delta value can
let 'menu' governor to choose deepest state in the next entering idle
time. From then on, CPU7 restarts hrtimer with 1ms interval for total
10 times, so this can utilize the typical pattern in 'menu' governor to
have prediction for 1ms duration, finally idle governor is easily to
select a shallow state, on Hikey board it usually is to select CPU off
state. From then on, CPU7 stays in this shallow state for long time
until there have other interrupts on it.

C2: cluster off; C1: CPU off

Idle state: C2 C2 C2 C2 C2 C2 C2 C1
--------------------------------------------------------->
Interrupt: ^ ^ ^ ^ ^ ^ ^ ^ ^
IPI Timer Timer Timer Timer Timer Timer Timer Timer
4ms 1ms 1ms 1ms 1ms 1ms 1ms 1ms

To fix this issue, the idle loop needs to support reenabling sched tick.
This patch checks the conditions 'stop_tick' is false when the tick is
stopped, this condition indicates the cpuidle governor asks to reenable
the tick and we can use tick_nohz_idle_restart_tick() for this purpose.

A synthetic case is used to to verify this patch, we use CPU0 to send
IPI to wake up CPU7 with 50ms interval, CPU7 generate a series hrtimer
events (the first interval is 4ms, then the sequential 10 timer events
are 1ms interval, same as described above). We do statistics for idle
states duration, the unit is second (s), the testing result shows the
C2 state (deepest state) staying time can be improved significantly for
CPU7 (+7.942s for 10s execution time on CPU7) and all CPUs wide
(+13.360s for ~80s of all CPUs execution time).

Without patches With patches Difference
-------------------- -------------------- -----------------------
CPU C0 C1 C2 C0 C1 C2 C0 C1 C2
0 0.000 0.027 9.941 0.055 0.038 9.700 +0.055 +0.010 -0.240
1 0.045 0.000 9.964 0.019 0.000 9.943 -0.026 +0.000 -0.020
2 0.002 0.003 10.007 0.035 0.053 9.916 +0.033 +0.049 -0.090
3 0.000 0.023 9.994 0.024 0.246 9.732 +0.024 +0.222 -0.261
4 0.032 0.000 9.985 0.015 0.007 9.993 -0.016 +0.007 +0.008
5 0.001 0.000 9.226 0.039 0.000 9.971 +0.038 +0.000 +0.744
6 0.000 0.000 0.000 0.036 0.000 5.278 +0.036 +0.000 +5.278
7 1.894 8.013 0.059 1.509 0.026 8.002 -0.384 -7.987 +7.942
All 1.976 8.068 59.179 1.737 0.372 72.539 -0.239 -7.695 +13.360

Cc: Daniel Lezcano <[email protected]>
Cc: Vincent Guittot <[email protected]>
Signed-off-by: Leo Yan <[email protected]>
---
kernel/sched/idle.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 1a3e9bd..802286e 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -190,10 +190,18 @@ static void cpuidle_idle_call(void)
*/
next_state = cpuidle_select(drv, dev, &stop_tick);

- if (stop_tick)
+ if (stop_tick) {
tick_nohz_idle_stop_tick();
- else
+ } else {
+ /*
+ * The cpuidle framework says to not stop tick but
+ * the tick has been stopped yet, so restart it.
+ */
+ if (tick_nohz_tick_stopped())
+ tick_nohz_idle_restart_tick();
+
tick_nohz_idle_retain_tick();
+ }

rcu_idle_enter();

--
2.7.4

2018-08-09 07:00:12

by Leo Yan

[permalink] [raw]

Subject: Re: [PATCH] sched: idle: Reenable sched tick for cpuidle request

On Thu, Aug 09, 2018 at 01:47:27PM +0800, Leo Yan wrote:
> The idle loop stops tick by respecting the decision from cpuidle
> framework, if the condition 'need_resched()' is false without any task
> scheduling, the CPU keeps running in the loop in do_idle() and it has no
> chance call tick_nohz_idle_exit() to enable the tick. This results in
> the idle loop cannot reenable sched tick afterwards, if the idle
> governor selects a shallow state, thus the powernightmares issue can
> occur again.
>
> This issue can be easily reproduce with the case on Arm Hikey board: use
> CPU0 to send IPI to CPU7, CPU7 receives the IPI and in the callback
> function it start a hrtimer with 4ms, so the 4ms timer delta value can
> let 'menu' governor to choose deepest state in the next entering idle
> time. From then on, CPU7 restarts hrtimer with 1ms interval for total
> 10 times, so this can utilize the typical pattern in 'menu' governor to
> have prediction for 1ms duration, finally idle governor is easily to
> select a shallow state, on Hikey board it usually is to select CPU off
> state. From then on, CPU7 stays in this shallow state for long time
> until there have other interrupts on it.
>
> C2: cluster off; C1: CPU off
>
> Idle state: C2 C2 C2 C2 C2 C2 C2 C1
> --------------------------------------------------------->
> Interrupt: ^ ^ ^ ^ ^ ^ ^ ^ ^
> IPI Timer Timer Timer Timer Timer Timer Timer Timer
> 4ms 1ms 1ms 1ms 1ms 1ms 1ms 1ms
>
> To fix this issue, the idle loop needs to support reenabling sched tick.
> This patch checks the conditions 'stop_tick' is false when the tick is
> stopped, this condition indicates the cpuidle governor asks to reenable
> the tick and we can use tick_nohz_idle_restart_tick() for this purpose.
>
> A synthetic case is used to to verify this patch, we use CPU0 to send
> IPI to wake up CPU7 with 50ms interval, CPU7 generate a series hrtimer
> events (the first interval is 4ms, then the sequential 10 timer events
> are 1ms interval, same as described above). We do statistics for idle
> states duration, the unit is second (s), the testing result shows the
> C2 state (deepest state) staying time can be improved significantly for
> CPU7 (+7.942s for 10s execution time on CPU7) and all CPUs wide
> (+13.360s for ~80s of all CPUs execution time).
>
> Without patches With patches Difference
> -------------------- -------------------- -----------------------
> CPU C0 C1 C2 C0 C1 C2 C0 C1 C2
> 0 0.000 0.027 9.941 0.055 0.038 9.700 +0.055 +0.010 -0.240
> 1 0.045 0.000 9.964 0.019 0.000 9.943 -0.026 +0.000 -0.020
> 2 0.002 0.003 10.007 0.035 0.053 9.916 +0.033 +0.049 -0.090
> 3 0.000 0.023 9.994 0.024 0.246 9.732 +0.024 +0.222 -0.261
> 4 0.032 0.000 9.985 0.015 0.007 9.993 -0.016 +0.007 +0.008
> 5 0.001 0.000 9.226 0.039 0.000 9.971 +0.038 +0.000 +0.744
> 6 0.000 0.000 0.000 0.036 0.000 5.278 +0.036 +0.000 +5.278
> 7 1.894 8.013 0.059 1.509 0.026 8.002 -0.384 -7.987 +7.942
> All 1.976 8.068 59.179 1.737 0.372 72.539 -0.239 -7.695 +13.360

I found the CPU6 data in upper table is flaw when I read this again,
CPU6 has no any ftrace event for idle entering/exiting from the start
testing, both two runs have the same issue. so the result is not
reliable for CPU6.

Retested this case and at the beginning to wake up all CPUs so we can
have sane idle ftrace events. Below is result, the conclusion is: CPU7
has improvement for staying in deepest state and there have no
regression on other CPUs.

Without patches With patches Difference
-------------------- -------------------- ----------------------
CPU C0 C1 C2 C0 C1 C2 C0 C1 C2
0 0.000 0.021 9.837 0.000 0.022 9.919 +0.000 +0.000 +0.081
1 0.000 0.003 10.034 0.028 0.000 9.983 +0.028 -0.003 -0.051
2 0.023 0.031 9.963 0.007 0.019 9.986 -0.016 -0.011 +0.023
3 0.028 0.003 9.976 0.000 0.008 10.006 -0.027 +0.005 +0.030
4 0.052 0.000 9.971 0.023 0.000 9.994 -0.028 +0.000 +0.022
5 0.027 0.000 10.002 0.024 0.000 9.996 -0.002 +0.000 -0.006
6 0.013 0.000 10.018 0.025 0.000 9.992 +0.011 +0.000 -0.025
7 1.766 8.041 0.043 1.981 0.030 7.872 +0.214 -8.011 +7.829
All 1.912 8.101 69.847 2.092 0.081 77.752 +0.180 -8.020 +7.905

Another important dependency should to mention, we also need another
prerequisite patch "cpuidle: menu: Correct the criteria for stopping
tick" [1] for the testing, if without this patch, the idle governor
will select shallow state in idle loop but it will not tell idle loop
to reenable tick:

'expected_interval' is always be clamped to
min(TICK_USEC, ktime_to_us(delta_next)) [2] when tick is stopped,
thus 'expected_interval' is assigned to TICK_USEC at the last
time when the CPU enter idle state and without timer event, this results
in it cannot meet condition 'expected_interval < TICK_USEC' [3] for
enabling tick. We need rely on the dependent patch to set tick enabling
flag '*stop_tick = false' for shallow states.

[1] https://lkml.org/lkml/2018/8/7/407
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/cpuidle/governors/menu.c?h=v4.18-rc8#n358
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/cpuidle/governors/menu.c?h=v4.18-rc8#n407

> Cc: Daniel Lezcano <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Signed-off-by: Leo Yan <[email protected]>
> ---
> kernel/sched/idle.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 1a3e9bd..802286e 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -190,10 +190,18 @@ static void cpuidle_idle_call(void)
> */
> next_state = cpuidle_select(drv, dev, &stop_tick);
>
> - if (stop_tick)
> + if (stop_tick) {
> tick_nohz_idle_stop_tick();
> - else
> + } else {
> + /*
> + * The cpuidle framework says to not stop tick but
> + * the tick has been stopped yet, so restart it.
> + */
> + if (tick_nohz_tick_stopped())
> + tick_nohz_idle_restart_tick();
> +
> tick_nohz_idle_retain_tick();
> + }
>
> rcu_idle_enter();
>
> --
> 2.7.4
>

2018-08-09 10:47:11

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH] sched: idle: Reenable sched tick for cpuidle request

On Thu, Aug 09, 2018 at 01:47:27PM +0800, Leo Yan wrote:
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 1a3e9bd..802286e 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -190,10 +190,18 @@ static void cpuidle_idle_call(void)
> */
> next_state = cpuidle_select(drv, dev, &stop_tick);
>
> - if (stop_tick)
> + if (stop_tick) {
> tick_nohz_idle_stop_tick();
> - else
> + } else {
> + /*
> + * The cpuidle framework says to not stop tick but
> + * the tick has been stopped yet, so restart it.
> + */
> + if (tick_nohz_tick_stopped())
> + tick_nohz_idle_restart_tick();
> +

I suspect you want an 'else' here. restart_tick already calls
timer_clear_idle().

> tick_nohz_idle_retain_tick();
> + }
>

However, I would rather we stuff all this into retain_tick.

2018-08-09 11:18:42

by Leo Yan

[permalink] [raw]

Subject: Re: [PATCH] sched: idle: Reenable sched tick for cpuidle request

On Thu, Aug 09, 2018 at 12:45:49PM +0200, Peter Zijlstra wrote:
> On Thu, Aug 09, 2018 at 01:47:27PM +0800, Leo Yan wrote:
> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index 1a3e9bd..802286e 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -190,10 +190,18 @@ static void cpuidle_idle_call(void)
> > */
> > next_state = cpuidle_select(drv, dev, &stop_tick);
> >
> > - if (stop_tick)
> > + if (stop_tick) {
> > tick_nohz_idle_stop_tick();
> > - else
> > + } else {
> > + /*
> > + * The cpuidle framework says to not stop tick but
> > + * the tick has been stopped yet, so restart it.
> > + */
> > + if (tick_nohz_tick_stopped())
> > + tick_nohz_idle_restart_tick();
> > +
>
> I suspect you want an 'else' here. restart_tick already calls
> timer_clear_idle().

No, from the testing I found must call retain_tick, otherwise the
kernel compliants the warning from tick_nohz_idle_exit() when exit
from idle state:

WARN_ON_ONCE(ts->timer_expires_base);

> > tick_nohz_idle_retain_tick();
> > + }
> >
>
> However, I would rather we stuff all this into retain_tick.

Ah, yes; I tested below change and it also have same improvement for
idle state with my preivous change; please review if it's okay?

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index da9455a..fd2bfad 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -962,6 +962,10 @@ void tick_nohz_idle_stop_tick(void)

void tick_nohz_idle_retain_tick(void)
{
+ /* Restart the tikc if it has been stopped yet. */
+ if (tick_nohz_tick_stopped())
+ tick_nohz_idle_restart_tick();
+
tick_nohz_retain_tick(this_cpu_ptr(&tick_cpu_sched));
/*
* Undo the effect of get_next_timer_interrupt() called from

Thanks,
Leo Yan

2018-08-09 12:09:05

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: [PATCH] sched: idle: Reenable sched tick for cpuidle request

On Thursday, August 9, 2018 7:47:27 AM CEST Leo Yan wrote:
> The idle loop stops tick by respecting the decision from cpuidle
> framework, if the condition 'need_resched()' is false without any task
> scheduling, the CPU keeps running in the loop in do_idle() and it has no
> chance call tick_nohz_idle_exit() to enable the tick. This results in
> the idle loop cannot reenable sched tick afterwards, if the idle
> governor selects a shallow state, thus the powernightmares issue can
> occur again.

Well, there is code in the menu governor to avoid that.

> This issue can be easily reproduce with the case on Arm Hikey board: use
> CPU0 to send IPI to CPU7, CPU7 receives the IPI and in the callback
> function it start a hrtimer with 4ms, so the 4ms timer delta value can
> let 'menu' governor to choose deepest state in the next entering idle
> time. From then on, CPU7 restarts hrtimer with 1ms interval for total
> 10 times, so this can utilize the typical pattern in 'menu' governor to
> have prediction for 1ms duration, finally idle governor is easily to
> select a shallow state, on Hikey board it usually is to select CPU off
> state. From then on, CPU7 stays in this shallow state for long time
> until there have other interrupts on it.

And which means that the above-mentioned code misses this case.

>
> C2: cluster off; C1: CPU off
>
> Idle state: C2 C2 C2 C2 C2 C2 C2 C1
> --------------------------------------------------------->
> Interrupt: ^ ^ ^ ^ ^ ^ ^ ^ ^
> IPI Timer Timer Timer Timer Timer Timer Timer Timer
> 4ms 1ms 1ms 1ms 1ms 1ms 1ms 1ms
>
> To fix this issue, the idle loop needs to support reenabling sched tick.
> This patch checks the conditions 'stop_tick' is false when the tick is
> stopped, this condition indicates the cpuidle governor asks to reenable
> the tick and we can use tick_nohz_idle_restart_tick() for this purpose.
>
> A synthetic case is used to to verify this patch, we use CPU0 to send
> IPI to wake up CPU7 with 50ms interval, CPU7 generate a series hrtimer
> events (the first interval is 4ms, then the sequential 10 timer events
> are 1ms interval, same as described above). We do statistics for idle
> states duration, the unit is second (s), the testing result shows the
> C2 state (deepest state) staying time can be improved significantly for
> CPU7 (+7.942s for 10s execution time on CPU7) and all CPUs wide
> (+13.360s for ~80s of all CPUs execution time).
>
> Without patches With patches Difference
> -------------------- -------------------- -----------------------
> CPU C0 C1 C2 C0 C1 C2 C0 C1 C2
> 0 0.000 0.027 9.941 0.055 0.038 9.700 +0.055 +0.010 -0.240
> 1 0.045 0.000 9.964 0.019 0.000 9.943 -0.026 +0.000 -0.020
> 2 0.002 0.003 10.007 0.035 0.053 9.916 +0.033 +0.049 -0.090
> 3 0.000 0.023 9.994 0.024 0.246 9.732 +0.024 +0.222 -0.261
> 4 0.032 0.000 9.985 0.015 0.007 9.993 -0.016 +0.007 +0.008
> 5 0.001 0.000 9.226 0.039 0.000 9.971 +0.038 +0.000 +0.744
> 6 0.000 0.000 0.000 0.036 0.000 5.278 +0.036 +0.000 +5.278
> 7 1.894 8.013 0.059 1.509 0.026 8.002 -0.384 -7.987 +7.942
> All 1.976 8.068 59.179 1.737 0.372 72.539 -0.239 -7.695 +13.360
>
> Cc: Daniel Lezcano <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Signed-off-by: Leo Yan <[email protected]>
> ---
> kernel/sched/idle.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 1a3e9bd..802286e 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -190,10 +190,18 @@ static void cpuidle_idle_call(void)
> */
> next_state = cpuidle_select(drv, dev, &stop_tick);
>
> - if (stop_tick)
> + if (stop_tick) {
> tick_nohz_idle_stop_tick();
> - else
> + } else {
> + /*
> + * The cpuidle framework says to not stop tick but
> + * the tick has been stopped yet, so restart it.
> + */
> + if (tick_nohz_tick_stopped())
> + tick_nohz_idle_restart_tick();

You need an "else" here IMO as Peter said.

> +
> tick_nohz_idle_retain_tick();
> + }
>
> rcu_idle_enter();
>
>

Please CC cpuidle patches to [email protected], that helps a lot.

Thanks,
Rafael

2018-08-09 15:43:44

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: [PATCH] sched: idle: Reenable sched tick for cpuidle request

On Thu, Aug 9, 2018 at 2:05 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Thursday, August 9, 2018 7:47:27 AM CEST Leo Yan wrote:
>> The idle loop stops tick by respecting the decision from cpuidle
>> framework, if the condition 'need_resched()' is false without any task
>> scheduling, the CPU keeps running in the loop in do_idle() and it has no
>> chance call tick_nohz_idle_exit() to enable the tick. This results in
>> the idle loop cannot reenable sched tick afterwards, if the idle
>> governor selects a shallow state, thus the powernightmares issue can
>> occur again.
>
> Well, there is code in the menu governor to avoid that.

So the governor is not expected to select a shallow state then, unless
it knows that there will be a timer interrupt (which is not a tick,
obviously), that will wake up the CPU early enough.

The menu governor works like that, but I agree that it should not
request the tick to be running then (which it does).

>> This issue can be easily reproduce with the case on Arm Hikey board: use
>> CPU0 to send IPI to CPU7, CPU7 receives the IPI and in the callback
>> function it start a hrtimer with 4ms, so the 4ms timer delta value can
>> let 'menu' governor to choose deepest state in the next entering idle
>> time. From then on, CPU7 restarts hrtimer with 1ms interval for total
>> 10 times, so this can utilize the typical pattern in 'menu' governor to
>> have prediction for 1ms duration, finally idle governor is easily to
>> select a shallow state, on Hikey board it usually is to select CPU off
>> state. From then on, CPU7 stays in this shallow state for long time
>> until there have other interrupts on it.
>
> And which means that the above-mentioned code misses this case.

And I don't really understand how this happens. :-/

If menu sees that the tick has been stopped, it sets
data->predicted_us to the minimum of TICK_USEC and
ktime_to_us(delta_next) and the latency requirements comes from PM QoS
(no interactivity boost). Thus the only case when it will say "do not
stop the tick" is when delta_next is below the tick period length, but
that's OK, because it means that there is a timer pending that much
time away, so it doesn't make sense to select a deeper idle state
then.

If there is a short-interval timer pending every time we go idle, it
doesn't matter that the tick is stopped really, because the other
timer will wake the CPU up anyway.

Have I missed anything?

>>
>> C2: cluster off; C1: CPU off
>>
>> Idle state: C2 C2 C2 C2 C2 C2 C2 C1
>> --------------------------------------------------------->
>> Interrupt: ^ ^ ^ ^ ^ ^ ^ ^ ^
>> IPI Timer Timer Timer Timer Timer Timer Timer Timer
>> 4ms 1ms 1ms 1ms 1ms 1ms 1ms 1ms
>>
>> To fix this issue, the idle loop needs to support reenabling sched tick.
>> This patch checks the conditions 'stop_tick' is false when the tick is
>> stopped, this condition indicates the cpuidle governor asks to reenable
>> the tick and we can use tick_nohz_idle_restart_tick() for this purpose.
>>
>> A synthetic case is used to to verify this patch, we use CPU0 to send
>> IPI to wake up CPU7 with 50ms interval, CPU7 generate a series hrtimer
>> events (the first interval is 4ms, then the sequential 10 timer events
>> are 1ms interval, same as described above). We do statistics for idle
>> states duration, the unit is second (s), the testing result shows the
>> C2 state (deepest state) staying time can be improved significantly for
>> CPU7 (+7.942s for 10s execution time on CPU7) and all CPUs wide
>> (+13.360s for ~80s of all CPUs execution time).
>>
>> Without patches With patches Difference
>> -------------------- -------------------- -----------------------
>> CPU C0 C1 C2 C0 C1 C2 C0 C1 C2
>> 0 0.000 0.027 9.941 0.055 0.038 9.700 +0.055 +0.010 -0.240
>> 1 0.045 0.000 9.964 0.019 0.000 9.943 -0.026 +0.000 -0.020
>> 2 0.002 0.003 10.007 0.035 0.053 9.916 +0.033 +0.049 -0.090
>> 3 0.000 0.023 9.994 0.024 0.246 9.732 +0.024 +0.222 -0.261
>> 4 0.032 0.000 9.985 0.015 0.007 9.993 -0.016 +0.007 +0.008
>> 5 0.001 0.000 9.226 0.039 0.000 9.971 +0.038 +0.000 +0.744
>> 6 0.000 0.000 0.000 0.036 0.000 5.278 +0.036 +0.000 +5.278
>> 7 1.894 8.013 0.059 1.509 0.026 8.002 -0.384 -7.987 +7.942
>> All 1.976 8.068 59.179 1.737 0.372 72.539 -0.239 -7.695 +13.360
>>
>> Cc: Daniel Lezcano <[email protected]>
>> Cc: Vincent Guittot <[email protected]>
>> Signed-off-by: Leo Yan <[email protected]>
>> ---
>> kernel/sched/idle.c | 12 ++++++++++--
>> 1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index 1a3e9bd..802286e 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -190,10 +190,18 @@ static void cpuidle_idle_call(void)
>> */
>> next_state = cpuidle_select(drv, dev, &stop_tick);
>>
>> - if (stop_tick)
>> + if (stop_tick) {
>> tick_nohz_idle_stop_tick();
>> - else
>> + } else {
>> + /*
>> + * The cpuidle framework says to not stop tick but
>> + * the tick has been stopped yet, so restart it.
>> + */
>> + if (tick_nohz_tick_stopped())
>> + tick_nohz_idle_restart_tick();
>
> You need an "else" here IMO as Peter said.

And I really would prefer to avoid restarting the tick here, because
it is overhead and quite likely unnecessary.

>> +
>> tick_nohz_idle_retain_tick();
>> + }
>>
>> rcu_idle_enter();
>>
>>
>
> Please CC cpuidle patches to [email protected], that helps a lot.

Thanks,
Rafael

2018-08-09 16:31:17

by Leo Yan

[permalink] [raw]

Subject: Re: [PATCH] sched: idle: Reenable sched tick for cpuidle request

On Thu, Aug 09, 2018 at 05:42:30PM +0200, Rafael J. Wysocki wrote:

[...]

> >> This issue can be easily reproduce with the case on Arm Hikey board: use
> >> CPU0 to send IPI to CPU7, CPU7 receives the IPI and in the callback
> >> function it start a hrtimer with 4ms, so the 4ms timer delta value can
> >> let 'menu' governor to choose deepest state in the next entering idle
> >> time. From then on, CPU7 restarts hrtimer with 1ms interval for total
> >> 10 times, so this can utilize the typical pattern in 'menu' governor to
> >> have prediction for 1ms duration, finally idle governor is easily to
> >> select a shallow state, on Hikey board it usually is to select CPU off
> >> state. From then on, CPU7 stays in this shallow state for long time
> >> until there have other interrupts on it.
> >
> > And which means that the above-mentioned code misses this case.
>
> And I don't really understand how this happens. :-/
>
> If menu sees that the tick has been stopped, it sets
> data->predicted_us to the minimum of TICK_USEC and
> ktime_to_us(delta_next) and the latency requirements comes from PM QoS
> (no interactivity boost). Thus the only case when it will say "do not
> stop the tick" is when delta_next is below the tick period length, but
> that's OK, because it means that there is a timer pending that much
> time away, so it doesn't make sense to select a deeper idle state
> then.
>
> If there is a short-interval timer pending every time we go idle, it
> doesn't matter that the tick is stopped really, because the other
> timer will wake the CPU up anyway.
>
> Have I missed anything?

Yeah, you miss one case is if there haven't anymore timer event, for this
case the ktime_to_us(delta_next) is a quite large value and
data->predicted_us will be to set TICK_USEC; if HZ=1000 then TICK_USEC is
1000us, on Hikey board if data->predicted_us is 1000us then it's easily
to set shallow state (C1) rather than C2. Unfortunately, this is the
last time the CPU can predict idle state before it will stay in idle
for long period.

[...]

> >> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> >> index 1a3e9bd..802286e 100644
> >> --- a/kernel/sched/idle.c
> >> +++ b/kernel/sched/idle.c
> >> @@ -190,10 +190,18 @@ static void cpuidle_idle_call(void)
> >> */
> >> next_state = cpuidle_select(drv, dev, &stop_tick);
> >>
> >> - if (stop_tick)
> >> + if (stop_tick) {
> >> tick_nohz_idle_stop_tick();
> >> - else
> >> + } else {
> >> + /*
> >> + * The cpuidle framework says to not stop tick but
> >> + * the tick has been stopped yet, so restart it.
> >> + */
> >> + if (tick_nohz_tick_stopped())
> >> + tick_nohz_idle_restart_tick();
> >
> > You need an "else" here IMO as Peter said.

Yeah, I have replied to Peter, after we restart the tick, I found must to
call tick_retain() to clear 'ts->timer_expires_base' to 0, otherwise
tick_nohz_idle_exit() reports warning when exit from idle loop.

> And I really would prefer to avoid restarting the tick here, because
> it is overhead and quite likely unnecessary.

I understand the logic when read the code, actually I did some experiments
on the function menu_select(), in menu_select() function it discards the
consideration for typical pattern interval and it also tries to avoid to
enable tick and select more shallow state at the bottom of function. So I
agree that in the middle of idles it's redundant to reenable tick and the
code is careful thought.

But this patch tries to rescue the case at the last time the CPU enter one
shallow idle state but without wake up event.

> >> +
> >> tick_nohz_idle_retain_tick();
> >> + }
> >>
> >> rcu_idle_enter();
> >>
> >>
> >
> > Please CC cpuidle patches to [email protected], that helps a lot.
>
> Thanks,
> Rafael

2018-08-09 16:44:59

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: [PATCH] sched: idle: Reenable sched tick for cpuidle request

On Thu, Aug 9, 2018 at 6:29 PM, <[email protected]> wrote:
> On Thu, Aug 09, 2018 at 05:42:30PM +0200, Rafael J. Wysocki wrote:
>
> [...]
>
>> >> This issue can be easily reproduce with the case on Arm Hikey board: use
>> >> CPU0 to send IPI to CPU7, CPU7 receives the IPI and in the callback
>> >> function it start a hrtimer with 4ms, so the 4ms timer delta value can
>> >> let 'menu' governor to choose deepest state in the next entering idle
>> >> time. From then on, CPU7 restarts hrtimer with 1ms interval for total
>> >> 10 times, so this can utilize the typical pattern in 'menu' governor to
>> >> have prediction for 1ms duration, finally idle governor is easily to
>> >> select a shallow state, on Hikey board it usually is to select CPU off
>> >> state. From then on, CPU7 stays in this shallow state for long time
>> >> until there have other interrupts on it.
>> >
>> > And which means that the above-mentioned code misses this case.
>>
>> And I don't really understand how this happens. :-/
>>
>> If menu sees that the tick has been stopped, it sets
>> data->predicted_us to the minimum of TICK_USEC and
>> ktime_to_us(delta_next) and the latency requirements comes from PM QoS
>> (no interactivity boost). Thus the only case when it will say "do not
>> stop the tick" is when delta_next is below the tick period length, but
>> that's OK, because it means that there is a timer pending that much
>> time away, so it doesn't make sense to select a deeper idle state
>> then.
>>
>> If there is a short-interval timer pending every time we go idle, it
>> doesn't matter that the tick is stopped really, because the other
>> timer will wake the CPU up anyway.
>>
>> Have I missed anything?
>
> Yeah, you miss one case is if there haven't anymore timer event, for this
> case the ktime_to_us(delta_next) is a quite large value and
> data->predicted_us will be to set TICK_USEC; if HZ=1000 then TICK_USEC is
> 1000us, on Hikey board if data->predicted_us is 1000us then it's easily
> to set shallow state (C1) rather than C2. Unfortunately, this is the
> last time the CPU can predict idle state before it will stay in idle
> for long period.

Fair enough, but in that case the governor will want the tick to be
stopped, because expected_interval is TICK_USEC then, so I'm not sure
how the patch helps?

> [...]
>
>> >> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> >> index 1a3e9bd..802286e 100644
>> >> --- a/kernel/sched/idle.c
>> >> +++ b/kernel/sched/idle.c
>> >> @@ -190,10 +190,18 @@ static void cpuidle_idle_call(void)
>> >> */
>> >> next_state = cpuidle_select(drv, dev, &stop_tick);
>> >>
>> >> - if (stop_tick)
>> >> + if (stop_tick) {
>> >> tick_nohz_idle_stop_tick();
>> >> - else
>> >> + } else {
>> >> + /*
>> >> + * The cpuidle framework says to not stop tick but
>> >> + * the tick has been stopped yet, so restart it.
>> >> + */
>> >> + if (tick_nohz_tick_stopped())
>> >> + tick_nohz_idle_restart_tick();
>> >
>> > You need an "else" here IMO as Peter said.
>
> Yeah, I have replied to Peter, after we restart the tick, I found must to
> call tick_retain() to clear 'ts->timer_expires_base' to 0, otherwise
> tick_nohz_idle_exit() reports warning when exit from idle loop.

I see now, thanks.

>> And I really would prefer to avoid restarting the tick here, because
>> it is overhead and quite likely unnecessary.
>
> I understand the logic when read the code, actually I did some experiments
> on the function menu_select(), in menu_select() function it discards the
> consideration for typical pattern interval and it also tries to avoid to
> enable tick and select more shallow state at the bottom of function. So I
> agree that in the middle of idles it's redundant to reenable tick and the
> code is careful thought.
>
> But this patch tries to rescue the case at the last time the CPU enter one
> shallow idle state but without wake up event.

It is better to avoid entering a shallow state IMO. Let me think
about that a bit.

2018-08-09 17:06:15

by Leo Yan

[permalink] [raw]

Subject: Re: [PATCH] sched: idle: Reenable sched tick for cpuidle request

On Thu, Aug 09, 2018 at 06:43:55PM +0200, Rafael J. Wysocki wrote:
> On Thu, Aug 9, 2018 at 6:29 PM, <[email protected]> wrote:
> > On Thu, Aug 09, 2018 at 05:42:30PM +0200, Rafael J. Wysocki wrote:
> >
> > [...]
> >
> >> >> This issue can be easily reproduce with the case on Arm Hikey board: use
> >> >> CPU0 to send IPI to CPU7, CPU7 receives the IPI and in the callback
> >> >> function it start a hrtimer with 4ms, so the 4ms timer delta value can
> >> >> let 'menu' governor to choose deepest state in the next entering idle
> >> >> time. From then on, CPU7 restarts hrtimer with 1ms interval for total
> >> >> 10 times, so this can utilize the typical pattern in 'menu' governor to
> >> >> have prediction for 1ms duration, finally idle governor is easily to
> >> >> select a shallow state, on Hikey board it usually is to select CPU off
> >> >> state. From then on, CPU7 stays in this shallow state for long time
> >> >> until there have other interrupts on it.
> >> >
> >> > And which means that the above-mentioned code misses this case.
> >>
> >> And I don't really understand how this happens. :-/
> >>
> >> If menu sees that the tick has been stopped, it sets
> >> data->predicted_us to the minimum of TICK_USEC and
> >> ktime_to_us(delta_next) and the latency requirements comes from PM QoS
> >> (no interactivity boost). Thus the only case when it will say "do not
> >> stop the tick" is when delta_next is below the tick period length, but
> >> that's OK, because it means that there is a timer pending that much
> >> time away, so it doesn't make sense to select a deeper idle state
> >> then.
> >>
> >> If there is a short-interval timer pending every time we go idle, it
> >> doesn't matter that the tick is stopped really, because the other
> >> timer will wake the CPU up anyway.
> >>
> >> Have I missed anything?
> >
> > Yeah, you miss one case is if there haven't anymore timer event, for this
> > case the ktime_to_us(delta_next) is a quite large value and
> > data->predicted_us will be to set TICK_USEC; if HZ=1000 then TICK_USEC is
> > 1000us, on Hikey board if data->predicted_us is 1000us then it's easily
> > to set shallow state (C1) rather than C2. Unfortunately, this is the
> > last time the CPU can predict idle state before it will stay in idle
> > for long period.
>
> Fair enough, but in that case the governor will want the tick to be
> stopped, because expected_interval is TICK_USEC then, so I'm not sure
> how the patch helps?

Correct, I might introduce confusion at here and I mentioned in
another email I have one prerequisite patch [1]: "cpuidle: menu: Correct
the criteria for stopping tick", if without this dependency patch, the idle
governor will always stop the tick even it selects one shallow state.

Sorry when I sent patchs with [1], I didn't send to linux-pm mailing list,
do you want me to send these patches to linux-pm?

[...]

Thanks,
Leo Yan

[1] https://lkml.org/lkml/2018/8/7/407

2018-08-09 17:07:42

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: [PATCH] sched: idle: Reenable sched tick for cpuidle request

On Thu, Aug 9, 2018 at 7:04 PM, <[email protected]> wrote:
> On Thu, Aug 09, 2018 at 06:43:55PM +0200, Rafael J. Wysocki wrote:
>> On Thu, Aug 9, 2018 at 6:29 PM, <[email protected]> wrote:
>> > On Thu, Aug 09, 2018 at 05:42:30PM +0200, Rafael J. Wysocki wrote:
>> >
>> > [...]
>> >
>> >> >> This issue can be easily reproduce with the case on Arm Hikey board: use
>> >> >> CPU0 to send IPI to CPU7, CPU7 receives the IPI and in the callback
>> >> >> function it start a hrtimer with 4ms, so the 4ms timer delta value can
>> >> >> let 'menu' governor to choose deepest state in the next entering idle
>> >> >> time. From then on, CPU7 restarts hrtimer with 1ms interval for total
>> >> >> 10 times, so this can utilize the typical pattern in 'menu' governor to
>> >> >> have prediction for 1ms duration, finally idle governor is easily to
>> >> >> select a shallow state, on Hikey board it usually is to select CPU off
>> >> >> state. From then on, CPU7 stays in this shallow state for long time
>> >> >> until there have other interrupts on it.
>> >> >
>> >> > And which means that the above-mentioned code misses this case.
>> >>
>> >> And I don't really understand how this happens. :-/
>> >>
>> >> If menu sees that the tick has been stopped, it sets
>> >> data->predicted_us to the minimum of TICK_USEC and
>> >> ktime_to_us(delta_next) and the latency requirements comes from PM QoS
>> >> (no interactivity boost). Thus the only case when it will say "do not
>> >> stop the tick" is when delta_next is below the tick period length, but
>> >> that's OK, because it means that there is a timer pending that much
>> >> time away, so it doesn't make sense to select a deeper idle state
>> >> then.
>> >>
>> >> If there is a short-interval timer pending every time we go idle, it
>> >> doesn't matter that the tick is stopped really, because the other
>> >> timer will wake the CPU up anyway.
>> >>
>> >> Have I missed anything?
>> >
>> > Yeah, you miss one case is if there haven't anymore timer event, for this
>> > case the ktime_to_us(delta_next) is a quite large value and
>> > data->predicted_us will be to set TICK_USEC; if HZ=1000 then TICK_USEC is
>> > 1000us, on Hikey board if data->predicted_us is 1000us then it's easily
>> > to set shallow state (C1) rather than C2. Unfortunately, this is the
>> > last time the CPU can predict idle state before it will stay in idle
>> > for long period.
>>
>> Fair enough, but in that case the governor will want the tick to be
>> stopped, because expected_interval is TICK_USEC then, so I'm not sure
>> how the patch helps?
>
> Correct, I might introduce confusion at here and I mentioned in
> another email I have one prerequisite patch [1]: "cpuidle: menu: Correct
> the criteria for stopping tick", if without this dependency patch, the idle
> governor will always stop the tick even it selects one shallow state.
>
> Sorry when I sent patchs with [1], I didn't send to linux-pm mailing list,
> do you want me to send these patches to linux-pm?

Please do.

2018-08-09 21:34:58

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: [PATCH] sched: idle: Reenable sched tick for cpuidle request

On Thursday, August 9, 2018 6:43:55 PM CEST Rafael J. Wysocki wrote:
> On Thu, Aug 9, 2018 at 6:29 PM, <[email protected]> wrote:
> > On Thu, Aug 09, 2018 at 05:42:30PM +0200, Rafael J. Wysocki wrote:
> >
> > [...]
> >
> >> >> This issue can be easily reproduce with the case on Arm Hikey board: use
> >> >> CPU0 to send IPI to CPU7, CPU7 receives the IPI and in the callback
> >> >> function it start a hrtimer with 4ms, so the 4ms timer delta value can
> >> >> let 'menu' governor to choose deepest state in the next entering idle
> >> >> time. From then on, CPU7 restarts hrtimer with 1ms interval for total
> >> >> 10 times, so this can utilize the typical pattern in 'menu' governor to
> >> >> have prediction for 1ms duration, finally idle governor is easily to
> >> >> select a shallow state, on Hikey board it usually is to select CPU off
> >> >> state. From then on, CPU7 stays in this shallow state for long time
> >> >> until there have other interrupts on it.
> >> >
> >> > And which means that the above-mentioned code misses this case.
> >>
> >> And I don't really understand how this happens. :-/
> >>
> >> If menu sees that the tick has been stopped, it sets
> >> data->predicted_us to the minimum of TICK_USEC and
> >> ktime_to_us(delta_next) and the latency requirements comes from PM QoS
> >> (no interactivity boost). Thus the only case when it will say "do not
> >> stop the tick" is when delta_next is below the tick period length, but
> >> that's OK, because it means that there is a timer pending that much
> >> time away, so it doesn't make sense to select a deeper idle state
> >> then.
> >>
> >> If there is a short-interval timer pending every time we go idle, it
> >> doesn't matter that the tick is stopped really, because the other
> >> timer will wake the CPU up anyway.
> >>
> >> Have I missed anything?
> >
> > Yeah, you miss one case is if there haven't anymore timer event, for this
> > case the ktime_to_us(delta_next) is a quite large value and
> > data->predicted_us will be to set TICK_USEC; if HZ=1000 then TICK_USEC is
> > 1000us, on Hikey board if data->predicted_us is 1000us then it's easily
> > to set shallow state (C1) rather than C2. Unfortunately, this is the
> > last time the CPU can predict idle state before it will stay in idle
> > for long period.
>
> Fair enough, but in that case the governor will want the tick to be
> stopped, because expected_interval is TICK_USEC then, so I'm not sure
> how the patch helps?
>
> > [...]
> >
> >> >> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> >> >> index 1a3e9bd..802286e 100644
> >> >> --- a/kernel/sched/idle.c
> >> >> +++ b/kernel/sched/idle.c
> >> >> @@ -190,10 +190,18 @@ static void cpuidle_idle_call(void)
> >> >> */
> >> >> next_state = cpuidle_select(drv, dev, &stop_tick);
> >> >>
> >> >> - if (stop_tick)
> >> >> + if (stop_tick) {
> >> >> tick_nohz_idle_stop_tick();
> >> >> - else
> >> >> + } else {
> >> >> + /*
> >> >> + * The cpuidle framework says to not stop tick but
> >> >> + * the tick has been stopped yet, so restart it.
> >> >> + */
> >> >> + if (tick_nohz_tick_stopped())
> >> >> + tick_nohz_idle_restart_tick();
> >> >
> >> > You need an "else" here IMO as Peter said.
> >
> > Yeah, I have replied to Peter, after we restart the tick, I found must to
> > call tick_retain() to clear 'ts->timer_expires_base' to 0, otherwise
> > tick_nohz_idle_exit() reports warning when exit from idle loop.
>
> I see now, thanks.
>
> >> And I really would prefer to avoid restarting the tick here, because
> >> it is overhead and quite likely unnecessary.
> >
> > I understand the logic when read the code, actually I did some experiments
> > on the function menu_select(), in menu_select() function it discards the
> > consideration for typical pattern interval and it also tries to avoid to
> > enable tick and select more shallow state at the bottom of function. So I
> > agree that in the middle of idles it's redundant to reenable tick and the
> > code is careful thought.
> >
> > But this patch tries to rescue the case at the last time the CPU enter one
> > shallow idle state but without wake up event.
>
> It is better to avoid entering a shallow state IMO. Let me think
> about that a bit.

The simple change below should address this issue and I don't quite see
what it can break. It may cause deeper idle states to be selected with
the tick already stopped, but that really shouldn't be problematic, as
(since the tick has been stopped) there are no strict latency constraints,
so even if there is an early wakeup, we should be able to tolerate the
extra latency just fine.

---
drivers/cpuidle/governors/menu.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -349,14 +349,12 @@ static int menu_select(struct cpuidle_dr
* If the tick is already stopped, the cost of possible short
* idle duration misprediction is much higher, because the CPU
* may be stuck in a shallow idle state for a long time as a
- * result of it. In that case say we might mispredict and try
- * to force the CPU into a state for which we would have stopped
- * the tick, unless a timer is going to expire really soon
- * anyway.
+ * result of it. In that case say we might mispredict and use
+ * the known time to the closest timer event for the idle state
+ * selection.
*/
if (data->predicted_us < TICK_USEC)
- data->predicted_us = min_t(unsigned int, TICK_USEC,
- ktime_to_us(delta_next));
+ data->predicted_us = ktime_to_us(delta_next);
} else {
/*
* Use the performance multiplier and the user-configurable

2018-08-10 05:55:46

by Leo Yan

[permalink] [raw]

Subject: Re: [PATCH] sched: idle: Reenable sched tick for cpuidle request

On Thu, Aug 09, 2018 at 11:31:46PM +0200, Rafael J . Wysocki wrote:

[...]

> > >> And I really would prefer to avoid restarting the tick here, because
> > >> it is overhead and quite likely unnecessary.
> > >
> > > I understand the logic when read the code, actually I did some experiments
> > > on the function menu_select(), in menu_select() function it discards the
> > > consideration for typical pattern interval and it also tries to avoid to
> > > enable tick and select more shallow state at the bottom of function. So I
> > > agree that in the middle of idles it's redundant to reenable tick and the
> > > code is careful thought.
> > >
> > > But this patch tries to rescue the case at the last time the CPU enter one
> > > shallow idle state but without wake up event.
> >
> > It is better to avoid entering a shallow state IMO. Let me think
> > about that a bit.
>
> The simple change below should address this issue and I don't quite see
> what it can break. It may cause deeper idle states to be selected with
> the tick already stopped, but that really shouldn't be problematic, as
> (since the tick has been stopped) there are no strict latency constraints,
> so even if there is an early wakeup, we should be able to tolerate the
> extra latency just fine.
>
> ---
> drivers/cpuidle/governors/menu.c | 10 ++++------
> 1 file changed, 4 insertions(+), 6 deletions(-)
>
> Index: linux-pm/drivers/cpuidle/governors/menu.c
> ===================================================================
> --- linux-pm.orig/drivers/cpuidle/governors/menu.c
> +++ linux-pm/drivers/cpuidle/governors/menu.c
> @@ -349,14 +349,12 @@ static int menu_select(struct cpuidle_dr
> * If the tick is already stopped, the cost of possible short
> * idle duration misprediction is much higher, because the CPU
> * may be stuck in a shallow idle state for a long time as a
> - * result of it. In that case say we might mispredict and try
> - * to force the CPU into a state for which we would have stopped
> - * the tick, unless a timer is going to expire really soon
> - * anyway.
> + * result of it. In that case say we might mispredict and use
> + * the known time to the closest timer event for the idle state
> + * selection.
> */
> if (data->predicted_us < TICK_USEC)
> - data->predicted_us = min_t(unsigned int, TICK_USEC,
> - ktime_to_us(delta_next));
> + data->predicted_us = ktime_to_us(delta_next);

I did the testing on this, but above change cannot really resolve the
issue, it misses to handle the case if 'data->predicted_us > TICK_USEC';
if the prediction is longer than TICK_USEC, e.g. data->predicted_us is
2ms, TICK_USEC=1ms; for this case the deepest state will not be
chosen and if the data->predicted_us is decided by typical pattern
value but not the closest timer, finally the CPU still might stay in
shallow state for long time.

Actually in the CPU idle loop with the tick is stopped, I think we
should achieve two targets:
- Ensure the CPU can enter the deepest idle state at the last time it
runs into into idle;
- In the middle of idles, we will not reenable the tick at all; though
the idle states can be chosen a shallow state for short prediction;

To achieve the first target, we need to define what's the possible
case the CPU might stay into shallow state but cannot be waken up in
short time; so for this purpose it's pointeless to compare the value
between 'data->predicted_us' and TICK_USEC, so I'd like to check if
the next timer event is reliable to wake up CPU in short time, this
can be finished by comparison between 'ktime_to_us(delta_next)' with
maximum target residency;

For the second target, we should not enable the tick again in the idle
loop after the tick is stopped, whatever the governor choose any idle
state.

So how about below changes? I did some verify on this.

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 30ab759..e2de7c2 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -351,18 +351,21 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
data->predicted_us = min(data->predicted_us, expected_interval);

if (tick_nohz_tick_stopped()) {
+ unsigned int delta_next_us = ktime_to_us(delta_next);
+
/*
* If the tick is already stopped, the cost of possible short
* idle duration misprediction is much higher, because the CPU
* may be stuck in a shallow idle state for a long time as a
- * result of it. In that case say we might mispredict and try
- * to force the CPU into a state for which we would have stopped
- * the tick, unless a timer is going to expire really soon
- * anyway.
+ * result of it. If the next timer event is later than the
+ * maximum target residency, this means the timer event is not
+ * trusted to wake up CPU in short term and the typical pattern
+ * interval or other factors might lead to a shallow state, in
+ * that case say we might mispredict and use the known time to
+ * the closest timer event for the idle state selection.
*/
- if (data->predicted_us < TICK_USEC)
- data->predicted_us = min_t(unsigned int, TICK_USEC,
- ktime_to_us(delta_next));
+ if (delta_next_us >= drv->states[drv->state_count-1].target_residency)
+ data->predicted_us = delta_next_us;
} else {
/*
* Use the performance multiplier and the user-configurable
@@ -410,12 +413,12 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
* expected idle duration is shorter than the tick period length.
*/
if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
- expected_interval < TICK_USEC) {
+ (!tick_nohz_tick_stopped() && expected_interval < TICK_USEC)) {
unsigned int delta_next_us = ktime_to_us(delta_next);

*stop_tick = false;

- if (!tick_nohz_tick_stopped() && idx > 0 &&
+ if (idx > 0 &&
drv->states[idx].target_residency > delta_next_us) {
/*
* The tick is not going to be stopped and the target

2018-08-10 07:25:56

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: [PATCH] sched: idle: Reenable sched tick for cpuidle request

On Fri, Aug 10, 2018 at 7:53 AM, <[email protected]> wrote:
> On Thu, Aug 09, 2018 at 11:31:46PM +0200, Rafael J . Wysocki wrote:
>
> [...]
>
>> > >> And I really would prefer to avoid restarting the tick here, because
>> > >> it is overhead and quite likely unnecessary.
>> > >
>> > > I understand the logic when read the code, actually I did some experiments
>> > > on the function menu_select(), in menu_select() function it discards the
>> > > consideration for typical pattern interval and it also tries to avoid to
>> > > enable tick and select more shallow state at the bottom of function. So I
>> > > agree that in the middle of idles it's redundant to reenable tick and the
>> > > code is careful thought.
>> > >
>> > > But this patch tries to rescue the case at the last time the CPU enter one
>> > > shallow idle state but without wake up event.
>> >
>> > It is better to avoid entering a shallow state IMO. Let me think
>> > about that a bit.
>>
>> The simple change below should address this issue and I don't quite see
>> what it can break. It may cause deeper idle states to be selected with
>> the tick already stopped, but that really shouldn't be problematic, as
>> (since the tick has been stopped) there are no strict latency constraints,
>> so even if there is an early wakeup, we should be able to tolerate the
>> extra latency just fine.
>>
>> ---
>> drivers/cpuidle/governors/menu.c | 10 ++++------
>> 1 file changed, 4 insertions(+), 6 deletions(-)
>>
>> Index: linux-pm/drivers/cpuidle/governors/menu.c
>> ===================================================================
>> --- linux-pm.orig/drivers/cpuidle/governors/menu.c
>> +++ linux-pm/drivers/cpuidle/governors/menu.c
>> @@ -349,14 +349,12 @@ static int menu_select(struct cpuidle_dr
>> * If the tick is already stopped, the cost of possible short
>> * idle duration misprediction is much higher, because the CPU
>> * may be stuck in a shallow idle state for a long time as a
>> - * result of it. In that case say we might mispredict and try
>> - * to force the CPU into a state for which we would have stopped
>> - * the tick, unless a timer is going to expire really soon
>> - * anyway.
>> + * result of it. In that case say we might mispredict and use
>> + * the known time to the closest timer event for the idle state
>> + * selection.
>> */
>> if (data->predicted_us < TICK_USEC)
>> - data->predicted_us = min_t(unsigned int, TICK_USEC,
>> - ktime_to_us(delta_next));
>> + data->predicted_us = ktime_to_us(delta_next);
>
> I did the testing on this, but above change cannot really resolve the
> issue, it misses to handle the case if 'data->predicted_us > TICK_USEC';
> if the prediction is longer than TICK_USEC, e.g. data->predicted_us is
> 2ms, TICK_USEC=1ms; for this case the deepest state will not be
> chosen and if the data->predicted_us is decided by typical pattern
> value but not the closest timer, finally the CPU still might stay in
> shallow state for long time.

I noticed that too in the meantime. :-)

> Actually in the CPU idle loop with the tick is stopped, I think we
> should achieve two targets:
> - Ensure the CPU can enter the deepest idle state at the last time it
> runs into into idle;
> - In the middle of idles, we will not reenable the tick at all; though
> the idle states can be chosen a shallow state for short prediction;
>
> To achieve the first target, we need to define what's the possible
> case the CPU might stay into shallow state but cannot be waken up in
> short time; so for this purpose it's pointeless to compare the value
> between 'data->predicted_us' and TICK_USEC, so I'd like to check if
> the next timer event is reliable to wake up CPU in short time, this
> can be finished by comparison between 'ktime_to_us(delta_next)' with
> maximum target residency;
>
> For the second target, we should not enable the tick again in the idle
> loop after the tick is stopped, whatever the governor choose any idle
> state.
>
> So how about below changes? I did some verify on this.

I have a similar, but somewhat different patch. I'll post it shortly.

>
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index 30ab759..e2de7c2 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -351,18 +351,21 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> data->predicted_us = min(data->predicted_us, expected_interval);
>
> if (tick_nohz_tick_stopped()) {
> + unsigned int delta_next_us = ktime_to_us(delta_next);
> +
> /*
> * If the tick is already stopped, the cost of possible short
> * idle duration misprediction is much higher, because the CPU
> * may be stuck in a shallow idle state for a long time as a
> - * result of it. In that case say we might mispredict and try
> - * to force the CPU into a state for which we would have stopped
> - * the tick, unless a timer is going to expire really soon
> - * anyway.
> + * result of it. If the next timer event is later than the
> + * maximum target residency, this means the timer event is not
> + * trusted to wake up CPU in short term and the typical pattern
> + * interval or other factors might lead to a shallow state, in
> + * that case say we might mispredict and use the known time to
> + * the closest timer event for the idle state selection.
> */
> - if (data->predicted_us < TICK_USEC)
> - data->predicted_us = min_t(unsigned int, TICK_USEC,
> - ktime_to_us(delta_next));
> + if (delta_next_us >= drv->states[drv->state_count-1].target_residency)
> + data->predicted_us = delta_next_us;
> } else {
> /*
> * Use the performance multiplier and the user-configurable
> @@ -410,12 +413,12 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> * expected idle duration is shorter than the tick period length.
> */
> if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
> - expected_interval < TICK_USEC) {
> + (!tick_nohz_tick_stopped() && expected_interval < TICK_USEC)) {
> unsigned int delta_next_us = ktime_to_us(delta_next);
>
> *stop_tick = false;
>
> - if (!tick_nohz_tick_stopped() && idx > 0 &&
> + if (idx > 0 &&
> drv->states[idx].target_residency > delta_next_us) {
> /*
> * The tick is not going to be stopped and the target