2011-04-07 18:18:19

by Kevin Hilman

Subject: [PATCH] nohz: delay going tickless under CPU load to favor deeper C states

From: Nicole Chalhoub <[email protected]>

While there is CPU load, continue the periodic tick in order to give
CPUidle another opportunity to pick a deeper C-state instead of
spending potentially long idle times in a shallow C-state.

Long-winded version:

When going idle with a high load average, CPUidle menu governor will
decide to pick a shallow C-state since one of the guiding principles
of the menu governor is "The busier the system, the less impact of C
states is acceptable" (taken from cpuidle/governors/menu.c.) That
makes perfect sense.

However, there are missed power-saving opportunities for bursty
workloads with long idle times (e.g. MP3 playback.) Given such a
workload, because of the load average, CPUidle tends to pick a shallow
C-state. Because we also go tickless, this shallow C-state is used
for the duration of the idle period. If the idle period is long, a
deeper C state would've resulted in better power savings.

This patch delays going tickless when there is a load such that on the
next tick, the CPUidle governor will have another opportunity to
pick a deeper C-state. Since the system will have been idle for
potentially a full tick, the load average will drop and a deeper C
state will most likely be chosen.

Delaying NOHZ decisions until the load is zero improved the load
estimation on our ARM/OMAP4 platform where HZ=128 and increased the
time spent in deep C-states (~50% of idle time in C-states deeper than
C1). A power saving of ~20mA at battery level is observed during MP3
playback on OMAP4/Blaze board.

Signed-off-by: Nicole Chalhoub <[email protected]>
Signed-off-by: Vincent Bour <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Thomas Gleixner <[email protected]>
[[email protected]: rework changelog]
Signed-off-by: Kevin Hilman <[email protected]>
---
kernel/time/tick-sched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index d5097c4..418066c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -324,7 +324,7 @@ void tick_nohz_stop_sched_tick(int inidle)
 	} while (read_seqretry(&xtime_lock, seq));

 	if (rcu_needs_cpu(cpu) || printk_needs_cpu(cpu) ||
-	    arch_needs_cpu(cpu)) {
+	    arch_needs_cpu(cpu) || this_cpu_load()) {
 		next_jiffies = last_jiffies + 1;
 		delta_jiffies = 1;
 	} else {
--
1.7.4
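
As a rough illustration of the one-line change above, the decision can
be modeled outside the kernel as below. This is only a sketch: struct
cpu_status, next_tick_delta() and the "patched" flag are hypothetical
names made up for this example, and the boolean fields merely stand in
for the results of rcu_needs_cpu(), printk_needs_cpu(), arch_needs_cpu()
and this_cpu_load().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the conditions checked before stopping the tick. */
struct cpu_status {
	bool rcu_needs_cpu;
	bool printk_needs_cpu;
	bool arch_needs_cpu;
	unsigned long load;	/* non-zero while the CPU still counts as loaded */
};

/*
 * Return the number of jiffies until the next tick.
 * max_delta is the distance to the next timer event, i.e. how long the
 * tick could be stopped if nothing forces it to keep running.
 * With patched == true, a non-zero load also keeps the tick for one jiffy.
 */
static unsigned long next_tick_delta(const struct cpu_status *s,
				     unsigned long max_delta, bool patched)
{
	assert(max_delta >= 1);

	if (s->rcu_needs_cpu || s->printk_needs_cpu || s->arch_needs_cpu ||
	    (patched && s->load))
		return 1;	/* keep the periodic tick for one more jiffy */

	return max_delta;	/* go tickless until the next timer event */
}
```

In this model, a loaded CPU with a far-away timer gets a delta of 1 with
the patch and the full max_delta without it, while a CPU with zero load
behaves the same either way.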


2011-04-07 19:57:09

by Arjan van de Ven

Subject: Re: [PATCH] nohz: delay going tickless under CPU load to favor deeper C states

On 4/7/2011 11:18 AM, Kevin Hilman wrote:
> From: Nicole Chalhoub<[email protected]>
>
> While there is CPU load, continue the periodic tick in order to give
> CPUidle another opportunity to pick a deeper C-state instead of
> spending potentially long i


so I don't really like this patch. It's actually a pretty bad hack (I'm
sure it'll work somewhat)
[and I mean that in the most positive sense of the word ;-) ]

what we really need instead, and this is inside cpuidle, is the option
to set a timer when we enter the non-deepest C state,
so that if that timer fires we then reevaluate.
The duration of that timer will be dependent on the C state (so should
come from the C state structure of the state we pick).

For the most shallow one this will be a relatively short time, but for
the deepest-but-one this might be a lot longer time.


your patch abuses a completely different, unrelated timer for this, with
a pretty much unspecified frequency, that also has other side effects
that we probably don't want.


it shouldn't be hard to do the right thing instead and make it a
separate timer with a per C state timeout.

(and I would say a default timeout of 10x the break even time that we
already have in the structure)
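
A minimal sketch of the per C-state timeout suggested above, assuming
the states are ordered from shallowest to deepest. The struct and field
names (idle_state, break_even_us, reeval_us) and the REEVAL_FACTOR
constant are hypothetical and chosen only for this example; they are
not the cpuidle structures themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per C-state description, ordered shallow to deep. */
struct idle_state {
	const char *name;
	unsigned int break_even_us;	/* break-even time for this state */
	unsigned int reeval_us;		/* explicit timeout, 0 = use default */
};

#define REEVAL_FACTOR 10u	/* default: 10x the break-even time */

/*
 * Timeout (in microseconds) after which the C-state choice should be
 * re-evaluated. Returns 0 for the deepest state, meaning no timer is
 * needed because there is nothing deeper to move to.
 */
static unsigned int reeval_timeout_us(const struct idle_state *states,
				      size_t n, size_t idx)
{
	assert(n > 0 && idx < n);

	if (idx == n - 1)
		return 0;
	if (states[idx].reeval_us)
		return states[idx].reeval_us;
	return REEVAL_FACTOR * states[idx].break_even_us;
}
```

With made-up break-even times of 10us, 500us and 5000us for three
states, this gives a short timeout for the shallowest state, a much
longer one for the deepest-but-one, and no timer for the deepest.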


2011-04-07 22:38:11

by Kevin Hilman

Subject: Re: [PATCH] nohz: delay going tickless under CPU load to favor deeper C states

Hi Arjan,

Arjan van de Ven <[email protected]> writes:

> On 4/7/2011 11:18 AM, Kevin Hilman wrote:
>> From: Nicole Chalhoub<[email protected]>
>>
>> While there is CPU load, continue the periodic tick in order to give
>> CPUidle another opportunity to pick a deeper C-state instead of
>> spending potentially long i
>
>
> so I don't really like this patch. It's actually a pretty bad hack
> (I'm sure it'll work somewhat)
> [and I mean that in the most positive sense of the word ;-) ]

I'll take it as a compliment then. :)

I agree though, it did feel somewhat like we were attempting to fix the
problem in the wrong place.

> what we really need instead, and this is inside cpuidle, is the option
> to set a timer when we enter the non-deepest C state,
> so that if that timer fires we then reevaluate.
> The duration of that timer will be dependent on the C state (so should
> come from the C state structure of the state we pick).

OK, this sounds like a good idea. Will experiment.

Of course, setting new timers can affect the governor's decision. To
avoid that, I guess this timer will need to be one-shot, and only set
after the CPUidle governor has made a decision, otherwise that timer
itself will affect tick_nohz_get_sleep_length() which the governor uses
to pick a C-state.
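
To make the ordering concern concrete, here is a toy model, not the
menu governor. It assumes a simplified governor that picks the deepest
state whose break-even time fits in the predicted sleep length, and a
sleep length that is simply the time to the nearest pending timer. The
function names pick_state() and sleep_length_us() are hypothetical.

```c
#include <assert.h>

/*
 * Toy governor: return the index of the deepest state whose break-even
 * time fits in the predicted sleep length. break_even_us[] is assumed
 * to be sorted in ascending order (shallow to deep).
 */
static int pick_state(const unsigned int *break_even_us, int n,
		      unsigned int sleep_us)
{
	int i, best = 0;

	assert(n > 0);
	for (i = 1; i < n; i++)
		if (break_even_us[i] <= sleep_us)
			best = i;
	return best;
}

/*
 * Sleep length as the toy governor sees it: the time to the nearest
 * pending timer. reeval_timer_us == 0 means no re-evaluation timer is
 * armed yet.
 */
static unsigned int sleep_length_us(unsigned int next_timer_us,
				    unsigned int reeval_timer_us)
{
	if (reeval_timer_us && reeval_timer_us < next_timer_us)
		return reeval_timer_us;
	return next_timer_us;
}
```

In this model, arming a short re-evaluation timer before the selection
shrinks the sleep length the governor sees and pushes it towards a
shallower state, which is why the timer would have to be armed only
after the decision has been made.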

> For the most shallow one this will be a relatively short time, but for
> the deepest-but-one this might be a lot longer time.
>
> your patch abuses a completely different, unrelated timer for this,
> with a pretty much unspecified frequency, that also has other side
> effects that we probably don't want.

What side effects come to mind? The only side effects that I could
think of were (potentially) unwanted wakeups from C1. However, since C1
is presumably cheap to enter (and exit), it seemed like a worthwhile
cost since you're almost certain to pick a deeper C state after wakeup.

That being said, your idea of per C-state timer is much better than
relying on the scheduler tick. On most ARM systems, HZ is still pretty
low (around 100), so the time between ticks is relatively long, but on a
HZ=1000 setup, I could see the extra wakeups having a penalty of their
own.
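
For reference, the tick periods behind that comparison work out as in
the sketch below. tick_period_us() is a hypothetical helper written for
this example; the HZ values are the ones mentioned in this thread.

```c
#include <assert.h>

/*
 * Tick period in microseconds for a given HZ, using integer division
 * (so HZ=128 yields 7812, truncated from 7812.5).
 */
static unsigned int tick_period_us(unsigned int hz)
{
	assert(hz > 0);
	return 1000000u / hz;
}
```

So a CPU kept on the tick waits up to about 10ms at HZ=100 or about
7.8ms at HZ=128 before the next re-evaluation, but only 1ms at HZ=1000,
which means up to ten times as many extra wakeups per second as at
HZ=100.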

> it shouldn't be hard to do the right thing instead and make it a
> separate timer with a per C state timeout.

Agreed. Will give it a try.

> (and I would say a default timeout of 10x the break even time that we
> already have in the structure)

OK.

Thanks for the review and suggestions,

Kevin