2014-01-07 16:19:57

by Morten Rasmussen

Subject: [0/11][REPOST] Energy-aware scheduling use-cases and scheduler issues

Reposting the series with LKML on cc as well.

Original thread (with a few replies) can be found here:
http://article.gmane.org/gmane.linux.power-management.general/41501

Sorry for double-posting.

Morten

------------------------------------------------------------------------

Hi,

One of the requests from the scheduler maintainers at the Energy-aware
Scheduling workshop at Kernel Summit this year was to provide plain text
descriptions of use-cases (workloads) and system topologies. To get that
moving I have written some short texts about some use-cases. In addition
I have described a list of issues, mainly in the scheduler, that today
prevent a good energy/performance balance from being achieved in common
use-cases.
The follow-up emails are structured as follows:

1-6: Current issues related to energy/performance balance.
7-10: Use-cases (overall behaviour and energy/performance goals)
11: DVFS example (for reference)

I'm hoping that this provides some of the background for why I'm
interested in improving energy-awareness in the scheduler. I'm aware
that the use-cases and issues/wishlist don't cover everyone's area of
interest. Input is needed to fix that.

Comments and input are appreciated.

Morten


2014-01-07 16:20:14

by Morten Rasmussen

Subject: [1/11] issue 1: Missing power topology information in scheduler

The current mainline scheduler has no power topology information
available to enable it to make energy-aware decisions. The energy cost
of running a cpu at different frequencies and the energy cost of waking
up another cpu are needed.

One example where this could be useful is audio playback on Android.
With the current mainline scheduler it utilizes three cpus when active.
Since the tasks are small, it is still possible to meet the performance
criteria when execution is serialized on a single cpu. Depending on the
power topology, leaving two cpus idle and running one for longer may
lead to energy savings if the cpus can be power-gated individually.

The audio performance requirements can be satisfied by most cpus at the
lowest frequency. Video is a more interesting use-case due to its higher
performance requirements. Running all tasks on a single cpu is likely to
require a higher frequency than if the tasks are spread out across
more cpus.

Running Android video playback on an ARM Cortex-A7 platform with 1, 2,
and 4 cpus online has lead to the following power measurements
(normalized):

video 720p (Android)
cpus    power
1       1.59
2       1.00
4       1.10

Restricting the number of cpus to one forces the frequency up to cope
with the load, but the overall cpu load is only ~60% (busy %-age). Using
two cpus keeps the frequency in the more power efficient range and gives
a ~37% power reduction. With four cpus the power consumption is worse,
likely due to the ~100% increase in wake/idle transitions.

For this use-case it appears that the optimal busy %-age is ~30% (use
two cpus). However, that is likely to vary depending on the use-case.

Proposed solution: Represent the energy cost of each P-state and C-state
in the topology to enable the scheduler to estimate the energy cost of
its scheduling decisions. Coupled with P-state awareness, that would
allow the scheduler to avoid expensive high P-states.
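
As a rough illustration of what such a representation could enable (a
sketch only; the cost table below is made up, loosely echoing the
normalized DVFS numbers in mail 11, and is not from any real platform),
the scheduler could compare the estimated energy of packing work on one
cpu at a high P-state against waking a second cpu and running both at a
low P-state:

/* Illustrative sketch only: a made-up per-cpu energy model and a helper
 * comparing packing work on one cpu at a high P-state against waking a
 * second cpu and running both at the lowest P-state. */
#include <stdio.h>

struct pstate_cost {
	unsigned int freq_khz;
	double energy_per_cycle;	/* normalized */
};

struct cstate_cost {
	const char *name;
	double wakeup_energy;		/* normalized cost of leaving the state */
};

static const struct pstate_cost pstates[] = {
	{  500000, 1.0 },	/* most efficient */
	{ 1000000, 1.9 },
	{ 1500000, 2.7 },	/* least efficient */
};

static const struct cstate_cost cstates[] = {
	{ "WFI",            5.0 },
	{ "cluster-off", 1000.0 },
};

/* Energy to execute 'cycles' cycles at P-state index p. */
static double run_energy(int p, double cycles)
{
	return cycles * pstates[p].energy_per_cycle;
}

int main(void)
{
	double work = 1e6;	/* cycles of pending work per period */

	/* Option A: serialize everything on one cpu, forcing a high P-state. */
	double packed = run_energy(2, work);

	/* Option B: wake a second cpu from WFI and split the work, so both
	 * cpus can stay at the lowest P-state. */
	double spread = cstates[0].wakeup_energy + 2 * run_energy(0, work / 2);

	printf("packed: %.0f  spread: %.0f -> %s wins\n",
	       packed, spread, spread < packed ? "spread" : "packed");
	return 0;
}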

2014-01-07 16:20:22

by Morten Rasmussen

Subject: [11/11] system 1: Saving energy using DVFS

Most modern systems use DVFS to save power by slowing down computation
throughput when less performance is necessary. The power/performance
relation is platform specific. Some platforms may have better energy
savings (energy per instruction) than others at low frequencies.

To have something to relate to, here is an anonymized example based on
a modern ARM platform:

Performance    Energy/instruction
1.0            1.0
1.3            1.6
1.7            1.8
2.0            1.9
2.3            2.1
2.7            2.4
3.0            2.7

Performance is frequency (~instruction issue rate) and
energy/instruction is the energy cost of executing one instruction (or a
fixed number of instructions) at that level of performance (frequency).
For this example, it costs 2.7x as much energy per instruction at
performance 3.0 as at 1.0 (a 3x performance increase). That is, the
amount of work (instructions) that can be done on one battery charge is
reduced to ~37% (a factor of 2.7, i.e. ~63% less) if you run as fast as
possible (3.0) compared to running at the slowest frequency (1.0).
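
As a small worked example of the table above (a sketch that just
hard-codes the normalized values already listed), the relative amount of
work per battery charge at each performance level is simply 1/(energy
per instruction):

/* Sketch: work-per-battery-charge relative to the lowest frequency,
 * using the normalized (performance, energy/instruction) pairs above. */
#include <stdio.h>

int main(void)
{
	static const double perf[]    = { 1.0, 1.3, 1.7, 2.0, 2.3, 2.7, 3.0 };
	static const double e_per_i[] = { 1.0, 1.6, 1.8, 1.9, 2.1, 2.4, 2.7 };
	int i;

	for (i = 0; i < 7; i++)
		printf("perf %.1f: %.0f%% of the work per charge\n",
		       perf[i], 100.0 / e_per_i[i]);

	/* At perf 3.0 only ~37% of the instructions fit in one charge
	 * compared to perf 1.0, i.e. a ~63% reduction. */
	return 0;
}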

A lot of things haven't been accounted for in this simplified example.
There are a number of factors that influence the energy efficiency,
including whether the cpu is the only one awake in its frequency/power
domain or not. The numbers shown above won't be accurate for all
workloads. They are meant as ballpark figures.

To save energy, the higher frequencies should be avoided and only used
when the application performance requirements can not be satisfied
otherwise (e.g. spread tasks across more cpus if possible).

When considering the total system power it may save energy in some
scenarios by running the cpu faster to allow other power hungry parts of
the system to be shut down faster. However, this is highly platform and
application dependent.

2014-01-07 16:20:29

by Morten Rasmussen

Subject: [9/11] use-case 3: Video playback on Android

Depending on the platform hardware, video playback is a low to medium
load periodic application. There may be some variation in the load
depending on the video codec, content, and resolution. The load pattern
is roughly synchronized to the video frame-rate (typically 30 FPS). Video
playback also includes audio playback as part of the workload.

Performance Criteria

Video decoding must be done in time to avoid dropped frames. Similarly,
audio must be decoded in time to never let the audio buffer run empty.

Task behaviour

Based on video playback (720p and 1080p) on a modern ARM SoC, the cpu
load is generally modest. Video resolution has only a minor impact on
the overall cpu load. The load pattern repeats every ~33 ms.

Rendering task: The main Android graphics rendering task accounts for
about 18% of the total cpu load. It is active for ~6-8 ms each
period, during which it blocks a couple of times. It appears to run
in sync with a handful of other tasks, most of which are related to
timed queues in Android and graphics.

Timed events queue tasks: An Android TimedEventQueue task for each cpu
is active during video playback. In total, they account for around 30%
of the cpu load. They all follow the 33 ms period and run for
250-700 us when scheduled (average).

Audio decoding task: Accounts for ~10% of the cpu load. The load pattern
repeats every ~23 ms. Since this is different from the period of video
rendering, it may run in parallel with video rendering from time to
time. Audio decoding runs for about 2 ms when scheduled each period.

A lot of smaller tasks are involved in the periodic load pattern.

2014-01-07 16:20:39

by Morten Rasmussen

Subject: [8/11] use-case 2: Audio playback on Android

Audio playback is a low load periodic application that has little/no
variation in period and load over time. It consists of tasks involved in
decoding the audio stream and communicating with audio frameworks and
drivers.

Performance Criteria

All tasks must have completed before the next request to fill the audio
buffer. Most modern hardware should be able to deal with the load even
at the lowest P-state.

Task behaviour

The task load pattern period is dictated by the audio interrupt. On an
example modern ARM based system this occurs every ~6 ms. The decoding
work is triggered every fourth interrupt, i.e. a ~24 ms period. No tasks
are scheduled at the intermediate interrupts. The tasks involved are:

Main audio framework task (AudioOut): The first task to be scheduled
after the interrupt; it continues running until decoding has completed,
i.e. for ~5 ms. Runs at nice=-19.

Audio framework task 2 (AudioTrack): Woken up by the main task ~250-300
us after the main audio task is scheduled. Runs for ~300 us at nice=-16.

Decoder task (mp3.decoder): Woken up by audio framework task 2 when it
finishes (serialized). Runs for ~1 ms until it wakes a third Android
task, on which it blocks, and continues for another ~150 us afterwards
(serialized). Runs at nice=-2.

Android task 3 (OMXCallbackDisp): Woken by decoder task. Runs for ~300
us at nice=-2.

2014-01-07 16:20:49

by Morten Rasmussen

Subject: [10/11] use-case 4: Game on Android

Games generally have a periodic load pattern synchronized to the
frame-rate (30 or 60 Hz). Game workloads typically involve both
graphics rendering (game engine) and audio mixing.

Performance Criteria

Keep the frame-rate as close to the target as possible. Variations are
acceptable. Audio must be handled before the audio buffer runs empty.

Task behaviour

This description is based on one particular Android game, but similar
patterns have been observed for a number of games. Overall, 10+ threads
are active and context switches happen very often. Key game engine tasks
and graphics driver tasks are scheduled ~200-700 times per second. The
top 10 tasks (by cpu time) consist of: one game task, one main game
engine task, three graphics related tasks, three audio tasks, one event
handling task, and one kworker task.

Game engine task: By far the most cpu intensive task. Accounts for about
50% of all cpu load. It is scheduled ~375 times per second (average).
The scheduling pattern repeats every ~16 ms (~60 Hz), where the task
runs for ~12 ms, followed by three shorter periods of activity where the
longest is ~2 ms (unless it is preempted by other tasks). In addition,
the game engine has a worker thread for each cpu. Each of the worker
threads accounts for ~0.4% of the load, is scheduled ~115 times per
second (average), and only runs for ~56 us (average).

Rendering task: Accounts for ~6% of the load. Scheduled ~200 times per
second (average) and runs for ~420 us (average).

Graphics driver task: Accounts for ~6% of the load. Scheduled ~700 times
per second (average) and runs for 11 us (average).

Game main task: Accounts for ~4% of the load. Scheduled ~170 times per
second (average) and runs for ~37 us (average).

Audio system task: Accounts for ~3% of the load. Scheduled ~120 times
per second (average) and runs for ~42 us (average).

kworker task: Accounts for ~3% of the load. Scheduled ~320 times per
second (average) and runs for ~13 us (average).

2014-01-07 16:21:24

by Morten Rasmussen

Subject: [7/11] use-case 1: Webbrowsing on Android

Common webbrowsing use-cases (no embedded videos, but dynamic content
is ok) typically exhibit three distinct modes of operation depending on
what the browser is doing in relation to the user:

1)
Mode: Page load and rendering.

Behaviour: The duration depends highly on the website but is
relatively short, typically a few seconds. Page loading time impacts
user experience directly. Minor performance drops may be acceptable if
they come with good overall energy savings.

Performance criteria: Complete as fast as possible.


2)
Mode: Display website (user reading, no user interaction)

Behaviour: Low load. Only minor updates of dynamic contents.

Performance criteria: Minimize energy.


3)
Mode: Page scrolling.

Behaviour: Relatively short in duration. Rendering of content that was
previously off-screen.

Performance criteria: Ensure smooth UI interaction. Without UI
experience feedback (lag, etc.), optimizing for best performance might be
the only way to get the necessary performance.


Task behaviour

The task descriptions are based on traces from a modern ARM platform
with a fairly recent version of Android. It may be different on other
platforms and software stacks. This serves just as an example.

There are three main tasks involved on the browser side, and one or more
tasks related to the graphics driver. Each of them behaves differently
in each of the three modes of operation.

Render task: Mainly active in mode 1, but also active in mode 3.
Accounts for about a third of the total cpu time. May run as a more or
less continuous burst of 1-2 s in mode 1.

Texture task: Active in all modes, but mainly in modes 1 and 3. Active
after the render task burst in mode 1. Somewhat periodic behaviour
during modes 1 and 3, indicating dependencies on other tasks. Accounts
for about a sixth of the cpu time.

Browser task: Active in all modes. Blocks often. Only running about half
the time when it is active. Short occasional periods of activity in mode
2 along with the texture task. Accounts for about a third of the cpu
time.

Graphics driver task: Mainly active in modes 1 and 3, with very little
activity in mode 2 and only when the browser and texture tasks are
active. Runs for a short amount of time but frequently when active.
Accounts for about a sixth of the cpu time.

2014-01-07 16:21:35

by Morten Rasmussen

Subject: [5/11] issue 5: Frequency and uarch invariant task load

Related to the issue of potential cpu capacity, task load is influenced
directly by the current P-state of the cpu it is running on. For
energy-aware task placement decisions the scheduler would need to
estimate the energy impact of scheduling a specific task on a specific
cpu. Depending on the resulting P-state it may be more energy efficient
to wake up another cpu (see system 1 in mail 11 for an energy efficiency
example).

The frequency and uarch impact can be rather significant. On modern
systems frequency scaling covers a range of 5-6x. On top of that uarch
differences may give another 1.5-3x for a total cpu capacity range
covering >10x.

Measurements on ARM TC2 for a simple periodic test workload (single
task, 16 ms period):

          cpu load              load_avg_contrib (10 sample avg.)
Freq      A7        A15         A7        A15
500       16.76%    9.94%       ~201      ~135
700       12.06%    6.95%       ~145      ~87
1000      8.19%     5.23%       ~103      ~65

The cpu load estimate used for load balancing is based on
load_avg_contrib which means that for this example the load estimate may
vary 3x depending on where tasks are scheduled and the frequency scaling
governors used.

Potential solution: Frequency invariance has been proposed before [1]
where the task load is scaled by the cur/max freq ratio. Another
possibility is to use hardware counters if such are available on the
platform.

[1] https://lkml.org/lkml/2013/4/16/289
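
For illustration, a minimal sketch of the proposed scaling (not the
actual scheduler code; the helper name is made up): the tracked load is
scaled by the current/max frequency ratio so that the same task yields
roughly the same figure regardless of the P-state it happened to run at.
The raw numbers below are the A7 load_avg_contrib values from the table
above:

/* Sketch of frequency-invariant load scaling: scale the raw tracked
 * load by curr_freq/max_freq. Names are illustrative. */
#include <stdio.h>

static unsigned long scale_load(unsigned long raw_load,
				unsigned int curr_khz, unsigned int max_khz)
{
	return raw_load * curr_khz / max_khz;
}

int main(void)
{
	/* The same periodic task tracked on the A7 at 500 MHz and at
	 * 1 GHz (values from the TC2 table above): the raw figures
	 * differ ~2x, the scaled figures come out comparable. */
	printf("500 MHz: raw ~201 -> scaled %lu\n",
	       scale_load(201,  500000, 1000000));
	printf("1 GHz:   raw ~103 -> scaled %lu\n",
	       scale_load(103, 1000000, 1000000));
	return 0;
}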

2014-01-07 16:21:42

by Morten Rasmussen

Subject: [6/11] issue 6: Poor and non-deterministic performance on heterogeneous systems

The current mainline scheduler doesn't give optimum performance on
heterogeneous systems for workloads with few tasks (#tasks <= #cpus).
Using cpu_power (in its current form) to inform the scheduler about the
relative compute capacity of the cpus is not sufficient.

1. cpu_power is not used on wake-up, which means that new tasks may end
up anywhere. Periodic load-balance generally bails out if there is only
one task running on a cpu, so the task isn't moved later. Hence, the
execution time of the task may be anywhere between what it would have
been running exclusively on the fastest cpu and what it would have been
running exclusively on the slowest cpu.

Running a single cpu intensive task on an otherwise idle system while
measuring its execution time will show this problem. On ARM TC2
(big.LITTLE) we get the following numbers:

cpu_power          1024       606/1441
                   default    slow/fast
execution time:
(100 runs)
Max                4.33       4.33
Min                2.09       2.91
Distribution:
Runs within
5% of Min          14         11
5% of Max          86         89

Only a few runs randomly ended up on a fast cpu irrespective of the
cpu_power settings. The distribution can easily change depending on
other tasks, reordering the cpus, or changing the topology.

The problem can also be observed for smartphone workloads like
webbrowsing where page rendering times vary significantly as the threads
are randomly scheduled on fast and slow cpus.

2. Using cpu_power to represent the relative performance of the cpus
leads to undesirable task balance in common scenarios. group_power =
sum(cpu_power) for a group of cpus and is used in the periodic
load-balance, idle balance, and nohz idle balance to determine the
number of tasks that should be in each group. However, depending on the
number of cpus in the groups, that causes one group to be overloaded
while another has idle cpus if the number of tasks is equal to the
number of cpus (or slightly larger).

Running a simple parallel workload (OpenMP) will reveal this as it uses
one worker thread per cpu by default. On ARM TC2 we get the following
behaviour:

cpu_power          1024     606/1441 (slow/fast)
execution time:
(20 runs)
avg                8.63     9.87     14.34% (slowdown)
stdev              0.01     0.01

The kernelshark trace reveals that the 606/1441 configuration puts three
tasks on the two fast cpus and two tasks on the three slow cpus, leaving
one of them idle. The 1024 case has one task per cpu.
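
The imbalance follows directly from distributing tasks in proportion to
group_power. A small sketch of the arithmetic (illustrative only, using
the cpu_power values above and the five worker threads of the OpenMP
run):

/* Sketch: distributing 5 tasks across a fast group (2 cpus) and a slow
 * group (3 cpus) in proportion to group_power, as the periodic
 * load-balance does. cpu_power values match the TC2 example above. */
#include <stdio.h>

int main(void)
{
	unsigned int fast_power = 2 * 1441;	/* 2 x A15 */
	unsigned int slow_power = 3 * 606;	/* 3 x A7 */
	unsigned int total = fast_power + slow_power;
	unsigned int ntasks = 5;		/* one OpenMP worker per cpu */

	/* Rounded share of tasks per group, proportional to group_power. */
	unsigned int fast_tasks = (ntasks * fast_power + total / 2) / total;
	unsigned int slow_tasks = ntasks - fast_tasks;

	printf("fast group: %u tasks on 2 cpus\n", fast_tasks);
	printf("slow group: %u tasks on 3 cpus (one cpu left idle)\n",
	       slow_tasks);
	return 0;
}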

Overall cpu_power in its current form does not solve any of the
performance issues on heterogeneous systems. It even makes them worse
for some common workload scenarios.

2014-01-07 16:21:50

by Morten Rasmussen

Subject: [3/11] issue 3: No understanding of potential cpu capacity

To minimize energy it may sometimes be better to put waking tasks on
partially loaded cpus instead of powering up more cpus (particularly if
it implies powering up a new cluster/group of cpus with associated
caches). To make that call, information about the potential spare cycles
on the busy cpus is required.

Currently, the CFS scheduler has no knowledge about frequency scaling.
Frequency scaling governors generally try to match the frequency to
the load, which means that the idle time has no absolute meaning. The
potential spare cpu capacity may be much higher than indicated by the
idle time if the cpu is running at a low P-state.
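
To make the point concrete, here is a small sketch (illustrative
numbers, not scheduler code): a cpu that appears 40% idle while running
at half its maximum frequency actually has much more spare capacity than
the idle time suggests:

/* Sketch: estimate spare capacity taking the current P-state into
 * account rather than reading it off the idle time alone.
 * All names and numbers are illustrative. */
#include <stdio.h>

int main(void)
{
	unsigned int max_freq = 1000000;	/* kHz */
	unsigned int cur_freq =  500000;	/* kHz, low P-state */
	double busy_fraction = 0.6;		/* 60% busy, 40% idle */

	/* Naive view: spare capacity == idle time. */
	double naive_spare = 1.0 - busy_fraction;

	/* Frequency-aware view: the busy time only consumed
	 * busy * cur/max of the cpu's potential capacity. */
	double used = busy_fraction * cur_freq / max_freq;
	double real_spare = 1.0 - used;

	printf("idle-time spare: %.0f%%  actual spare: %.0f%%\n",
	       naive_spare * 100, real_spare * 100);
	return 0;
}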

The energy trade-off may justify putting another task on a loaded cpu
even if it causes a change to a higher P-state to handle the extra load.
Related issues are frequency (and cpu micro architecture) invariant task
load and power topology information, which are both needed to enable the
scheduler for energy-aware task placement. This is covered in more
detail in issue 5.

The potential cpu capacity cannot be assumed to be constant as thermal
management may restrict the usage of high performance P-states
dynamically.

2014-01-07 16:22:04

by Morten Rasmussen

Subject: [4/11] issue 4: Tracking idle states

Similar to the issue of knowing the potential capacity of a cpu, the CFS
scheduler also needs to know the idle state of idle cpus. Currently, an
idle cpu is found using cpumask_first() when an extra cpu is needed (for
nohz_idle_balance in find_new_ilb() in sched/fair.c). The energy
trade-off of whether to wake another cpu or put tasks on already busy
cpus depends on this information.

The cost of waking up a cpu in terms of latency and energy depends on
the idle state the cpu is in. Deeper idle states typically affect more
than a single cpu. Waking up a single cpu from such a state is more
expensive as it also affects the idle states of its related cpus.

Energy costs are not currently represented in the cpuidle framework, but
latency is. Take ARM TC2 as an example [1]; it has two idle states:
per-core clock-gating (WFI) and cluster power-down (power down all
related cpus and caches). The target residencies and exit latencies
specified in the driver give an idea about the cost involved in
entering/exiting these states.

                      Target      Exit
                      residency   latency
Clock-gating (WFI)    1           1
Cluster power-down    2000/2500   500/700   (big/LITTLE)

Picking the cheapest idle cpu would also have the effect that wake-ups
are likely to happen on the same cpu and leave the remaining cpus in
idle for longer.

Potential solution: Make the scheduler idle state aware by either moving
idle handling into the scheduler or let the idle framework (cpuidle)
maintain a cpumask of the cheapest cpus to wake up which is accessible
to the scheduler.

[1] drivers/cpuidle/cpuidle-big_little.c
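
A sketch of the second option (illustrative; the structures and helper
are made up, only the exit latencies follow the TC2 numbers above): keep
per-cpu wake-up costs and pick the cheapest idle cpu instead of the
first one:

/* Sketch: pick the cheapest idle cpu by exit latency instead of simply
 * the first idle cpu. The structures are made up; the latencies follow
 * the TC2 numbers above (cluster power-down vs. WFI). */
#include <stdio.h>

#define NR_CPUS 5

struct cpu_idle_info {
	int idle;			/* is the cpu idle? */
	unsigned int exit_latency_us;	/* cost of waking it up */
};

static int cheapest_idle_cpu(const struct cpu_idle_info *info, int nr)
{
	int cpu, best = -1;

	for (cpu = 0; cpu < nr; cpu++) {
		if (!info[cpu].idle)
			continue;
		if (best < 0 ||
		    info[cpu].exit_latency_us < info[best].exit_latency_us)
			best = cpu;
	}
	return best;
}

int main(void)
{
	/* Two big cpus whose cluster is powered down, three little cpus:
	 * one busy, two clock-gated (WFI). cpumask_first() would pick
	 * cpu 0 (expensive); the cheapest is cpu 3. */
	struct cpu_idle_info tc2[NR_CPUS] = {
		{ 1, 500 },	/* big, cluster power-down */
		{ 1, 500 },	/* big, cluster power-down */
		{ 0,   0 },	/* little, busy */
		{ 1,   1 },	/* little, WFI */
		{ 1,   1 },	/* little, WFI */
	};

	printf("cheapest idle cpu: %d\n", cheapest_idle_cpu(tc2, NR_CPUS));
	return 0;
}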

2014-01-07 16:22:14

by Morten Rasmussen

Subject: [2/11] issue 2: Energy-awareness for heterogeneous systems

While performance is non-deterministic with the mainline scheduler
(described in issue 6), it also leads to non-deterministic energy
consumption. The first step is to get performance right, but if we don't
keep energy in mind, heterogeneous systems will end up with both high
performance and high energy consumption.

To save energy, low intensity workloads should not be scheduled on fast
cpus as these are generally less energy efficient. Audio playback is an
example where the performance offered by the slow cpus in today's
heterogeneous systems like ARM big.LITTLE is more than sufficient.

The mainline scheduler may schedule it on any cpu, leading to
non-deterministic energy consumption. For Android mp3 audio playback on
ARM TC2 (2xA15+3xA7), the energy expense when using just the big cpus
(A15s) is 3.63x that of using just the little cpus (A7s).

If we run multiple workloads at the same time, e.g. audio and
webbrowsing, both performance and energy are non-deterministic. Because
of audio we may even get poor webbrowsing performance and high energy
consumption at the same time.

Running that scenario on Android on ARM TC2 gives the following
execution times and energy measurements for 10 runs (normalized to avg):

Run   Exec   Energy
1     1.03   1.04
2     1.12   1.11    Worst energy
3     0.85   1.08
4     0.85   1.08    Best performance
5     0.94   1.06
6     1.01   0.78
7     0.90   0.63    Best performance/energy and best energy
8     1.22   1.08    Worst performance/energy and worst performance
9     0.94   1.08
10    1.14   1.07

Run 7 had a very good schedule as it led to both the lowest energy and
good performance at the same time. That is not generally the case. Run 2
is an example of a poor schedule where performance is 12% worse than
average and energy is 11% higher. The best performance (runs 3 and 4)
comes at the cost of high energy.

While run 7 seems to be ideal from an energy-awareness point of view,
it may be disqualified by performance constraints. Hence, ideally the
performance level should be tunable.

Possible solution: We know that a simple heuristic that controls task
placement based on tracked load works rather well for most smartphone
workloads. However, realistic patterns exist that defeat this heuristic.
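
For reference, a minimal sketch of such a heuristic (the threshold and
names are made up for illustration): tasks with a small tracked load go
to a little cpu, everything else to a big cpu:

/* Sketch of a tracked-load placement heuristic for big.LITTLE:
 * small tasks go to little cpus, big tasks to big cpus.
 * The threshold and names are illustrative. */
#include <stdio.h>

#define LITTLE_THRESHOLD 200	/* load_avg_contrib units, made up */

static const char *place_task(unsigned long load_avg_contrib)
{
	return load_avg_contrib < LITTLE_THRESHOLD ? "little" : "big";
}

int main(void)
{
	/* e.g. mp3 decoding is tiny, page rendering is heavy. */
	printf("audio task (load ~100)  -> %s cpu\n", place_task(100));
	printf("render task (load ~900) -> %s cpu\n", place_task(900));
	return 0;
}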

2014-01-08 12:31:43

by Peter Zijlstra

Subject: Re: [5/11] issue 5: Frequency and uarch invariant task load

On Tue, Jan 07, 2014 at 04:19:41PM +0000, Morten Rasmussen wrote:
> Potential solution: Frequency invariance has been proposed before [1]
> where the task load is scaled by the cur/max freq ratio. Another
> possibility is to use hardware counters if such are available on the
> platform.
>
> [1] https://lkml.org/lkml/2013/4/16/289

Right, I just had a look at those patches.. they're not horrible but I
think they're missing a few opportunities.

My main objection to them is that I think the newly introduced
max_capacity is exactly what the current cpu_power thing is -- then
again, I still haven't let the entire thing sink in well enough.

Not to mention we need to fix some of the cpu_power abuse -- like the
correlation to capacity, which as stated in previous emails should be
sorted using utilization.

So DVFS certainly makes sense, and would indeed be required in order to
make sensible decisions in the face of P states. Even in the face of
funny hardware like Intel which pretty much ignores whatever you tell it
and does it own merry thing.


A few random thoughts:

- I think for SMP-nice we want to migrate from /max_capacity to
/curr_capacity; because SMP-nice cares about 100% utilization
regardless of the actual P state. If we're somehow forced into a
lower P state (thermal or otherwise) fairness is best served by
normalizing at the rate we're actually running at, not the potential
maximal.

- We need to re-think SMT and turbo-bins in general; I think we can
think of those two as the same effective thing. This does mean Intel
chips will have a dual layer of this goo, and we can currently barely
deal with the 1 SMT layer, let alone do something sensible with 2.

To clarify, a single SMT thread will generally go 'faster' on its own
since it doesn't need to compete with the other thread(s) for core
resources, but together they might better utilize the core resources
giving an over-all throughput win.

Similar for turbo bins, a single core can go faster on its own since
it doesn't have competition for energy and thermal constraints, but
together cores can probably achieve greater throughput.

So we need a better way to describe this capacity dependency and
variability.

I'm fairly sure ARM doesn't do SMT, but they certainly suffer from
thermal caps and can thus have effective turbo bins, even though
they're not explicit and magic like with Intel.

And of course the honorary mention goes to Power7 which has
asymmetric bins -- let's hope they fix it and nobody else thinks them
a great idea.

- For hardware without P state controls, or hardware that pretty much
ignores them, we need means of obtaining the max and curr capacity.

Intel has the APERF, MPERF registers which resp. count at actual
frequency and fixed frequency. Using them is a bit tricky since
APERF doesn't count when idle, but when filtering out the idle time
they do provide a current performance ratio.

From that we could obtain a max performance ratio by using a wide
window max on the current value or somesuch.

Again, SMT and turbo-bins will complicate matters..

Other CPUs that have magic P state control might not provide such
registers which would require PMU resources, which would completely
blow :/
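
As an illustration of the APERF/MPERF idea, here is a userspace sketch
(assuming an x86 machine with the msr module loaded and read access to
/dev/cpu/0/msr; IA32_MPERF is MSR 0xe7 and IA32_APERF is MSR 0xe8). It
does not filter out idle time, which, as noted above, would be needed
for a usable figure:

/* Sketch: current performance ratio from APERF/MPERF deltas over a
 * one-second interval, read via the msr device. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#define MSR_IA32_MPERF 0xe7
#define MSR_IA32_APERF 0xe8

static uint64_t rdmsr(int fd, off_t msr)
{
	uint64_t val = 0;

	if (pread(fd, &val, sizeof(val), msr) != (ssize_t)sizeof(val))
		perror("pread");
	return val;
}

int main(void)
{
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	if (fd < 0) {
		perror("open /dev/cpu/0/msr");
		return 1;
	}

	uint64_t a0 = rdmsr(fd, MSR_IA32_APERF);
	uint64_t m0 = rdmsr(fd, MSR_IA32_MPERF);
	sleep(1);
	uint64_t a1 = rdmsr(fd, MSR_IA32_APERF);
	uint64_t m1 = rdmsr(fd, MSR_IA32_MPERF);
	close(fd);

	printf("current performance ratio: %.2f\n",
	       (double)(a1 - a0) / (double)(m1 - m0));
	return 0;
}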

2014-01-13 20:53:20

by Rafael J. Wysocki

Subject: Re: [3/11] issue 3: No understanding of potential cpu capacity

On Tuesday, January 07, 2014 04:19:39 PM Morten Rasmussen wrote:
> To minimize energy it may sometimes be better to put waking tasks on
> partially loaded cpus instead of powering up more cpus (particularly if
> it implies powering up a new cluster/group of cpus with associated
> caches). To make that call, information about the potential spare cycles
> on the busy cpus is required.

That generally is not the only thing that matters. There's one more factor
called "responsiveness" that used to be popular in the past. It, roughly,
is about how much time it takes for the system to respond to user actions,
on the average.

> Currently, the CFS scheduler has no knowledge about frequency scaling.
> Frequency scaling governors generally try to match the frequency to
> the load, which means that the idle time has no absolute meaning. The
> potential spare cpu capacity may be much higher than indicated by the
> idle time if the cpu is running at a low P-state.
>
> The energy trade-off may justify putting another task on a loaded cpu
> even if it causes a change to a higher P-state to handle the extra load.
> Related issues are frequency (and cpu micro architecture) invariant task
> load and power topology information, which are both needed to enable the
> scheduler for energy-aware task placement. This is covered in more
> detail in issue 5.
>
> The potential cpu capacity cannot be assumed to be constant as thermal
> management may restrict the usage of high performance P-states
> dynamically.

That's correct. Moreover, all of the above seems to assume that we can get
exact power numbers for all of the involved C-states and P-states. What if
we can't?

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2014-01-14 10:28:05

by Peter Zijlstra

Subject: Re: [3/11] issue 3: No understanding of potential cpu capacity

On Mon, Jan 13, 2014 at 10:07:12PM +0100, Rafael J. Wysocki wrote:
> > Currently, the CFS scheduler has no knowledge about frequency scaling.
> > Frequency scaling governors generally try to match the frequency to
> > the load, which means that the idle time has no absolute meaning. The
> > potential spare cpu capacity may be much higher than indicated by the
> > idle time if the cpu is running at a low P-state.
> >
> > The energy trade-off may justify putting another task on a loaded cpu
> > even if it causes a change to a higher P-state to handle the extra load.
> > Related issues are frequency (and cpu micro architecture) invariant task
> > load and power topology information, which are both needed to enable the
> > scheduler for energy-aware task placement. This is covered in more
> > detail in issue 5.
> >
> > The potential cpu capacity cannot be assumed to be constant as thermal
> > management may restrict the usage of high performance P-states
> > dynamically.
>
> That's correct. Moreover, all of the above seems to assume that we can get
> exact power numbers for all of the involved C-states and P-states. What if
> we can't?

On average more or less correct should be fine; in which case the
result will on average still be better.

Obviously the more reliable the input to the model the better the
results, but as long as the input numbers are more or less in the right
ballpark the model should still more or less do the right thing.

2014-01-14 16:39:53

by Morten Rasmussen

Subject: Re: [3/11] issue 3: No understanding of potential cpu capacity

On Mon, Jan 13, 2014 at 09:07:12PM +0000, Rafael J. Wysocki wrote:
> On Tuesday, January 07, 2014 04:19:39 PM Morten Rasmussen wrote:
> > To minimize energy it may sometimes be better to put waking tasks on
> > partially loaded cpus instead of powering up more cpus (particularly if
> > it implies powering up a new cluster/group of cpus with associated
> > caches). To make that call, information about the potential spare cycles
> > on the busy cpus is required.
>
> That generally is not the only thing that matters. There's one more factor
> called "responsiveness" that used to be popular in the past. It, roughly,
> is about how much time it takes for the system to respond to user actions,
> on the average.

Responsiveness is still very important. It is quite hard to control. CFS
doesn't consider latency. The only way to get the best responsiveness is
to go for best performance which comes at a high cost in energy.

IMHO, we are looking for ways to reduce energy without sacrificing too
much responsiveness, but we can't really guarantee the impact without
having latency awareness in the scheduler. I don't think it is feasible
to introduce that, so we have to do the best we can with whatever
heuristics we can come up with.

>
> > Currently, the CFS scheduler has no knowledge about frequency scaling.
> > Frequency scaling governors generally try to match the frequency to
> > the load, which means that the idle time has no absolute meaning. The
> > potential spare cpu capacity may be much higher than indicated by the
> > idle time if the cpu is running at a low P-state.
> >
> > The energy trade-off may justify putting another task on a loaded cpu
> > even if it causes a change to a higher P-state to handle the extra load.
> > Related issues are frequency (and cpu micro architecture) invariant task
> > load and power topology information, which are both needed to enable the
> > scheduler for energy-aware task placement. This is covered in more
> > detail in issue 5.
> >
> > The potential cpu capacity cannot be assumed to be constant as thermal
> > management may restrict the usage of high performance P-states
> > dynamically.
>
> That's correct. Moreover, all of the above seems to assume that we can get
> exact power numbers for all of the involved C-states and P-states. What if
> we can't?

None of the current load-tracking in the scheduler is exact or even
accurate. As long as we can get some hints it is better than nothing.

Morten

2014-01-14 16:51:31

by Peter Zijlstra

Subject: Re: [3/11] issue 3: No understanding of potential cpu capacity

On Tue, Jan 14, 2014 at 04:39:54PM +0000, Morten Rasmussen wrote:
> Responsiveness is still very important. It is quite hard to control. CFS
> doesn't consider latency. The only way to get the best responsiveness is
> to go for best performance which comes at a high cost in energy.

The big problem is that the normal unix task model doesn't cover this at
all -- nice isn't much of a knob.

There's ways in which you can adapt CFS to include such a measure
(search for the EEVDF patches), but I was kinda hoping that tasks that
really desire responsiveness could be made to use SCHED_DEADLINE or
such.

2014-01-16 11:16:41

by Morten Rasmussen

Subject: Re: [5/11] issue 5: Frequency and uarch invariant task load

On Wed, Jan 08, 2014 at 12:31:18PM +0000, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 04:19:41PM +0000, Morten Rasmussen wrote:
> > Potential solution: Frequency invariance has been proposed before [1]
> > where the task load is scaled by the cur/max freq ratio. Another
> > possibility is to use hardware counters if such are available on the
> > platform.
> >
> > [1] https://lkml.org/lkml/2013/4/16/289
>
> Right, I just had a look at those patches.. they're not horrible but I
> think they're missing a few opportunities.
>
> My main objection to them is that I think the newly introduced
> max_capacity is exactly what the current cpu_power thing is -- then
> again, I still haven't let the entire thing sink in well enough.

Yes, you can view it that way. The basic idea is to introduce a
potential compute capacity (max_capacity) and a current compute capacity
(curr_capacity). By scaling the load_contrib of a task by the
current/potential capacity ratio you get a frequency invariant task
load. The invariant task load enables a more sensible comparison between
the loads of tasks running on different cpus in different frequency
domains.

I would have said that max_capacity is equivalent to cpu_power if it
wasn't used for so many other things, as you point out below.

>
> Not to mention we need to fix some of the cpu_power abuse -- like the
> correlation to capacity, which as stated in previous emails should be
> sorted using utilization.

Agreed.

>
> So DVFS certainly makes sense, and would indeed be required in order to
> make sensible decisions in the face of P states. Even in the face of
> funny hardware like Intel which pretty much ignores whatever you tell it
> and does it own merry thing.
>
>
> A few random thoughts:
>
> - I think for SMP-nice we want to migrate from /max_capacity to
> /curr_capacity; because SMP-nice cares about 100% utilization
> regardless of the actual P state. If we're somehow forced into a
> lower P state (thermal or otherwise) fairness is best served by
> normalizing at the rate we're actually running at, not the potential
> maximal.

I see your point, but normalizing to /curr_capacity would break the
ability to compare tasks from different runqueues. When we pull tasks during
load-balance we have no idea what the load of the pulled tasks will be
on the new cpu. The source and target cpus may be at different P-states.

It would probably be better to adjust the max_capacity if we are forced
into a lower P-state for some reason.

>
> - We need to re-think SMT and turbo-bins in general; I think we can
> think of those two as the same effective thing. This does mean Intel
> chips will have a dual layer of this goo, and we can currently barely
> deal with the 1 SMT layer, let alone do something sensible with 2.
>
> To clarify, a single SMT thread will generally go 'faster' on its own
> since it doesn't need to compete with the other thread(s) for core
> resources, but together they might better utilize the core resources
> giving an over-all throughput win.
>
> Similar for turbo bins, a single core can go faster on its own since
> it doesn't have competition for energy and thermal constraints, but
> together cores can probably achieve greater throughput.
>
> So we need a better way to describe this capacity dependency and
> variability.

Agreed. It is my impression that SMT works fairly well using cpu_power,
but I don't see how we can further abuse cpu_power to optimize for turbo
boost.

We might as well add heterogeneous systems (big.LITTLE) to the list of
things that need better capacity management. When scheduling for
performance on big.LITTLE you want to utilize the big cpus first and
then use the little cpus. As pointed out in issue 6, cpu_power in its
current form cannot do this.

>
> I'm fairly sure ARM doesn't do SMT, but they certainly suffer from
> thermal caps and can thus have effective turbo bins, even though
> they're not explicit and magic like with Intel.

Thermal management is indeed important. It is up to the SoC implementor
how they deal with it, but I think most ARM systems expose all P-states,
including those that may only be used for shorter periods of time in
small form factor devices.

>
> And of course the honorary mention goes to Power7 which has
> asymmetric bins -- lets hope they fix it and nobody else things them
> a great idea.
>
> - For hardware without P state controls, or hardware that pretty much
> ignores them, we need means of obtaining the max and curr capacity.
>
> Intel has the APERF, MPERF registers which resp. count at actual
> frequency and fixed frequency. Using them is a bit tricky since
> APERF doesn't count when idle, but when filtering out the idle time
> they do provide a current performance ratio.
>
> From that we could obtain a max performance ratio by using a wide
> window max on the current value or somesuch.
>
> Again, SMT and turbo-bins will complicate matters..

+ heterogeneous systems (big.LITTLE)...

>
> Other CPUs that have magic P state control might not provide such
> registers which would require PMU resources, which would completely
> blow :/

For systems with multiple performance counters that are cheap to access
it may be worth it to dedicate a counter or two for use by the scheduler
if it can give significant improvements. But that has yet to be shown.

2014-01-20 16:32:59

by Pavel Machek

Subject: Re: [11/11] system 1: Saving energy using DVFS

On Tue 2014-01-07 16:19:47, Morten Rasmussen wrote:
> Most modern systems use DVFS to save power by slowing down computation
> throughput when less performance is necessary. The power/performance
> relation is platform specific. Some platforms may have better energy
> savings (energy per instruction) than others at low frequencies.
>
> To have something to relate to, here is an anonymized example based on
> a modern ARM platform:

And here is an anonymized example I pulled out of my hat:

Amount of anonymization     Usefulness of information
0.0                         1.0
0.5                         0.05
1.0                         0.0

Come on, you can surely do better than "trust me, it is modern". Now
we can't verify those numbers. And they don't make sense.

> Performance Energy/instruction
> 1.0 1.0
> 1.3 1.6
> 1.7 1.8
> 2.0 1.9
> 2.3 2.1
> 2.7 2.4
> 3.0 2.7
>
> Performance is frequency (~instruction issue rate) and
> energy/instruction is the energy cost of executing one (or a fixed
> number of instructions) at that level of performance (frequency). For
> this example, it costs 2.7x more energy per instruction to increase the
> performance from 1.0 to 3.0 (3x). That is, the amount of work
> (instructions) that can be done on one battery charge is reduced by 2.7x
> (~63%) if you run as fast as possible (3.0) compared to running at
> slowest frequency (1.0).

This very heavily depends on what you count into the total energy,
right? And it is very hard to argue with you when you have anonymized
your numbers.

Anyway, assuming a modern system, the low frequency should be circa
0.5GHz, with the high circa 1.5GHz. Do you claim that operation at 1.5GHz
takes 9x the power of 0.5GHz operation?

Do you count DRAM in the power consumption?

> To save energy, the higher frequencies should be avoided and only used
> when the application performance requirements can not be satisfied
> otherwise (e.g. spread tasks across more cpus if possible).

This is in very steep contrast with race-to-idle on the PCs.

> When considering the total system power it may save energy in some
> scenarios by running the cpu faster to allow other power hungry parts of
> the system to be shut down faster. However, this is highly platform and
> application dependent.

Aha. The devil is in the details. "I pulled random numbers out of the hat,
and they are wrong, but they are wrong in a platform specific way. And I
anonymized them for you so that you can't verify them".

Can we talk about a specific machine, please? You are talking Android all
the time, so pick one cellphone you care about, and provide real numbers...

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2014-01-20 16:49:30

by Pavel Machek

Subject: Re: [11/11] system 1: Saving energy using DVFS

Hi!

> To save energy, the higher frequencies should be avoided and only used
> when the application performance requirements can not be satisfied
> otherwise (e.g. spread tasks across more cpus if possible).

I argue this is untrue for any task where user waits for its
completion with screen on. (And that's quite important subset).

Lets take Nokia n900 as an example.

(source http://wiki.maemo.org/N900_Hardware_Power_Consumption)

Sleeping CPU: 2mA
Screen on: 230mA
CPU loaded: 250mA

Now, let's believe your numbers and pretend the system can operate at 33%
of speed with 11% power consumption.

Let's take a task that takes 10 seconds at max frequency:

~ 10s * 470mA = 4700mAs

You suggest running at 33% speed, instead; that means 30 seconds at the
low frequency.

CPU on low: 25mA (assumed).

~ 30s * 255mA = 7650mAs

Hmm. So race to idle is good thing on Intel machines, and it is good
thing on ARM design I have access to.

And you even acknowledge it here, right:

> When considering the total system power it may save energy in some
> scenarios by running the cpu faster to allow other power hungry parts of
> the system to be shut down faster. However, this is highly platform and
> application dependent.

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2014-01-20 17:11:07

by Catalin Marinas

Subject: Re: [11/11] system 1: Saving energy using DVFS

On Mon, Jan 20, 2014 at 04:49:26PM +0000, Pavel Machek wrote:
> > To save energy, the higher frequencies should be avoided and only used
> > when the application performance requirements can not be satisfied
> > otherwise (e.g. spread tasks across more cpus if possible).
>
> I argue this is untrue for any task where user waits for its
> completion with screen on. (And that's quite important subset).
>
> Lets take Nokia n900 as an example.
>
> (source http://wiki.maemo.org/N900_Hardware_Power_Consumption)
>
> Sleeping CPU: 2mA
> Screen on: 230mA
> CPU loaded: 250mA
>
> Now, lets believe your numbers and pretend system can operate at 33%
> of speed with 11% power consumption.
>
> Lets take task that takes 10 seconds on max frequency:
>
> ~ 10s * 470mA = 4700mAs
>
> You suggest running at 33% speed, instead; that means 30 seconds on
> low requency.
>
> CPU on low: 25mA (assumed).
>
> ~ 30s * 255mA = 7650mAs
>
> Hmm. So race to idle is good thing on Intel machines, and it is good
> thing on ARM design I have access to.

Race to idle doesn't mean that the screen goes off as well. Let's say
the screen stays on for 1 min and the CPU needs to be running for 10s
over this minute, in the first case you have:

10s * 250mA + 60s * 230mA = 16300mAs

in the second case you have:

30s * 25mA + 60s * 230mA = 14550mAs

That's a 1750mAs difference. There are of course other parts drawing
current but simple things like the above really make a difference in the
mobile space, both in terms of battery and thermal budget.
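
The same comparison in code form (a sketch that just re-runs the numbers
used in this thread; charge is in mAs and the screen is on for the full
60 s in both cases):

/* Sketch: battery charge (mAs) for race-to-idle vs. running slower,
 * with the screen on for the whole 60 s either way. The currents are
 * the ones quoted in this thread. */
#include <stdio.h>

static double charge_mas(double cpu_s, double cpu_ma,
			 double screen_s, double screen_ma)
{
	return cpu_s * cpu_ma + screen_s * screen_ma;
}

int main(void)
{
	/* race to idle: 10 s of CPU at 250 mA */
	printf("fast: %.0f mAs\n", charge_mas(10, 250, 60, 230));
	/* run slower: 30 s of CPU at 25 mA */
	printf("slow: %.0f mAs\n", charge_mas(30,  25, 60, 230));
	return 0;
}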

--
Catalin

2014-01-20 17:18:27

by Catalin Marinas

Subject: Re: [11/11] system 1: Saving energy using DVFS

On Mon, Jan 20, 2014 at 05:10:29PM +0000, Catalin Marinas wrote:
> On Mon, Jan 20, 2014 at 04:49:26PM +0000, Pavel Machek wrote:
> > > To save energy, the higher frequencies should be avoided and only used
> > > when the application performance requirements can not be satisfied
> > > otherwise (e.g. spread tasks across more cpus if possible).
> >
> > I argue this is untrue for any task where user waits for its
> > completion with screen on. (And that's quite important subset).
> >
> > Lets take Nokia n900 as an example.
> >
> > (source http://wiki.maemo.org/N900_Hardware_Power_Consumption)
> >
> > Sleeping CPU: 2mA
> > Screen on: 230mA
> > CPU loaded: 250mA
> >
> > Now, lets believe your numbers and pretend system can operate at 33%
> > of speed with 11% power consumption.
> >
> > Lets take task that takes 10 seconds on max frequency:
> >
> > ~ 10s * 470mA = 4700mAs
> >
> > You suggest running at 33% speed, instead; that means 30 seconds on
> > low requency.
> >
> > CPU on low: 25mA (assumed).
> >
> > ~ 30s * 255mA = 7650mAs
> >
> > Hmm. So race to idle is good thing on Intel machines, and it is good
> > thing on ARM design I have access to.
>
> Race to idle doesn't mean that the screen goes off as well. Let's say
> the screen stays on for 1 min and the CPU needs to be running for 10s
> over this minute, in the first case you have:
>
> 10s & 250mA + 60s * 230mA = 16300mAs
>
> in the second case you have:
>
> 30s * 25mA + 60s * 230mA = 14550mAs
>
> That's a 1750mAs difference. There are of course other parts drawing
> current but simple things like the above really make a difference in the
> mobile space, both in terms of battery and thermal budget.

BTW, the proper way to calculate this is to use the energy rather than
current x time. This would be J = Ohm * A^2 * s = V^2 / Ohm * s (so the
impact of the current is even bigger).

--
Catalin

2014-01-20 17:47:48

by Pavel Machek

Subject: Re: [11/11] system 1: Saving energy using DVFS

On Mon 2014-01-20 17:17:52, Catalin Marinas wrote:
> On Mon, Jan 20, 2014 at 05:10:29PM +0000, Catalin Marinas wrote:
> > On Mon, Jan 20, 2014 at 04:49:26PM +0000, Pavel Machek wrote:
> > > > To save energy, the higher frequencies should be avoided and only used
> > > > when the application performance requirements can not be satisfied
> > > > otherwise (e.g. spread tasks across more cpus if possible).
> > >
> > > I argue this is untrue for any task where user waits for its
> > > completion with screen on. (And that's quite important subset).
> > >
> > > Lets take Nokia n900 as an example.
> > >
> > > (source http://wiki.maemo.org/N900_Hardware_Power_Consumption)
> > >
> > > Sleeping CPU: 2mA
> > > Screen on: 230mA
> > > CPU loaded: 250mA
> > >
> > > Now, lets believe your numbers and pretend system can operate at 33%
> > > of speed with 11% power consumption.
> > >
> > > Lets take task that takes 10 seconds on max frequency:
> > >
> > > ~ 10s * 470mA = 4700mAs
> > >
> > > You suggest running at 33% speed, instead; that means 30 seconds on
> > > low requency.
> > >
> > > CPU on low: 25mA (assumed).
> > >
> > > ~ 30s * 255mA = 7650mAs
> > >
> > > Hmm. So race to idle is good thing on Intel machines, and it is good
> > > thing on ARM design I have access to.
> >
> > Race to idle doesn't mean that the screen goes off as well. Let's say
> > the screen stays on for 1 min and the CPU needs to be running for 10s
> > over this minute, in the first case you have:
> >
> > 10s & 250mA + 60s * 230mA = 16300mAs
> >
> > in the second case you have:
> >
> > 30s * 25mA + 60s * 230mA = 14550mAs
> >
> > That's a 1750mAs difference. There are of course other parts drawing
> > current but simple things like the above really make a difference in the
> > mobile space, both in terms of battery and thermal budget.
>
> BTW, the proper way to calculate this is to use the energy rather than
> current x time. This would be J = Ohm * A^2 * s = V^2 / Ohm * s (so the
> impact of the current is even bigger).

You are claiming that energy is proportional to current squared?

I stand by my numbers. Energy is proportional to the values I quoted,
provided constant voltage.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2014-01-20 17:54:40

by Pavel Machek

Subject: Re: [11/11] system 1: Saving energy using DVFS

On Mon 2014-01-20 17:10:29, Catalin Marinas wrote:
> On Mon, Jan 20, 2014 at 04:49:26PM +0000, Pavel Machek wrote:
> > > To save energy, the higher frequencies should be avoided and only used
> > > when the application performance requirements can not be satisfied
> > > otherwise (e.g. spread tasks across more cpus if possible).
> >
> > I argue this is untrue for any task where user waits for its
> > completion with screen on. (And that's quite important subset).
> >
> > Lets take Nokia n900 as an example.
> >
> > (source http://wiki.maemo.org/N900_Hardware_Power_Consumption)
> >
> > Sleeping CPU: 2mA
> > Screen on: 230mA
> > CPU loaded: 250mA
> >
> > Now, lets believe your numbers and pretend system can operate at 33%
> > of speed with 11% power consumption.
> >
> > Lets take task that takes 10 seconds on max frequency:
> >
> > ~ 10s * 470mA = 4700mAs
> >
> > You suggest running at 33% speed, instead; that means 30 seconds on
> > low requency.
> >
> > CPU on low: 25mA (assumed).
> >
> > ~ 30s * 255mA = 7650mAs
> >
> > Hmm. So race to idle is good thing on Intel machines, and it is good
> > thing on ARM design I have access to.
>
> Race to idle doesn't mean that the screen goes off as well. Let's say
> the screen stays on for 1 min and the CPU needs to be running for 10s
> over this minute, in the first case you have:

No, it does not. I just assumed user is continuing to use his
machine. Obviously, waiting 60 seconds with screen on will make the
difference look smaller. But your solution still means user has to
wait longer _and_ you consume more battery doing so.

And this is for any task where user waits for result with screen
on. Like rendering a webpage. Like opening settings screen. Like
installing application.

There are not too many background tasks on a cellphone.

But hey, maybe you are right and running at lowest possible frequency
is right. Please provide concrete numbers like I did.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2014-01-20 18:03:56

by Catalin Marinas

Subject: Re: [11/11] system 1: Saving energy using DVFS

On Mon, Jan 20, 2014 at 05:47:45PM +0000, Pavel Machek wrote:
> On Mon 2014-01-20 17:17:52, Catalin Marinas wrote:
> > On Mon, Jan 20, 2014 at 05:10:29PM +0000, Catalin Marinas wrote:
> > > On Mon, Jan 20, 2014 at 04:49:26PM +0000, Pavel Machek wrote:
> > > > > To save energy, the higher frequencies should be avoided and only used
> > > > > when the application performance requirements can not be satisfied
> > > > > otherwise (e.g. spread tasks across more cpus if possible).
> > > >
> > > > I argue this is untrue for any task where user waits for its
> > > > completion with screen on. (And that's quite important subset).
> > > >
> > > > Lets take Nokia n900 as an example.
> > > >
> > > > (source http://wiki.maemo.org/N900_Hardware_Power_Consumption)
> > > >
> > > > Sleeping CPU: 2mA
> > > > Screen on: 230mA
> > > > CPU loaded: 250mA
> > > >
> > > > Now, lets believe your numbers and pretend system can operate at 33%
> > > > of speed with 11% power consumption.
> > > >
> > > > Lets take task that takes 10 seconds on max frequency:
> > > >
> > > > ~ 10s * 470mA = 4700mAs
> > > >
> > > > You suggest running at 33% speed, instead; that means 30 seconds on
> > > > low requency.
> > > >
> > > > CPU on low: 25mA (assumed).
> > > >
> > > > ~ 30s * 255mA = 7650mAs
> > > >
> > > > Hmm. So race to idle is good thing on Intel machines, and it is good
> > > > thing on ARM design I have access to.
> > >
> > > Race to idle doesn't mean that the screen goes off as well. Let's say
> > > the screen stays on for 1 min and the CPU needs to be running for 10s
> > > over this minute, in the first case you have:
> > >
> > > 10s & 250mA + 60s * 230mA = 16300mAs
> > >
> > > in the second case you have:
> > >
> > > 30s * 25mA + 60s * 230mA = 14550mAs
> > >
> > > That's a 1750mAs difference. There are of course other parts drawing
> > > current but simple things like the above really make a difference in the
> > > mobile space, both in terms of battery and thermal budget.
> >
> > BTW, the proper way to calculate this is to use the energy rather than
> > current x time. This would be J = Ohm * A^2 * s = V^2 / Ohm * s (so the
> > impact of the current is even bigger).
>
> You are claiming that energy is proportional to current squared?
>
> I stand by numbers. Energy is proportional to values I quoted,
> provided constant voltage.

The big advantage of frequency scaling is that you can scale down the
voltage, making the power proportional to the voltage squared (or
current squared for a constant resistance).
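
A rough sketch of why the voltage term dominates, using the standard
dynamic-power relation P = C * V^2 * f (the capacitance and the
voltage/frequency operating points below are made up for illustration):

/* Sketch: dynamic power P = C * V^2 * f for two made-up operating
 * points, showing that a lower frequency combined with a lower voltage
 * cuts power by much more than the frequency ratio alone. */
#include <stdio.h>

static double dyn_power(double cap, double volt, double freq_hz)
{
	return cap * volt * volt * freq_hz;
}

int main(void)
{
	double cap = 1e-9;	/* effective switched capacitance, F (made up) */

	double p_high = dyn_power(cap, 1.2, 1.5e9);	/* 1.2 V @ 1.5 GHz */
	double p_low  = dyn_power(cap, 0.9, 0.5e9);	/* 0.9 V @ 0.5 GHz */

	printf("high: %.2f W  low: %.2f W  ratio: %.1fx\n",
	       p_high, p_low, p_high / p_low);
	return 0;
}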

--
Catalin

2014-01-20 18:12:14

by Pavel Machek

Subject: Re: [11/11] system 1: Saving energy using DVFS


> > Sleeping CPU: 2mA
> > Screen on: 230mA
> > CPU loaded: 250mA
> >
> > Now, lets believe your numbers and pretend system can operate at 33%
> > of speed with 11% power consumption.
> >
> > Lets take task that takes 10 seconds on max frequency:
> >
> > ~ 10s * 470mA = 4700mAs
> >
> > You suggest running at 33% speed, instead; that means 30 seconds on
> > low requency.
> >
> > CPU on low: 25mA (assumed).
> >
> > ~ 30s * 255mA = 7650mAs
> >
> > Hmm. So race to idle is good thing on Intel machines, and it is good
> > thing on ARM design I have access to.
>
> Race to idle doesn't mean that the screen goes off as well. Let's say
> the screen stays on for 1 min and the CPU needs to be running for 10s
> over this minute, in the first case you have:
>
> 10s & 250mA + 60s * 230mA = 16300mAs
>
> in the second case you have:
>
> 30s * 25mA + 60s * 230mA = 14550mAs
>
> That's a 1750mAs difference. There are of course other parts drawing
> current but simple things like the above really make a difference in the
> mobile space, both in terms of battery and thermal budget.

Aha, I noticed the values are now the other way around. [And notice
that if the user _does_ lock/turn off the screen after the operation,
the difference between power consumptions is a factor of two. People do
turn off screens before putting the phone back in their pocket.]

You are right that as long as the user does _not_ wait for the computation
result, running at a low frequency might make sense. That may be true on a
cellphone so fast that all the actions are "instant". I have yet to
see such a cellphone. That probably means that staying at a low frequency
normally and going to a high one after the cpu has been busy for 100 msec
or so is the right thing: if the cpu is busy for 100 msec, it probably
means the user is waiting for the result.

But it depends on the numbers you did not tell us. I'm pretty sure the
N900 does _not_ have 11% power consumption at 33% performance; I just
assumed so for the sake of argument.

So, really, details are needed.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2014-01-20 18:17:24

by Catalin Marinas

Subject: Re: [11/11] system 1: Saving energy using DVFS

On Mon, Jan 20, 2014 at 05:54:32PM +0000, Pavel Machek wrote:
> On Mon 2014-01-20 17:10:29, Catalin Marinas wrote:
> > On Mon, Jan 20, 2014 at 04:49:26PM +0000, Pavel Machek wrote:
> > > > To save energy, the higher frequencies should be avoided and only used
> > > > when the application performance requirements can not be satisfied
> > > > otherwise (e.g. spread tasks across more cpus if possible).
> > >
> > > I argue this is untrue for any task where user waits for its
> > > completion with screen on. (And that's quite important subset).
> > >
> > > Lets take Nokia n900 as an example.
> > >
> > > (source http://wiki.maemo.org/N900_Hardware_Power_Consumption)
> > >
> > > Sleeping CPU: 2mA
> > > Screen on: 230mA
> > > CPU loaded: 250mA
> > >
> > > Now, lets believe your numbers and pretend system can operate at 33%
> > > of speed with 11% power consumption.
> > >
> > > Lets take task that takes 10 seconds on max frequency:
> > >
> > > ~ 10s * 470mA = 4700mAs
> > >
> > > You suggest running at 33% speed, instead; that means 30 seconds on
> > > low requency.
> > >
> > > CPU on low: 25mA (assumed).
> > >
> > > ~ 30s * 255mA = 7650mAs
> > >
> > > Hmm. So race to idle is good thing on Intel machines, and it is good
> > > thing on ARM design I have access to.
> >
> > Race to idle doesn't mean that the screen goes off as well. Let's say
> > the screen stays on for 1 min and the CPU needs to be running for 10s
> > over this minute, in the first case you have:
>
> No, it does not. I just assumed user is continuing to use his
> machine. Obviously, waiting 60 seconds with screen on will make the
> difference look smaller. But your solution still means user has to
> wait longer _and_ you consume more battery doing so.
>
> And this is for any task where user waits for result with screen
> on. Like rendering a webpage. Like opening settings screen. Like
> installing application.

Page rendering should make very little difference to power since the
reading (screen on) time is much larger than the rendering (CPU) time.
But what I'm pointing at with the 10s/60s ratio are things like games
or video playback where the CPU is running for 1/6 of the time and idle
for the other 5/6. We get better energy figures by changing the run
time to 3/6 and the idle time to 3/6.
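
To make the arithmetic explicit, a quick Python sketch reusing the
currents quoted earlier in the thread (screen 230mA, loaded CPU 250mA,
and the assumed 25mA at the low operating point; all of this is
illustrative, not measured on any specific platform):

SCREEN_MA = 230
CPU_HIGH_MA = 250
CPU_LOW_MA = 25      # assumed low-frequency cpu current
WINDOW_S = 60        # one minute of screen-on use

def charge_mas(cpu_ma, busy_s):
    # the cpu draws current while busy, the screen for the whole window
    return busy_s * cpu_ma + WINDOW_S * SCREEN_MA

race_to_idle = charge_mas(CPU_HIGH_MA, busy_s=10)   # 1/6 busy at max frequency
low_frequency = charge_mas(CPU_LOW_MA, busy_s=30)   # 3/6 busy at low frequency
print(race_to_idle, low_frequency)                  # 16300 vs 14550 mAs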

> There are not too many background tasks on a cellphone.

For sleep time, screen off etc. there are some background tasks, but
here the run-time doesn't matter much; it's probably more expensive to
take CPUs out of deep sleep states. What we want to optimise here is
which CPU to wake (e.g. a little vs a big one).

> But hey, maybe you are right and running at lowest possible frequency
> is right. Please provide concrete numbers like I did.

They've been anonymised (for many reasons) and you have the right not
to trust them. But do you really think we are making up the numbers? We
have a great interest in the Linux scheduler working efficiently on ARM
platforms rather than optimising it for non-existent scenarios. If at
some point this argument becomes a blocking factor, I'm sure we can
share the real numbers with the relevant parties under an NDA.

--
Catalin

2014-01-20 18:25:41

by Sebastian Reichel

[permalink] [raw]
Subject: Re: [11/11] system 1: Saving energy using DVFS

On Mon, Jan 20, 2014 at 06:54:32PM +0100, Pavel Machek wrote:
> On Mon 2014-01-20 17:10:29, Catalin Marinas wrote:
> > On Mon, Jan 20, 2014 at 04:49:26PM +0000, Pavel Machek wrote:
> > > > To save energy, the higher frequencies should be avoided and only used
> > > > when the application performance requirements can not be satisfied
> > > > otherwise (e.g. spread tasks across more cpus if possible).
> > >
> > > I argue this is untrue for any task where user waits for its
> > > completion with screen on. (And that's quite important subset).
> > >
> > > Lets take Nokia n900 as an example.
> > >
> > > (source http://wiki.maemo.org/N900_Hardware_Power_Consumption)
> > >
> > > Sleeping CPU: 2mA
> > > Screen on: 230mA
> > > CPU loaded: 250mA
> > >
> > > Now, lets believe your numbers and pretend system can operate at 33%
> > > of speed with 11% power consumption.
> > >
> > > Lets take task that takes 10 seconds on max frequency:
> > >
> > > ~ 10s * 470mA = 4700mAs
> > >
> > > You suggest running at 33% speed, instead; that means 30 seconds on
> > > low requency.
> > >
> > > CPU on low: 25mA (assumed).
> > >
> > > ~ 30s * 255mA = 7650mAs
> > >
> > > Hmm. So race to idle is good thing on Intel machines, and it is good
> > > thing on ARM design I have access to.
> >
> > Race to idle doesn't mean that the screen goes off as well. Let's say
> > the screen stays on for 1 min and the CPU needs to be running for 10s
> > over this minute, in the first case you have:
>
> No, it does not. I just assumed user is continuing to use his
> machine. Obviously, waiting 60 seconds with screen on will make the
> difference look smaller. But your solution still means user has to
> wait longer _and_ you consume more battery doing so.
>
> And this is for any task where user waits for result with screen
> on. Like rendering a webpage. Like opening settings screen. Like
> installing application.
>
> There are not too many background tasks on a cellphone.
>
> But hey, maybe you are right and running at lowest possible frequency
> is right. Please provide concrete numbers like I did.

So what about using the display status information for power
management? Basically, shouldn't always using the lowest frequency be
OK on phones if the display is disabled?
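
A toy userspace Python illustration of what I mean, capping the
frequency through the standard cpufreq sysfs interface (the cap value
and the display-state plumbing are assumptions; a real product would
hook this up elsewhere, if at all):

import glob

LOW_CAP_KHZ = "500000"   # assumed low operating point

def set_max_freq(khz):
    for path in glob.glob(
            "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_max_freq"):
        with open(path, "w") as f:
            f.write(khz)

def on_display_change(display_on, default_max_khz):
    # restore the full frequency range when the screen comes back on
    set_max_freq(default_max_khz if display_on else LOW_CAP_KHZ)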

-- Sebastian



2014-01-20 19:15:52

by Pavel Machek

[permalink] [raw]
Subject: Re: [11/11] system 1: Saving energy using DVFS

On Mon 2014-01-20 18:03:22, Catalin Marinas wrote:
> On Mon, Jan 20, 2014 at 05:47:45PM +0000, Pavel Machek wrote:
> > On Mon 2014-01-20 17:17:52, Catalin Marinas wrote:
> > > On Mon, Jan 20, 2014 at 05:10:29PM +0000, Catalin Marinas wrote:
> > > > On Mon, Jan 20, 2014 at 04:49:26PM +0000, Pavel Machek wrote:
> > > > > > To save energy, the higher frequencies should be avoided and only used
> > > > > > when the application performance requirements can not be satisfied
> > > > > > otherwise (e.g. spread tasks across more cpus if possible).
> > > > >
> > > > > I argue this is untrue for any task where user waits for its
> > > > > completion with screen on. (And that's quite important subset).
> > > > >
> > > > > Lets take Nokia n900 as an example.
> > > > >
> > > > > (source http://wiki.maemo.org/N900_Hardware_Power_Consumption)
> > > > >
> > > > > Sleeping CPU: 2mA
> > > > > Screen on: 230mA
> > > > > CPU loaded: 250mA
> > > > >
> > > > > Now, lets believe your numbers and pretend system can operate at 33%
> > > > > of speed with 11% power consumption.
> > > > >
> > > > > Lets take task that takes 10 seconds on max frequency:
> > > > >
> > > > > ~ 10s * 470mA = 4700mAs
> > > > >
> > > > > You suggest running at 33% speed, instead; that means 30 seconds on
> > > > > low requency.
> > > > >
> > > > > CPU on low: 25mA (assumed).
> > > > >
> > > > > ~ 30s * 255mA = 7650mAs
> > > > >
> > > > > Hmm. So race to idle is good thing on Intel machines, and it is good
> > > > > thing on ARM design I have access to.
> > > >
> > > > Race to idle doesn't mean that the screen goes off as well. Let's say
> > > > the screen stays on for 1 min and the CPU needs to be running for 10s
> > > > over this minute, in the first case you have:
> > > >
> > > > 10s & 250mA + 60s * 230mA = 16300mAs
> > > >
> > > > in the second case you have:
> > > >
> > > > 30s * 25mA + 60s * 230mA = 14550mAs
> > > >
> > > > That's a 1750mAs difference. There are of course other parts drawing
> > > > current but simple things like the above really make a difference in the
> > > > mobile space, both in terms of battery and thermal budget.
> > >
> > > BTW, the proper way to calculate this is to use the energy rather than
> > > current x time. This would be J = Ohm * A^2 * s = V^2 / Ohm * s (so the
> > > impact of the current is even bigger).
> >
> > You are claiming that energy is proportional to current squared?
> >
> > I stand by numbers. Energy is proportional to values I quoted,
> > provided constant voltage.
>
> The big advantage of frequency scaling is that you can scale down the
> voltage, making the power proportional to the voltage squared (or
> current squared for a constant resistance).

I was talking about battery voltage; so multiply my numbers by 3.6V and
you'll get Joules.

Yes, I know how voltage scaling works; that's why you can get 11% power
consumption for 33% of the work done, thank you.

But no, my cell phone is not a pure resistor, which is why your
quotation of Ohm's law surprised me.

Can you point out problem with my numbers or not?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2014-01-20 20:44:57

by Pavel Machek

[permalink] [raw]
Subject: Re: [11/11] system 1: Saving energy using DVFS

Hi!

> > > Race to idle doesn't mean that the screen goes off as well. Let's say
> > > the screen stays on for 1 min and the CPU needs to be running for 10s
> > > over this minute, in the first case you have:
> >
> > No, it does not. I just assumed user is continuing to use his
> > machine. Obviously, waiting 60 seconds with screen on will make the
> > difference look smaller. But your solution still means user has to
> > wait longer _and_ you consume more battery doing so.
> >
> > And this is for any task where user waits for result with screen
> > on. Like rendering a webpage. Like opening settings screen. Like
> > installing application.
>
> Page rendering should make very little difference to power since the
> reading (screen on) time is much larger than the rendering (CPU)
> time.

For some uses, yes; for some uses (searching for bus timetables,
displaying the weather) not necessarily. And I suspect the CPU
consumption makes only a small difference to the total power, anyway...

> But what I'm pointing at for 10s/60s ratios are thing like games or
> video playing where the CPU is running for 1/6 of the time and idle for
> the other 5/6. We get better energy figures by changing the run time to
> 3/6 and idle at 3/6.

Better energy figures for complete system consumption, on a phone-type
device that can be bought in a shop?

> > But hey, maybe you are right and running at lowest possible frequency
> > is right. Please provide concrete numbers like I did.
>
> They've been anonymised (for many reasons) and you have the right not to
> trust them. But do you really think we are making up the numbers? We

"Here is power consumption of unspecified part of machine in
unspecified units on machine of unspecified type. Trust us our patches
improve it in unspecified workload". Why should I trust you?

> have a great interest in the Linux scheduler working efficiently on the
> ARM platforms rather than optimising it for non-existent scenarios. If
> at some point this argument becomes a blocking factor, I'm sure we can
> share the real numbers with the relevant parties under an NDA.

I'm sure you can just buy a Samsung S4 in the nearest shop, and you can
probably find an ammeter on site... Then perhaps people can reproduce
your results and we can have a useful discussion. This is relevant to
production hardware, right?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2014-01-21 11:19:57

by Catalin Marinas

[permalink] [raw]
Subject: Re: [11/11] system 1: Saving energy using DVFS

On Mon, Jan 20, 2014 at 07:15:46PM +0000, Pavel Machek wrote:
> On Mon 2014-01-20 18:03:22, Catalin Marinas wrote:
> > On Mon, Jan 20, 2014 at 05:47:45PM +0000, Pavel Machek wrote:
> > > On Mon 2014-01-20 17:17:52, Catalin Marinas wrote:
> > > > On Mon, Jan 20, 2014 at 05:10:29PM +0000, Catalin Marinas wrote:
> > > > > On Mon, Jan 20, 2014 at 04:49:26PM +0000, Pavel Machek wrote:
> > > > > > > To save energy, the higher frequencies should be avoided and only used
> > > > > > > when the application performance requirements can not be satisfied
> > > > > > > otherwise (e.g. spread tasks across more cpus if possible).
> > > > > >
> > > > > > I argue this is untrue for any task where user waits for its
> > > > > > completion with screen on. (And that's quite important subset).
> > > > > >
> > > > > > Lets take Nokia n900 as an example.
> > > > > >
> > > > > > (source http://wiki.maemo.org/N900_Hardware_Power_Consumption)
> > > > > >
> > > > > > Sleeping CPU: 2mA
> > > > > > Screen on: 230mA
> > > > > > CPU loaded: 250mA
> > > > > >
> > > > > > Now, lets believe your numbers and pretend system can operate at 33%
> > > > > > of speed with 11% power consumption.
> > > > > >
> > > > > > Lets take task that takes 10 seconds on max frequency:
> > > > > >
> > > > > > ~ 10s * 470mA = 4700mAs
> > > > > >
> > > > > > You suggest running at 33% speed, instead; that means 30 seconds on
> > > > > > low requency.
> > > > > >
> > > > > > CPU on low: 25mA (assumed).
> > > > > >
> > > > > > ~ 30s * 255mA = 7650mAs
> > > > > >
> > > > > > Hmm. So race to idle is good thing on Intel machines, and it is good
> > > > > > thing on ARM design I have access to.
> > > > >
> > > > > Race to idle doesn't mean that the screen goes off as well. Let's say
> > > > > the screen stays on for 1 min and the CPU needs to be running for 10s
> > > > > over this minute, in the first case you have:
> > > > >
> > > > > 10s & 250mA + 60s * 230mA = 16300mAs
> > > > >
> > > > > in the second case you have:
> > > > >
> > > > > 30s * 25mA + 60s * 230mA = 14550mAs
> > > > >
> > > > > That's a 1750mAs difference. There are of course other parts drawing
> > > > > current but simple things like the above really make a difference in the
> > > > > mobile space, both in terms of battery and thermal budget.
> > > >
> > > > BTW, the proper way to calculate this is to use the energy rather than
> > > > current x time. This would be J = Ohm * A^2 * s = V^2 / Ohm * s (so the
> > > > impact of the current is even bigger).
> > >
> > > You are claiming that energy is proportional to current squared?
> > >
> > > I stand by numbers. Energy is proportional to values I quoted,
> > > provided constant voltage.
> >
> > The big advantage of frequency scaling is that you can scale down the
> > voltage, making the power proportional to the voltage squared (or
> > current squared for a constant resistance).
>
> I was talking battery voltage; so multiple my numbers by 3.6V and
> you'll get Joules.

That's where we were talking about different things. What I was
referring to was the actual current used by the CPU, which is different
from the current drawn from the battery for that CPU (because of
voltage translation). But with a low-loss voltage regulator, we can
pretend that the corresponding power used by the CPU is the same at the
battery level.

> Can you point out problem with my numbers or not?

I agree with your equivalent battery current for the CPU (a minor
thing: I get about 12% power consumption at 33% performance from
Morten's numbers, but that's irrelevant).

The other thing I didn't agree with was the screen on vs race to idle
but I'll follow up separately.

--
Catalin

2014-01-21 11:43:00

by Catalin Marinas

[permalink] [raw]
Subject: Re: [11/11] system 1: Saving energy using DVFS

On Mon, Jan 20, 2014 at 06:12:08PM +0000, Pavel Machek wrote:
> > > Sleeping CPU: 2mA
> > > Screen on: 230mA
> > > CPU loaded: 250mA
> > >
> > > Now, lets believe your numbers and pretend system can operate at 33%
> > > of speed with 11% power consumption.
> > >
> > > Lets take task that takes 10 seconds on max frequency:
> > >
> > > ~ 10s * 470mA = 4700mAs
> > >
> > > You suggest running at 33% speed, instead; that means 30 seconds on
> > > low requency.
> > >
> > > CPU on low: 25mA (assumed).
> > >
> > > ~ 30s * 255mA = 7650mAs
> > >
> > > Hmm. So race to idle is good thing on Intel machines, and it is good
> > > thing on ARM design I have access to.
> >
> > Race to idle doesn't mean that the screen goes off as well. Let's say
> > the screen stays on for 1 min and the CPU needs to be running for 10s
> > over this minute, in the first case you have:
> >
> > 10s & 250mA + 60s * 230mA = 16300mAs
> >
> > in the second case you have:
> >
> > 30s * 25mA + 60s * 230mA = 14550mAs
> >
> > That's a 1750mAs difference. There are of course other parts drawing
> > current but simple things like the above really make a difference in the
> > mobile space, both in terms of battery and thermal budget.
>
> Aha, I noticed the values are now the other way around. [And notice
> that if user _does_ lock/turn off the screen after the operation,
> difference between power consumptions is factor of two. People do turn
> off screens before putting phone back in pocket.]

It depends on the use-case; that's why the problem is so complicated.
Race-to-idle may work well if you are just checking bus timetables, but
not if you are watching video or listening to music (the latter with
the screen off).

> You are right that as long as user does _not_ wait for the computation
> result, running at low frequency might make sense. That may be true on
> cellphone so fast that all the actions are "instant". I have yet to
> see such cellphone. That probably means that staying on low frequency
> normally and going to high after cpu is busy for 100msec or so is
> right thing: if cpu is busy for 100msec, it probably means user is
> waiting for the result.

I'm talking about use-cases where a task (or multiple threads) is
running and only loading the CPU partially (audio or video playback).
Here you have an average number of instructions to execute per decoded
frame within a certain time. Once the frame is decoded, the CPU can go
idle, so you can choose whether to race to idle or to run at a lower
frequency (and lower energy for the same number of frame-decoding
instructions) with less idle time. There are modern platforms where the
latter behaviour is more efficient.
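
A rough Python sketch of that trade-off, using two of the anonymised
(performance, energy/instruction) points from Morten's table
(1.0 -> 1.0 and 3.0 -> 2.7); the idle power and the assumption that a
frame just fits its period at performance 1.0 are made up for
illustration only:

IDLE_POWER = 0.05   # assumed relative cpu idle power while waiting for the next frame

def frame_energy(perf, energy_per_instr, work=1.0, period=1.0):
    # work=1.0 means the frame needs the whole period at performance 1.0;
    # at higher performance the cpu finishes early and idles for the rest.
    busy_fraction = min(work / perf, 1.0)
    idle_fraction = 1.0 - busy_fraction
    return work * energy_per_instr + idle_fraction * period * IDLE_POWER

race_to_idle = frame_energy(perf=3.0, energy_per_instr=2.7)    # ~2.73
low_frequency = frame_energy(perf=1.0, energy_per_instr=1.0)   # 1.00
print(race_to_idle, low_frequency)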

I would really like race to idle to be the right answer in all cases;
it would simplify the kernel and we could just remove cpufreq, always
running the CPUs at max frequency. But so far I don't see Intel
ignoring this problem either: they keep developing a P-state driver
which changes the P-states based on average CPU load.

(we can complicate the problem further by considering memory-bound vs
CPU-bound threads)

> But it depends on the numbers you did not tell us. I'm pretty sure
> N900 does _not_ have 11% power consuption at 33% performance; I just
> assumed so for sake of argument.
>
> So, really, details are needed.

If that's the only issue to be addressed, I'm happy to ignore frequency
scaling initially and focus on idle. But since people still do
frequency scaling and it interferes with the scheduler, we have to (1)
normalise the task load as much as possible (frequency-invariant load
tracking) and (2) make the scheduler power model take into account the
cost of placing tasks on CPUs at different P-states. With such a
simplification we can leave the P-state selection to cpufreq and see
how far we can get in terms of power efficiency.
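
A minimal Python sketch of those two points, with made-up per-P-state
power numbers and ignoring capacity differences between cpu types
(which the real model would also have to handle); it only illustrates
the idea, not the actual scheduler code:

# assumed per-cpu P-state tables: frequency (kHz) -> relative power
POWER = {
    "little": {500_000: 1.0, 1_000_000: 2.5},
    "big":    {800_000: 3.0, 1_800_000: 9.0},
}
MAX_FREQ = {"little": 1_000_000, "big": 1_800_000}

def freq_invariant_load(busy_fraction, cur_freq, max_freq):
    # (1) 50% busy at half speed is reported as 25% load at full speed
    return busy_fraction * cur_freq / max_freq

def placement_energy(cpu, freq, inv_load, period=1.0):
    # (2) rough energy of running that load on `cpu` at `freq`: the same
    # work needs a larger busy fraction at a lower frequency
    busy_fraction = inv_load * MAX_FREQ[cpu] / freq
    return POWER[cpu][freq] * busy_fraction * period

load = freq_invariant_load(0.5, cur_freq=500_000, max_freq=MAX_FREQ["little"])
for cpu, freqs in POWER.items():
    for f in freqs:
        print(cpu, f, placement_energy(cpu, f, load))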

--
Catalin

2014-01-21 12:14:32

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [11/11] system 1: Saving energy using DVFS

Hi,

On Mon, Jan 20, 2014 at 04:32:54PM +0000, Pavel Machek wrote:
> On Tue 2014-01-07 16:19:47, Morten Rasmussen wrote:
> > Most modern systems use DVFS to save power by slowing down computation
> > throughput when less performance is necessary. The power/performance
> > relation is platform specific. Some platforms may have better energy
> > savings (energy per instruction) than others at low frequencies.
> >
> > To have something to relate to, here is an anonymized example based on
> > a modern ARM platform:
>
> And here is anonymized example I pulled out of my hat:
>
> Ammount of anonymization Usefulness of information
> 0.0 1.0
> 0.5 0.05
> 1.0 0.0
>
> Come on, you can surely do better than "trust me, it is modern". Now
> we can't verify those numbers. And they don't make sense.
>
> > Performance Energy/instruction
> > 1.0 1.0
> > 1.3 1.6
> > 1.7 1.8
> > 2.0 1.9
> > 2.3 2.1
> > 2.7 2.4
> > 3.0 2.7
> >
> > Performance is frequency (~instruction issue rate) and
> > energy/instruction is the energy cost of executing one (or a fixed
> > number of instructions) at that level of performance (frequency). For
> > this example, it costs 2.7x more energy per instruction to increase the
> > performance from 1.0 to 3.0 (3x). That is, the amount of work
> > (instructions) that can be done on one battery charge is reduced by 2.7x
> > (~63%) if you run as fast as possible (3.0) compared to running at
> > slowest frequency (1.0).
>
> This very heavily depends on what you count to the total energy,
> right? And it is very hard to argue with you before you anonymized
> your numbers.

Just to clarify, the numbers above are cpu only, as already stated in
the linux-pm thread referenced in the cover letter. We do of course
need to consider the total energy (cpu, gpu and memory at least) when
verifying whether any optimization actually saves energy.

As already discussed, battery power is suitable for this purpose on
end-product form-factor systems. However, for development hardware the
picture might be quite different (extra onboard devices and such).

>
> Anyway, you assuming modern system, low frequency should be cca
> 0.5GHz, with high cca 1.5GHz. Do you claim that operation on 1.5GHz
> takes 9x the power of 0.5GHz operation?

On this particular platform, increasing the frequency by 3x increases
power by 8.1x.
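
(For reference, that is roughly what the earlier table implies if those
end points correspond to these frequencies: 3x the instruction rate at
~2.7x the energy per instruction gives about 3.0 * 2.7 = 8.1x the
power.)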

>
> Do you count DRAM to the power consumption?

As said above, the numbers are cpu only.

>
> > To save energy, the higher frequencies should be avoided and only used
> > when the application performance requirements can not be satisfied
> > otherwise (e.g. spread tasks across more cpus if possible).
>
> This is in very steep contrast with race-to-idle on the PCs.

I think Catalin already covered elsewhere in this thread why
race-to-idle isn't always the best idea. Basically, it wastes a lot of
energy in use-cases like audio and video playback. I have provided
descriptions of these use-cases as part of this set of emails.

>
> > When considering the total system power it may save energy in some
> > scenarios by running the cpu faster to allow other power hungry parts of
> > the system to be shut down faster. However, this is highly platform and
> > application dependent.
>
> Aha. Devil is in the details. "I pulled random numbers out of the hat,
> and they are wrong, but they are wrong in platform specific way. And I
> anonymized them for you so that you can't verify them".
>
> Can we talk specific machine, please? You are talking Android all the
> time, so pick one cellphone you care about, and provide real numbers...

As Catalin already said, there are a number of reasons why we can't
share absolute numbers publicly. I think the numbers I posted give a
pretty good picture of the trade-offs involved in frequency scaling on
a modern ARM SoC. This posting was meant to be the start of a
discussion and I'm hoping to add more information (in anonymized form)
in the future.

Morten

2014-01-21 12:20:15

by Pavel Machek

[permalink] [raw]
Subject: Re: [11/11] system 1: Saving energy using DVFS

Hi!

> > > That's a 1750mAs difference. There are of course other parts drawing
> > > current but simple things like the above really make a difference in the
> > > mobile space, both in terms of battery and thermal budget.
> >
> > Aha, I noticed the values are now the other way around. [And notice
> > that if user _does_ lock/turn off the screen after the operation,
> > difference between power consumptions is factor of two. People do turn
> > off screens before putting phone back in pocket.]
>
> It depends on the use-case, that's why the problem is so complicated.
> Race-to-idle may work well if just checking bus timetables but not if
> you are watching video or listening to music (the latter with screen
> off).

Exactly, it is complex. That's why it is important to get real
numbers, please.

And yes, if your _system_ has low power consumption in
active-at-low-frequency mode, race-to-idle may not be a win for you.

> > You are right that as long as user does _not_ wait for the computation
> > result, running at low frequency might make sense. That may be true on
> > cellphone so fast that all the actions are "instant". I have yet to
> > see such cellphone. That probably means that staying on low frequency
> > normally and going to high after cpu is busy for 100msec or so is
> > right thing: if cpu is busy for 100msec, it probably means user is
> > waiting for the result.
>
> I'm talking about use-cases where a task (or multiple threads) are
> running and only loading the CPU partially (audio or video playback).
> Here you have an average number of instructions to execute per decoded
> frame in a certain time. Once the frame is decoded, the CPU can go idle,
> so you can choose whether to race to idle or run at lower frequency (and
> lower energy per the same number of frame decoding instructions) with
> less idle time. There are modern platforms where the latter behaviour is
> more efficient.

So, my Thinkpad X60 is not such a platform. Early Athlon64 notebooks
_were_ such platforms. Can you give an example of a modern platform you
are talking about?

> I would really like race to idle to be true for all cases, it would
> simplify the kernel and we could just remove cpufreq, always running the
> CPUs at max frequency. But so far I don't see Intel ignoring this
> problem either, they keep developing a pstate driver which changes the
> P-states based on average CPU load.

Race-to-idle is a win on all modern x86 systems, because they have high
power consumption even at low non-idle frequencies, due to leakage. We
still keep P-states for cooling, for completeness and for older
systems.

> > But it depends on the numbers you did not tell us. I'm pretty sure
> > N900 does _not_ have 11% power consuption at 33% performance; I just
> > assumed so for sake of argument.
> >
> > So, really, details are needed.
>
> If that's the only issue to be addressed, I'm happy to ignore the
> frequency scaling initially and focus on idle. But since people still do
> frequency scaling and this would interfere with the scheduler, we have

I guess there are modern platforms and workloads where frequency
scaling makes sense. You only need to find one, and provide numbers
for it. Please.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2014-01-21 12:31:25

by Pavel Machek

[permalink] [raw]
Subject: Re: [11/11] system 1: Saving energy using DVFS

Hi!

> > > Performance is frequency (~instruction issue rate) and
> > > energy/instruction is the energy cost of executing one (or a fixed
> > > number of instructions) at that level of performance (frequency). For
> > > this example, it costs 2.7x more energy per instruction to increase the
> > > performance from 1.0 to 3.0 (3x). That is, the amount of work
> > > (instructions) that can be done on one battery charge is reduced by 2.7x
> > > (~63%) if you run as fast as possible (3.0) compared to running at
> > > slowest frequency (1.0).
> >
> > This very heavily depends on what you count to the total energy,
> > right? And it is very hard to argue with you before you anonymized
> > your numbers.
>
> Just to clarify, the numbers above are cpu only as already stated in the
> linux-pm thread referenced in the cover letter. We do of course need to
> consider the total energy (cpu, gpu and memory at least) when verifying
> whether any optimization does save energy or not.

Yes.

> As already discussed, battery power is suitable for this purpose on end
> product form factor systems. However, for development hardware that
> might be quite different (extra onboard devices and such).

Is the behavior of current production hardware significantly different
from the secret development boards you have? I don't think so. So can
we get measurements on real production hardware?

> > Anyway, you assuming modern system, low frequency should be cca
> > 0.5GHz, with high cca 1.5GHz. Do you claim that operation on 1.5GHz
> > takes 9x the power of 0.5GHz operation?
>
> On this particular platform, increasing the frequency by 3x increases
> power by 8.1x.

Let's call your platform TopSecret.

_CPU_ power. If your DRAM eats as much power as the CPU on that
platform, and enters low-power mode when the CPU does, race-to-idle is
still a win.

> > > To save energy, the higher frequencies should be avoided and only used
> > > when the application performance requirements can not be satisfied
> > > otherwise (e.g. spread tasks across more cpus if possible).
> >
> > This is in very steep contrast with race-to-idle on the PCs.
>
> I think Catalin already covered why race-to-idle isn't always the best
> idea elsewhere in this thread. Basically, it is wasting a lot of energy
> in use-cases like audio and video playback. I have provided descriptions
> of these use-cases as part of this set of emails.

That's the problem. He demonstrated that on the TopSecret platform
race-to-idle is not a good idea, assuming the CPU and display are the
only parts eating power. But that's not true even on the TopSecret
platform.

So, at the very least, we need to know the amount of power taken by the
CPU idle/active and by DRAM idle/active.
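
To make that concrete, a small Python sketch with made-up numbers (the
11%-at-33% cpu figure is the one assumed earlier in this thread; the
DRAM figure is invented) showing how an always-on-while-busy component
can flip the conclusion:

CPU_HIGH = 1.0     # relative cpu power at max frequency
CPU_LOW = 0.11     # ~11% power at 33% performance, as assumed earlier
DRAM_ACTIVE = 0.4  # assumed power of DRAM kept active while the cpu runs

def job_energy(cpu_power, runtime_s):
    # energy to finish one job: cpu plus DRAM for as long as the cpu runs
    return (cpu_power + DRAM_ACTIVE) * runtime_s

race_to_idle = job_energy(CPU_HIGH, runtime_s=10)    # 10 s at max frequency
low_frequency = job_energy(CPU_LOW, runtime_s=30)    # 30 s at 33% speed
print(race_to_idle, low_frequency)   # 14.0 vs 15.3: race to idle wins here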

> > Can we talk specific machine, please? You are talking Android all the
> > time, so pick one cellphone you care about, and provide real numbers...
>
> As Catalin already said, there is a number of reason why we can't share
> absolute numbers publicly. I think the numbers I posted give a pretty
> good picture of the trade-offs involved in frequency scaling on a modern
> ARM Soc.

Unfortunately, as explained above, the numbers for TopSecret are not
useful. And that is a reason to re-do the measurements on some
non-secret machine.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2014-01-21 19:01:47

by Kalle Jokiniemi

[permalink] [raw]
Subject: Re: [11/11] system 1: Saving energy using DVFS

Hi,

On 20.01.2014 20:25, Sebastian Reichel wrote:
> On Mon, Jan 20, 2014 at 06:54:32PM +0100, Pavel Machek wrote:
>> On Mon 2014-01-20 17:10:29, Catalin Marinas wrote:
>>> On Mon, Jan 20, 2014 at 04:49:26PM +0000, Pavel Machek wrote:
>>>>> To save energy, the higher frequencies should be avoided and only used
>>>>> when the application performance requirements can not be satisfied
>>>>> otherwise (e.g. spread tasks across more cpus if possible).
>>>>
>>>> I argue this is untrue for any task where user waits for its
>>>> completion with screen on. (And that's quite important subset).
>>>>
>>>> Lets take Nokia n900 as an example.
>>>>
>>>> (source http://wiki.maemo.org/N900_Hardware_Power_Consumption)
>>>>
>>>> Sleeping CPU: 2mA
>>>> Screen on: 230mA
>>>> CPU loaded: 250mA
>>>>
>>>> Now, lets believe your numbers and pretend system can operate at 33%
>>>> of speed with 11% power consumption.
>>>>
>>>> Lets take task that takes 10 seconds on max frequency:
>>>>
>>>> ~ 10s * 470mA = 4700mAs
>>>>
>>>> You suggest running at 33% speed, instead; that means 30 seconds on
>>>> low requency.
>>>>
>>>> CPU on low: 25mA (assumed).
>>>>
>>>> ~ 30s * 255mA = 7650mAs
>>>>
>>>> Hmm. So race to idle is good thing on Intel machines, and it is good
>>>> thing on ARM design I have access to.
>>>
>>> Race to idle doesn't mean that the screen goes off as well. Let's say
>>> the screen stays on for 1 min and the CPU needs to be running for 10s
>>> over this minute, in the first case you have:
>>
>> No, it does not. I just assumed user is continuing to use his
>> machine. Obviously, waiting 60 seconds with screen on will make the
>> difference look smaller. But your solution still means user has to
>> wait longer _and_ you consume more battery doing so.
>>
>> And this is for any task where user waits for result with screen
>> on. Like rendering a webpage. Like opening settings screen. Like
>> installing application.
>>
>> There are not too many background tasks on a cellphone.
>>
>> But hey, maybe you are right and running at lowest possible frequency
>> is right. Please provide concrete numbers like I did.
>
> So what about using the display status information for power
> management? Basically always using the lowest frequency should be ok
> on phones if the display is disabled?

Well, not really. There are a lot of devices running the Linux kernel,
and there are always devices and use cases that can't operate if you
hardcode something like that.

It is good to know what the problematic use cases are, but it usually
does not end well if you optimize for specifics. End users are
unpredictable in the ways they use their devices :)

And these days audio playback power optimization in smartphones is
mostly for product spec marketing purposes anyway :P

I think the discussion has been going in the right direction:
- find what data we have to make better decisions
- find ways to utilize that data

And then in the end the big smartphone manufacturers will twist that to
their use cases in horrible ways to meet product specs on tight
schedules :D But at least the starting point will be closer to the
target.

Even on the N900, DVFS is beneficial for audio playback. But that was
not because the CPU consumed less; it was because the peripheral bus
frequency was tied to the CPU frequency. There we ended up removing a
very low 125MHz CPU operating point so that the device ran at the
highest possible 250MHz CPU rate (to idle quickly) while still keeping
the lower peripheral bus speed.

So we raced to idle and used DVFS... how nice :)

- Kalle

>
> -- Sebastian
>