Hi,
A number of patch sets related to power-efficient scheduling have been
posted over the last couple of months. Most of them do not have much
data to back them up, so I decided to do some testing.
Common for all of the patch sets that I have tested, except one, is that
they attempt to pack tasks on as few cpus as possible to allow the
remaining cpus to enter deeper sleep states - a strategy that should
make sense on most platforms that support per-cpu power gating and
multi-socket machines.
Kernel: 3.9
Patch sets:
rlb-v4: sched: use runnable load based balance (Alex Shi)
<https://lkml.org/lkml/2013/4/27/13>
pas-v7: sched: power aware scheduling (Alex Shi)
<https://lkml.org/lkml/2013/4/3/732>
pst-v3: sched: packing small tasks (Vincent Guittot)
<https://lkml.org/lkml/2013/3/22/183>
pst-v4: sched: packing small tasks (Vincent Guittot)
<https://lkml.org/lkml/2013/4/25/396>
Configuration:
pas-v7: Set to "powersaving" mode.
pst-v4: Set to "Full" packing mode.
Platform:
ARM TC2 (test-chip), 2xCortex-A15 + 3xCortex-A7. Cortex-A15s disabled.
Measurement technique:
Time spent non-idle (not in idle state) for each cpu based on cpuidle
ftrace events. TC2 does not have per-core power-gating, so packing
inside the A7 cluster does not lead to any significant power savings.
Note that any product grade hardware (TC2 is a test-chip) will very
likely have per-core power-gating, so in those cases packing will have
an appreciable effect on power savings.
Measuring non-idle time rather than power should give a more clear idea
about the effect of the patch sets given that the idle back-end is
highly implementation specific.
Benchmarks:
audio playback (Android): 30s mp3 file playback on Android.
bbench+audio (Android): Web page rendering while doing mp3 playback.
andebench_native (Android): Android benchmark running in native mode.
cyclictest: Short periodic tasks.
Results:
Two runs for each patch set.
audio playback (Android) SMP
non-idle % cpu 0 cpu 1 cpu 2
3.9_1 11.96 2.86 2.48
3.9_2 12.64 2.81 1.88
rlb-v4_1 12.61 2.44 1.90
rlb-v4_2 12.45 2.44 1.90
pas-v7_1 16.17 0.03 0.24
pas-v7_2 16.08 0.28 0.07
pst-v3_1 15.18 2.76 1.70
pst-v3_2 15.13 0.80 0.38
pst-v4_1 16.14 0.05 0.00
pst-v4_2 16.34 0.06 0.00
bbench+audio (Android) SMP
non-idle % cpu 0 cpu 1 cpu 2 render time
3.9_1 25.00 20.73 21.22 812
3.9_2 24.29 19.78 22.34 795
rlb-v4_1 23.84 19.36 22.74 782
rlb-v4_2 24.07 19.36 22.74 797
pas-v7_1 28.29 17.86 16.01 869
pas-v7_2 28.62 18.54 15.05 908
pst-v3_1 29.14 20.59 21.72 830
pst-v3_2 27.69 18.81 20.06 830
pst-v4_1 42.20 13.63 2.29 880
pst-v4_2 41.56 14.40 2.17 935
andebench_native (8 threads) (Android) SMP
non-idle % cpu 0 cpu 1 cpu 2 Score
3.9_1 99.22 98.88 99.61 4139
3.9_2 99.56 99.31 99.46 4148
rlb-v4_1 99.49 99.61 99.53 4153
rlb-v4_2 99.56 99.61 99.53 4149
pas-v7_1 99.53 99.59 99.29 4149
pas-v7_2 99.42 99.63 99.48 4150
pst-v3_1 97.89 99.33 99.42 4097
pst-v3_2 99.16 99.62 99.42 4097
pst-v4_1 99.34 99.01 99.59 4146
pst-v4_2 99.49 99.52 99.20 4146
cyclictest SMP
non-idle % cpu 0 cpu 1 cpu 2
3.9_1 9.13 8.88 8.41
3.9_2 10.27 8.02 6.30
rlb-v4_1 8.88 8.09 8.11
rlb-v4_2 8.49 8.09 8.11
pas-v7_1 10.20 0.02 11.50
pas-v7_2 7.86 14.31 0.02
pst-v3_1 20.44 8.68 7.97
pst-v3_2 20.41 0.78 1.00
pst-v4_1 21.32 0.21 0.05
pst-v4_2 21.56 0.21 0.04
Overall, pas-v7 seems to do a fairly good job at packing. The idle time
distribution seems to be somewhere between pst-v3 and the more
aggressive pst-v4 for all the benchmarks. pst-v4 manages to keep two
cpus nearly idle (<0.25% non-idle) for both cyclictest and audio, which
is better than both pst-v3 and pas-v7. pas-v7 fails to pack cyclictest.
Packing does come at at cost which can be seen for bbench+audio, where
pst-v3 and rlb-v4 get better render times than pas-v7 and pst-v4 which
do more aggressive packing. rlb-v4 does not pack, it is only included
for reference.
>From a packing perspective pst-v4 seems to do the best job for the
workloads that I have tested on ARM TC2. The less aggressive packing in
pst-v3 may be a better choice for in terms of performance.
I'm well aware that these tests are heavily focused on mobile workloads.
I would therefore encourage people to share your test results for your
workloads on your platforms to complete the picture. Comments are also
welcome.
Thanks,
Morten
On 05/30/2013 09:47 PM, Morten Rasmussen wrote:
> Hi,
>
> A number of patch sets related to power-efficient scheduling have been
> posted over the last couple of months. Most of them do not have much
> data to back them up, so I decided to do some testing.
>
> Common for all of the patch sets that I have tested, except one, is that
> they attempt to pack tasks on as few cpus as possible to allow the
> remaining cpus to enter deeper sleep states - a strategy that should
> make sense on most platforms that support per-cpu power gating and
> multi-socket machines.
>
> Kernel: 3.9
>
> Patch sets:
> rlb-v4: sched: use runnable load based balance (Alex Shi)
> <https://lkml.org/lkml/2013/4/27/13>
Thanks for the valuable comparison!
The runnable load balance target is performance. It is still try to
disperse tasks to as much as possible CPUs. :)
The latest v7 version remove the 6th patch(wake_affine change) in v4.
and plus fix a slept time double counting issue, and remove
blocked_load_avg in tg load.
http://comments.gmane.org/gmane.linux.kernel/1498988
Enjoy!
> pas-v7: sched: power aware scheduling (Alex Shi)
> <https://lkml.org/lkml/2013/4/3/732>
We still have some internal discussion on this patch set before update
it. Sorry for response late on this patchset!
> pst-v3: sched: packing small tasks (Vincent Guittot)
> <https://lkml.org/lkml/2013/3/22/183>
> pst-v4: sched: packing small tasks (Vincent Guittot)
> <https://lkml.org/lkml/2013/4/25/396>
>
> Configuration:
> pas-v7: Set to "powersaving" mode.
> pst-v4: Set to "Full" packing mode.
>
> Platform:
> ARM TC2 (test-chip), 2xCortex-A15 + 3xCortex-A7. Cortex-A15s disabled.
>
> Measurement technique:
> Time spent non-idle (not in idle state) for each cpu based on cpuidle
> ftrace events. TC2 does not have per-core power-gating, so packing
> inside the A7 cluster does not lead to any significant power savings.
> Note that any product grade hardware (TC2 is a test-chip) will very
> likely have per-core power-gating, so in those cases packing will have
> an appreciable effect on power savings.
> Measuring non-idle time rather than power should give a more clear idea
> about the effect of the patch sets given that the idle back-end is
> highly implementation specific.
>
> Benchmarks:
> audio playback (Android): 30s mp3 file playback on Android.
> bbench+audio (Android): Web page rendering while doing mp3 playback.
> andebench_native (Android): Android benchmark running in native mode.
> cyclictest: Short periodic tasks.
>
> Results:
> Two runs for each patch set.
>
> audio playback (Android) SMP
> non-idle % cpu 0 cpu 1 cpu 2
> 3.9_1 11.96 2.86 2.48
> 3.9_2 12.64 2.81 1.88
> rlb-v4_1 12.61 2.44 1.90
> rlb-v4_2 12.45 2.44 1.90
> pas-v7_1 16.17 0.03 0.24
> pas-v7_2 16.08 0.28 0.07
> pst-v3_1 15.18 2.76 1.70
> pst-v3_2 15.13 0.80 0.38
> pst-v4_1 16.14 0.05 0.00
> pst-v4_2 16.34 0.06 0.00
>
> bbench+audio (Android) SMP
> non-idle % cpu 0 cpu 1 cpu 2 render time
> 3.9_1 25.00 20.73 21.22 812
> 3.9_2 24.29 19.78 22.34 795
> rlb-v4_1 23.84 19.36 22.74 782
> rlb-v4_2 24.07 19.36 22.74 797
> pas-v7_1 28.29 17.86 16.01 869
> pas-v7_2 28.62 18.54 15.05 908
> pst-v3_1 29.14 20.59 21.72 830
> pst-v3_2 27.69 18.81 20.06 830
> pst-v4_1 42.20 13.63 2.29 880
> pst-v4_2 41.56 14.40 2.17 935
>
> andebench_native (8 threads) (Android) SMP
> non-idle % cpu 0 cpu 1 cpu 2 Score
> 3.9_1 99.22 98.88 99.61 4139
> 3.9_2 99.56 99.31 99.46 4148
> rlb-v4_1 99.49 99.61 99.53 4153
> rlb-v4_2 99.56 99.61 99.53 4149
> pas-v7_1 99.53 99.59 99.29 4149
> pas-v7_2 99.42 99.63 99.48 4150
> pst-v3_1 97.89 99.33 99.42 4097
> pst-v3_2 99.16 99.62 99.42 4097
> pst-v4_1 99.34 99.01 99.59 4146
> pst-v4_2 99.49 99.52 99.20 4146
>
> cyclictest SMP
> non-idle % cpu 0 cpu 1 cpu 2
> 3.9_1 9.13 8.88 8.41
> 3.9_2 10.27 8.02 6.30
> rlb-v4_1 8.88 8.09 8.11
> rlb-v4_2 8.49 8.09 8.11
> pas-v7_1 10.20 0.02 11.50
> pas-v7_2 7.86 14.31 0.02
> pst-v3_1 20.44 8.68 7.97
> pst-v3_2 20.41 0.78 1.00
> pst-v4_1 21.32 0.21 0.05
> pst-v4_2 21.56 0.21 0.04
>
> Overall, pas-v7 seems to do a fairly good job at packing. The idle time
> distribution seems to be somewhere between pst-v3 and the more
> aggressive pst-v4 for all the benchmarks. pst-v4 manages to keep two
> cpus nearly idle (<0.25% non-idle) for both cyclictest and audio, which
> is better than both pst-v3 and pas-v7. pas-v7 fails to pack cyclictest.
> Packing does come at at cost which can be seen for bbench+audio, where
> pst-v3 and rlb-v4 get better render times than pas-v7 and pst-v4 which
> do more aggressive packing. rlb-v4 does not pack, it is only included
> for reference.
>
> From a packing perspective pst-v4 seems to do the best job for the
> workloads that I have tested on ARM TC2. The less aggressive packing in
> pst-v3 may be a better choice for in terms of performance.
>
> I'm well aware that these tests are heavily focused on mobile workloads.
> I would therefore encourage people to share your test results for your
> workloads on your platforms to complete the picture. Comments are also
> welcome.
>
> Thanks,
> Morten
>
>
--
Thanks
Alex
On 05/31/2013 09:17 AM, Alex Shi wrote:
>> > Kernel: 3.9
>> >
>> > Patch sets:
>> > rlb-v4: sched: use runnable load based balance (Alex Shi)
>> > <https://lkml.org/lkml/2013/4/27/13>
> Thanks for the valuable comparison!
>
> The runnable load balance target is performance. It is still try to
> disperse tasks to as much as possible CPUs. :)
> The latest v7 version remove the 6th patch(wake_affine change) in v4.
> and plus fix a slept time double counting issue, and remove
> blocked_load_avg in tg load.
> http://comments.gmane.org/gmane.linux.kernel/1498988
Even the rlb patch set target is performance, Maybe the power benefit is
due to better balancing?
Anyway I appreciate if you like to test the latest v7 version. :)
https://github.com/alexshi/power-scheduling.git runnablelb
--
Thanks
Alex
* Morten Rasmussen <[email protected]> wrote:
> Hi,
>
> A number of patch sets related to power-efficient scheduling have been
> posted over the last couple of months. Most of them do not have much
> data to back them up, so I decided to do some testing.
Thanks, numbers are always welcome!
> Measurement technique:
> Time spent non-idle (not in idle state) for each cpu based on cpuidle
> ftrace events. TC2 does not have per-core power-gating, so packing
> inside the A7 cluster does not lead to any significant power savings.
> Note that any product grade hardware (TC2 is a test-chip) will very
> likely have per-core power-gating, so in those cases packing will have
> an appreciable effect on power savings.
> Measuring non-idle time rather than power should give a more clear idea
> about the effect of the patch sets given that the idle back-end is
> highly implementation specific.
Note that I still disagree with the whole design notion of having an "idle
back-end" (and a 'cpufreq back end') separate from scheduler power saving
policy, and none of the patch-sets offered so far solve this fundamental
design problem.
PeterZ and me tried to point out the design requirements previously, but
it still does not appear to be clear enough to people, so let me spell it
out again, in a hopefully clearer fashion.
The scheduler has valuable power saving information available:
- when a CPU is busy: about how long the current task expects to run
- when a CPU is idle: how long the current CPU expects _not_ to run
- topology: it knows how the CPUs and caches interrelate and already
optimizes based on that
- various high level and low level load averages and other metrics about
the recent past that show how busy a particular CPU is, how busy the
whole system is, and what the runtime properties of individual tasks is
(how often it sleeps, etc.)
so the scheduler is in an _ideal_ position to do a judgement call about
the near future and estimate how deep an idle state a CPU core should
enter into and what frequency it should run at.
The scheduler is also at a high enough level to host a "I want maximum
performance, power does not matter to me" user policy override switch and
similar user policy details.
No ifs and whens about that.
Today the power saving landscape is fragmented and sad: we just randomly
interface scheduler task packing changes with some idle policy (and
cpufreq policy), which might or might not combine correctly.
Even when the numbers improve, it's an entirely random, essentially
unmaintainable property: because there's no clear split (possible) between
'scheduler policy' and 'idle policy'. This is why we removed the old,
broken power saving scheduler code a year ago: to make room for something
_better_.
So if we want to add back scheduler power saving then what should happen
is genuinely better code:
To create a new low level idle driver mechanism the scheduler could use
and integrate proper power saving / idle policy into the scheduler.
In that power saving framework the already existing scheduler topology
information should be extended with deep idle parameters:
- enumeration of idle states
- how long it takes to enter+exit a particular idle state
- [ perhaps information about how destructive to CPU caches that
particular idle state is. ]
- new driver entry point that allows the scheduler to enter any of the
enumerated idle states. Platform code will not change this state, all
policy decisions and the idle state is decided at the power saving
policy level.
All of this combines into a 'cost to enter and exit an idle state'
estimation plus a way to enter idle states. It should be presented to the
scheduler in a platform independent fashion, but without policy embedded:
a low level platform driver interface in essence.
Thomas Gleixner's recent work to generalize platform idle routines will
further help the implementation of this. (that code is upstream already)
_All_ policy, all metrics, all averaging should happen at the scheduler
power saving level, in a single place, and then the scheduler should
directly drive the new low level idle state driver mechanism.
'scheduler power saving' and 'idle policy' are one and the same principle
and they should be handled in a single place to offer the best power
saving results.
Note that any RFC patch-set that offers an implementation for this could
be structured in a gradual fashion: only implementing it for a limited CPU
range initially. The new framework can then be extended to more and more
CPUs and architectures, incorporating more complicated power saving
features gradually. (The old, existing idle policy code would remain
untouched and available - it would simply not be used when the new policy
is activated.)
I.e. I'm not asking for a 'rewrite the world' kind of impossible task -
I'm providing an actionable path to get improved power saving upstream,
but it has to use a _sane design_.
This is a "line in the sand", a 'must have' design property for any
scheduler power saving patches to be acceptable - and I'm NAK-ing
incomplete approaches that don't solve the root design cause of our power
saving troubles...
Thanks,
Ingo
>
> - enumeration of idle states
>
> - how long it takes to enter+exit a particular idle state
>
> - [ perhaps information about how destructive to CPU caches that
> particular idle state is. ]
>
> - new driver entry point that allows the scheduler to enter any of the
> enumerated idle states. Platform code will not change this state, all
> policy decisions and the idle state is decided at the power saving
> policy level.
>
> All of this combines into a 'cost to enter and exit an idle state'
> estimation plus a way to enter idle states. It should be presented to the
> scheduler in a platform independent fashion, but without policy embedded:
> a low level platform driver interface in essence.
you're missing an aspect.
Deeper idle states on one core, allow (on Intel and AMD at least) the other cores to go faster.
So it's not so simple as "if I want more performance, go less deep". By going less deep you also reduce
overall performance of the system... as well as increase the power usage.
This aspect really really cannot be ignored, it's quite significant today, and going forward
is only going to get more and more significant.
* Arjan van de Ven <[email protected]> wrote:
> >
> > - enumeration of idle states
> >
> > - how long it takes to enter+exit a particular idle state
> >
> > - [ perhaps information about how destructive to CPU caches that
> > particular idle state is. ]
> >
> > - new driver entry point that allows the scheduler to enter any of the
> > enumerated idle states. Platform code will not change this state, all
> > policy decisions and the idle state is decided at the power saving
> > policy level.
> >
> >All of this combines into a 'cost to enter and exit an idle state'
> >estimation plus a way to enter idle states. It should be presented to the
> >scheduler in a platform independent fashion, but without policy embedded:
> >a low level platform driver interface in essence.
>
> you're missing an aspect.
>
> Deeper idle states on one core, allow (on Intel and AMD at least) the
> other cores to go faster. So it's not so simple as "if I want more
> performance, go less deep". By going less deep you also reduce overall
> performance of the system... as well as increase the power usage.
>
> This aspect really really cannot be ignored, it's quite significant
> today, and going forward is only going to get more and more significant.
I'm not missing turbo mode, just wanted to keep the above discussion
simple. For turbo mode the "go for performance" constraints are simply
different, more global. We have similar concerns in the scheduler already
- for example system-global scheduling decisions for NUMA balancing.
Turbo mode in fact shows _why_ it's important to decide this on a higher,
unified level to achieve best results: as the contraints and
interdependencies become more complex it's not a simple CPU-local
CPU-resource utilization decision anymore, but a system-wide one, where
broad kinds of scheduling information is needed to make a good guess.
Thanks,
Ingo
On Fri, May 31, 2013 at 11:52:04AM +0100, Ingo Molnar wrote:
>
> * Morten Rasmussen <[email protected]> wrote:
>
> > Hi,
> >
> > A number of patch sets related to power-efficient scheduling have been
> > posted over the last couple of months. Most of them do not have much
> > data to back them up, so I decided to do some testing.
>
> Thanks, numbers are always welcome!
>
> > Measurement technique:
> > Time spent non-idle (not in idle state) for each cpu based on cpuidle
> > ftrace events. TC2 does not have per-core power-gating, so packing
> > inside the A7 cluster does not lead to any significant power savings.
> > Note that any product grade hardware (TC2 is a test-chip) will very
> > likely have per-core power-gating, so in those cases packing will have
> > an appreciable effect on power savings.
> > Measuring non-idle time rather than power should give a more clear idea
> > about the effect of the patch sets given that the idle back-end is
> > highly implementation specific.
>
> Note that I still disagree with the whole design notion of having an "idle
> back-end" (and a 'cpufreq back end') separate from scheduler power saving
> policy, and none of the patch-sets offered so far solve this fundamental
> design problem.
>
> PeterZ and me tried to point out the design requirements previously, but
> it still does not appear to be clear enough to people, so let me spell it
> out again, in a hopefully clearer fashion.
>
> The scheduler has valuable power saving information available:
>
> - when a CPU is busy: about how long the current task expects to run
>
> - when a CPU is idle: how long the current CPU expects _not_ to run
>
> - topology: it knows how the CPUs and caches interrelate and already
> optimizes based on that
>
> - various high level and low level load averages and other metrics about
> the recent past that show how busy a particular CPU is, how busy the
> whole system is, and what the runtime properties of individual tasks is
> (how often it sleeps, etc.)
>
> so the scheduler is in an _ideal_ position to do a judgement call about
> the near future and estimate how deep an idle state a CPU core should
> enter into and what frequency it should run at.
>
> The scheduler is also at a high enough level to host a "I want maximum
> performance, power does not matter to me" user policy override switch and
> similar user policy details.
>
> No ifs and whens about that.
>
> Today the power saving landscape is fragmented and sad: we just randomly
> interface scheduler task packing changes with some idle policy (and
> cpufreq policy), which might or might not combine correctly.
>
> Even when the numbers improve, it's an entirely random, essentially
> unmaintainable property: because there's no clear split (possible) between
> 'scheduler policy' and 'idle policy'. This is why we removed the old,
> broken power saving scheduler code a year ago: to make room for something
> _better_.
>
> So if we want to add back scheduler power saving then what should happen
> is genuinely better code:
>
> To create a new low level idle driver mechanism the scheduler could use
> and integrate proper power saving / idle policy into the scheduler.
>
> In that power saving framework the already existing scheduler topology
> information should be extended with deep idle parameters:
>
> - enumeration of idle states
>
> - how long it takes to enter+exit a particular idle state
>
> - [ perhaps information about how destructive to CPU caches that
> particular idle state is. ]
>
> - new driver entry point that allows the scheduler to enter any of the
> enumerated idle states. Platform code will not change this state, all
> policy decisions and the idle state is decided at the power saving
> policy level.
>
> All of this combines into a 'cost to enter and exit an idle state'
> estimation plus a way to enter idle states. It should be presented to the
> scheduler in a platform independent fashion, but without policy embedded:
> a low level platform driver interface in essence.
>
> Thomas Gleixner's recent work to generalize platform idle routines will
> further help the implementation of this. (that code is upstream already)
>
> _All_ policy, all metrics, all averaging should happen at the scheduler
> power saving level, in a single place, and then the scheduler should
> directly drive the new low level idle state driver mechanism.
>
> 'scheduler power saving' and 'idle policy' are one and the same principle
> and they should be handled in a single place to offer the best power
> saving results.
>
> Note that any RFC patch-set that offers an implementation for this could
> be structured in a gradual fashion: only implementing it for a limited CPU
> range initially. The new framework can then be extended to more and more
> CPUs and architectures, incorporating more complicated power saving
> features gradually. (The old, existing idle policy code would remain
> untouched and available - it would simply not be used when the new policy
> is activated.)
>
> I.e. I'm not asking for a 'rewrite the world' kind of impossible task -
> I'm providing an actionable path to get improved power saving upstream,
> but it has to use a _sane design_.
>
> This is a "line in the sand", a 'must have' design property for any
> scheduler power saving patches to be acceptable - and I'm NAK-ing
> incomplete approaches that don't solve the root design cause of our power
> saving troubles...
Thanks for sharing your view.
I agree with idea of having a high level user switch to change
power/performance policy trade-offs for the system. Not only for
scheduling. I also share your view that the scheduler is in the ideal
place to drive the frequency scaling and idle policies.
However, I think that an integrated solution with one unified policy
implemented in the scheduler would take a significant rewrite of the
scheduler and the power management frameworks even if we start with just
a few SoCs.
To reach an integrated solution that does better than the current
approach there is a range of things that need to be considered:
- Define a power-efficient scheduling policy. Depending on the power
gating support on the particular system packing tasks may improve
power-efficiency while spreading the tasks may be better for others.
- Define how the user policy switch works. In previous discussions it
was proposed to have a high level switch that allows specification of
what the system should strive to achieve - power saving or performance.
In those discussions, what power meant wasn't exactly defined.
- Find a generic way to represent the power topology which includes
power domains, voltage domains and frequency domains. Also, more
importantly how we can derive the optimal power/performance policy for
the specific platform. There may be dependencies between idle and
frequency states like it is the case for frequency boost mode like Arjan
mentions in his reply.
- The fact that not all platforms expose all idle states to the OS and
that closed firmware may do whatever it likes behind the scenes. There
are various reasons to do this. Not all of them are bad.
- Define a scheduler driven frequency scaling policy that at least
matches the 'performance' of the current cpufreq policies and has
potential for further improvements.
- Match the power savings of the current cpuidle governors which are
based on arcane heuristics developed over years to predict things like
the occurrence of the next interrupt.
- Thermal aspects add more complexity to the power/performance policy.
Depending on the platform, overheating may be handled by frequency
capping or restricting the number of active cpus.
- Asymmetric/heterogeneous multi-processors need to be dealt with.
This is not a complete list. My point is that moving all policy to the
scheduler will significantly increase the complexity of the scheduler.
It is my impression that the general opinion is that the scheduler is
already too complicated. Correct me if I'm wrong.
While the proposed task packing patches are not complete solutions, they
address the first item on the above list and can be seen as a step
towards the goal.
Should I read your recommendation as you prefer a complete and
potentially huge patch set over incremental patch sets?
It would be good to have even a high level agreement on the path forward
where the expectation first and foremost is to take advantage of the
schedulers ideal position to drive the power management while
simplifying the power management code.
Thanks,
Morten
On Fri, May 31, 2013 at 4:22 PM, Ingo Molnar <[email protected]> wrote:
>
> * Morten Rasmussen <[email protected]> wrote:
>
>> Hi,
>>
>> A number of patch sets related to power-efficient scheduling have been
>> posted over the last couple of months. Most of them do not have much
>> data to back them up, so I decided to do some testing.
>
> Thanks, numbers are always welcome!
>
>> Measurement technique:
>> Time spent non-idle (not in idle state) for each cpu based on cpuidle
>> ftrace events. TC2 does not have per-core power-gating, so packing
>> inside the A7 cluster does not lead to any significant power savings.
>> Note that any product grade hardware (TC2 is a test-chip) will very
>> likely have per-core power-gating, so in those cases packing will have
>> an appreciable effect on power savings.
>> Measuring non-idle time rather than power should give a more clear idea
>> about the effect of the patch sets given that the idle back-end is
>> highly implementation specific.
>
> Note that I still disagree with the whole design notion of having an "idle
> back-end" (and a 'cpufreq back end') separate from scheduler power saving
> policy, and none of the patch-sets offered so far solve this fundamental
> design problem.
I don't think you'll see any argument on this one.
> PeterZ and me tried to point out the design requirements previously, but
> it still does not appear to be clear enough to people, so let me spell it
> out again, in a hopefully clearer fashion.
It hasn't been spelled out in as many words before, so thank you!
> The scheduler has valuable power saving information available:
>
> - when a CPU is busy: about how long the current task expects to run
>
> - when a CPU is idle: how long the current CPU expects _not_ to run
>
> - topology: it knows how the CPUs and caches interrelate and already
> optimizes based on that
>
> - various high level and low level load averages and other metrics about
> the recent past that show how busy a particular CPU is, how busy the
> whole system is, and what the runtime properties of individual tasks is
> (how often it sleeps, etc.)
>
> so the scheduler is in an _ideal_ position to do a judgement call about
> the near future and estimate how deep an idle state a CPU core should
> enter into and what frequency it should run at.
>
> The scheduler is also at a high enough level to host a "I want maximum
> performance, power does not matter to me" user policy override switch and
> similar user policy details.
>
> No ifs and whens about that.
>
> Today the power saving landscape is fragmented and sad: we just randomly
> interface scheduler task packing changes with some idle policy (and
> cpufreq policy), which might or might not combine correctly.
>
> Even when the numbers improve, it's an entirely random, essentially
> unmaintainable property: because there's no clear split (possible) between
> 'scheduler policy' and 'idle policy'. This is why we removed the old,
> broken power saving scheduler code a year ago: to make room for something
> _better_.
>
> So if we want to add back scheduler power saving then what should happen
> is genuinely better code:
My understanding (and that of several of my colleagues) in discussions
with some of the folks on cc was that we wanted the following things
to happen in somewhat this order:
1. Replacement for task packing bits of sched_mc (Vincent's packing
small task patchset)
2. General scalability improvements and low-hanging fruit e.g. Thomas'
hotplug/kthread rework, un-pinned workqueues (queued for 3.11 by
Tejun), migrating running timers (RFC patches being discussed),
Adaptive NO_HZ, etc.
3. Scheduler-driven CPU states (DVFS and idle)
a. More CPU topology information in scheduler (to replace
related_cpus, affected_cpus, couple C-states and other such
constructs)
b. Intermediate step to replace cpufreq/cpuidle governors with a
'sched governor' that uses scheduler statistics instead of heuristics
in the governors today.
c. Thermal input into scheduling decisions
d. Co-existing sched-driven and legacy cpufreq/cpuidle policies
e. Switch over newer HW to default to sched-driven policy
Morten has already gone in great detail about some of the things we
need to address before the scheduler can drive power management.
What you've outlined in this email more or less reverses the order we
had in mind. And that is fine as long as we're all agreeing that it is
the way forward. More below.
> To create a new low level idle driver mechanism the scheduler could use
> and integrate proper power saving / idle policy into the scheduler.
>
> In that power saving framework the already existing scheduler topology
> information should be extended with deep idle parameters:
>
> - enumeration of idle states
>
> - how long it takes to enter+exit a particular idle state
>
> - [ perhaps information about how destructive to CPU caches that
> particular idle state is. ]
>
> - new driver entry point that allows the scheduler to enter any of the
> enumerated idle states. Platform code will not change this state, all
> policy decisions and the idle state is decided at the power saving
> policy level.
>
> All of this combines into a 'cost to enter and exit an idle state'
> estimation plus a way to enter idle states. It should be presented to the
> scheduler in a platform independent fashion, but without policy embedded:
> a low level platform driver interface in essence.
>
> Thomas Gleixner's recent work to generalize platform idle routines will
> further help the implementation of this. (that code is upstream already)
>
> _All_ policy, all metrics, all averaging should happen at the scheduler
> power saving level, in a single place, and then the scheduler should
> directly drive the new low level idle state driver mechanism.
>
> 'scheduler power saving' and 'idle policy' are one and the same principle
> and they should be handled in a single place to offer the best power
> saving results.
>
> Note that any RFC patch-set that offers an implementation for this could
> be structured in a gradual fashion: only implementing it for a limited CPU
> range initially. The new framework can then be extended to more and more
> CPUs and architectures, incorporating more complicated power saving
> features gradually. (The old, existing idle policy code would remain
> untouched and available - it would simply not be used when the new policy
> is activated.)
>
> I.e. I'm not asking for a 'rewrite the world' kind of impossible task -
> I'm providing an actionable path to get improved power saving upstream,
> but it has to use a _sane design_.
Someone will have to rewrite the world at some point. IMHO, you're
just asking for the schedule to be brought forward. :)
Doing steps 1. and 2. has brought us to an acceptable
power/performance threshold. Sure, we still have separate cpuidle,
cpufreq and thermal subsystems that sometimes fight each other, but it
is mostly a well-understood problem with known workarounds. Step 3
feels like good hygiene at this point, but one that we intend to help
with.
> This is a "line in the sand", a 'must have' design property for any
> scheduler power saving patches to be acceptable - and I'm NAK-ing
> incomplete approaches that don't solve the root design cause of our power
> saving troubles...
>From what I've read in your proposal, you want step 3. done first. Am
I correct in that assumption? I really want to nail down the
requirements and perhaps a sequence of steps that you might have in
mind.
Can we also expect more timely feedback/flames on this topic going forward?
Regards,
Amit
Hi,
On 05/31/2013 04:22 PM, Ingo Molnar wrote:
> PeterZ and me tried to point out the design requirements previously, but
> it still does not appear to be clear enough to people, so let me spell it
> out again, in a hopefully clearer fashion.
>
> The scheduler has valuable power saving information available:
>
> - when a CPU is busy: about how long the current task expects to run
>
> - when a CPU is idle: how long the current CPU expects _not_ to run
>
> - topology: it knows how the CPUs and caches interrelate and already
> optimizes based on that
>
> - various high level and low level load averages and other metrics about
> the recent past that show how busy a particular CPU is, how busy the
> whole system is, and what the runtime properties of individual tasks is
> (how often it sleeps, etc.)
>
> so the scheduler is in an _ideal_ position to do a judgement call about
> the near future and estimate how deep an idle state a CPU core should
> enter into and what frequency it should run at.
I don't think the problem lies in the fact that scheduler is not making
these decisions about which idle state the CPU should enter or which
frequency the CPU should run at.
IIUC, I think the problem lies in the part where although the
*cpuidle and cpufrequency governors are co-operating with the scheduler,
the scheduler is not doing the same.*
Let me elaborate with respect to cpuidle subsystem. When the scheduler
chooses the CPUs to run tasks on, it leaves certain other CPUs idle. The
cpuidle governor then evaluates, among other things, the load average of
the CPUs, before deciding to put it into an ideal idle state. With the
PJT's metric, an idle CPU's load average degrades over time and cpuidle
governor will perhaps decide to put such CPUs to deep idle states.
But the problem surfaces when scheduler gets to choose a CPU to run
new/woken up tasks on. It chooses the *idlest_cpu* to run the task on
without considering how deep an idle state that CPU is in,if at all it
is in an idle state. It would end up waking a deep sleeping CPU, which
will *hinder power savings*.
I think here is where we need to focus. Currently, there is no
*two way co-operation between the scheduler and cpuidle/cpufrequency*
subsystems, which makes no sense. In the above case for instance
scheduler prompts the cpuidle governor to put CPU to idle state and
comes back to hamper that move.
>
> The scheduler is also at a high enough level to host a "I want maximum
> performance, power does not matter to me" user policy override switch and
> similar user policy details.
>
> No ifs and whens about that.
>
> Today the power saving landscape is fragmented and sad: we just randomly
> interface scheduler task packing changes with some idle policy (and
> cpufreq policy), which might or might not combine correctly.
I would repeat here that today we interface cpuidle/cpufrequency
policies with scheduler but not the other way around. They do their bit
when a cpu is busy/idle. However scheduler does not see that somebody
else is taking instructions from it and comes back to give different
instructions!
Therefore I think among other things, this is one fundamental issue that
we need to resolve in the steps towards better power savings through
scheduler.
Regards
Preeti U Murthy
Hi Morten,
I have one point to make below.
On 06/04/2013 08:33 PM, Morten Rasmussen wrote:
>
> Thanks for sharing your view.
>
> I agree with idea of having a high level user switch to change
> power/performance policy trade-offs for the system. Not only for
> scheduling. I also share your view that the scheduler is in the ideal
> place to drive the frequency scaling and idle policies.
>
> However, I think that an integrated solution with one unified policy
> implemented in the scheduler would take a significant rewrite of the
> scheduler and the power management frameworks even if we start with just
> a few SoCs.
>
> To reach an integrated solution that does better than the current
> approach there is a range of things that need to be considered:
>
> - Define a power-efficient scheduling policy. Depending on the power
> gating support on the particular system packing tasks may improve
> power-efficiency while spreading the tasks may be better for others.
>
> - Define how the user policy switch works. In previous discussions it
> was proposed to have a high level switch that allows specification of
> what the system should strive to achieve - power saving or performance.
> In those discussions, what power meant wasn't exactly defined.
>
> - Find a generic way to represent the power topology which includes
> power domains, voltage domains and frequency domains. Also, more
> importantly how we can derive the optimal power/performance policy for
> the specific platform. There may be dependencies between idle and
> frequency states like it is the case for frequency boost mode like Arjan
> mentions in his reply.
>
> - The fact that not all platforms expose all idle states to the OS and
> that closed firmware may do whatever it likes behind the scenes. There
> are various reasons to do this. Not all of them are bad.
>
> - Define a scheduler driven frequency scaling policy that at least
> matches the 'performance' of the current cpufreq policies and has
> potential for further improvements.
>
> - Match the power savings of the current cpuidle governors which are
> based on arcane heuristics developed over years to predict things like
> the occurrence of the next interrupt.
>
> - Thermal aspects add more complexity to the power/performance policy.
> Depending on the platform, overheating may be handled by frequency
> capping or restricting the number of active cpus.
>
> - Asymmetric/heterogeneous multi-processors need to be dealt with.
>
> This is not a complete list. My point is that moving all policy to the
> scheduler will significantly increase the complexity of the scheduler.
> It is my impression that the general opinion is that the scheduler is
> already too complicated. Correct me if I'm wrong.
I don't think this is the idea. As you have rightly pointed out above,
the current cpuidle and cpufrequency governors are based on heuristics
that have been developed over years. So in my opinion, we must not
strive at duplicating this effort in the scheduler, rather we must
strive at improving the co-operation between scheduler and these governors.
As I have mentioned in the reply to Ingo's mail, we do not have a two
way co-operation between cpuidle/cpufrequency subsystems and scheduler.
When the scheduler decides not to schedule tasks on certain CPUs for a
long time the cpuidle governor for instance, puts them into deep idle
state since it looks at load average of CPUs, among other things before
doing this.
So here we notice that cpuidle is *listening* to scheduler decisions.
However when the scheduler decides to schedule newer/woken up tasks, it
looks for the *idlest* cpu to run them on, without considering which
idle state that CPU is in. The result is waking up a deep idle state
CPU, rather than a shallow one, thus hindering power savings. IOW, the
scheduler is *not listening* to the decisions taken by the cpuidle governor.
If we observe the basis and the principle of scheduling today, the
scheduler makes its decisions based on the scheduling domain hierarchy
and more importantly the *load* on the CPUs. It does not consider other
aspects like idleness/frequency/ thermal aspects among the things that
you and Ingo have pointed out. I think here is where we need to step in.
We need scheduler to be *well aware* of its ecosystem,
*not necessarily decide this ecosystem*.
As Amit Kucheria has pointed out, currently without this two way
co-operation, we might see scheduler fighting with these subsystems.
We could as one of the steps to power savings in scheduler, try and
eliminate that.
> While the proposed task packing patches are not complete solutions, they
> address the first item on the above list and can be seen as a step
> towards the goal.
>
> Should I read your recommendation as you prefer a complete and
> potentially huge patch set over incremental patch sets?
>
> It would be good to have even a high level agreement on the path forward
> where the expectation first and foremost is to take advantage of the
> schedulers ideal position to drive the power management while
> simplifying the power management code.
>
> Thanks,
> Morten
>
Regards
Preeti U Murthy
Hi Preeti,
On 7 June 2013 07:03, Preeti U Murthy <[email protected]> wrote:
> On 05/31/2013 04:22 PM, Ingo Molnar wrote:
>> PeterZ and me tried to point out the design requirements previously, but
>> it still does not appear to be clear enough to people, so let me spell it
>> out again, in a hopefully clearer fashion.
>>
>> The scheduler has valuable power saving information available:
>>
>> - when a CPU is busy: about how long the current task expects to run
>>
>> - when a CPU is idle: how long the current CPU expects _not_ to run
>>
>> - topology: it knows how the CPUs and caches interrelate and already
>> optimizes based on that
>>
>> - various high level and low level load averages and other metrics about
>> the recent past that show how busy a particular CPU is, how busy the
>> whole system is, and what the runtime properties of individual tasks is
>> (how often it sleeps, etc.)
>>
>> so the scheduler is in an _ideal_ position to do a judgement call about
>> the near future and estimate how deep an idle state a CPU core should
>> enter into and what frequency it should run at.
>
> I don't think the problem lies in the fact that scheduler is not making
> these decisions about which idle state the CPU should enter or which
> frequency the CPU should run at.
>
> IIUC, I think the problem lies in the part where although the
> *cpuidle and cpufrequency governors are co-operating with the scheduler,
> the scheduler is not doing the same.*
I think you are missing Ingo's point. It's not about the scheduler
complying with decisions made by various governors in the kernel
(which may or may not have enough information) but rather the
scheduler being in a better position for making such decisions.
Take the cpuidle example, it uses the load average of the CPUs,
however this load average is currently controlled by the scheduler
(load balance). Rather than using a load average that degrades over
time and gradually putting the CPU into deeper sleep states, the
scheduler could predict more accurately that a run-queue won't have
any work over the next x ms and ask for a deeper sleep state from the
beginning.
Of course, you could export more scheduler information to cpuidle,
various hooks (task wakeup etc.) but then we have another framework,
cpufreq. It also decides the CPU parameters (frequency) based on the
load controlled by the scheduler. Can cpufreq decide whether it's
better to keep the CPU at higher frequency so that it gets to idle
quicker and therefore deeper sleep states? I don't think it has enough
information because there are at least three deciding factors
(cpufreq, cpuidle and scheduler's load balancing) which are not
unified.
Some tasks could be known to the scheduler to require significant CPU
cycles when waken up. The scheduler can make the decision to either
boost the frequency of the non-idle CPU and place the task there or
simply wake up the idle CPU. There are all sorts of power implications
here like whether it's better to keep two CPUs at half speed or one at
full speed and the other idle. Such parameters could be provided by
per-platform hooks.
> I would repeat here that today we interface cpuidle/cpufrequency
> policies with scheduler but not the other way around. They do their bit
> when a cpu is busy/idle. However scheduler does not see that somebody
> else is taking instructions from it and comes back to give different
> instructions!
The key here is that cpuidle/cpufreq make their primary decision based
on something controlled by the scheduler: the CPU load (via run-queue
balancing). You would then like the scheduler take such decision back
into account. It just looks like a closed loop, possibly 'unstable' .
So I think we either (a) come up with 'clearer' separation of
responsibilities between scheduler and cpufreq/cpuidle or (b) come up
with a unified load-balancing/cpufreq/cpuidle implementation as per
Ingo's request. The latter is harder but, with a good design, has
potentially a lot more benefits.
A possible implementation for (a) is to let the scheduler focus on
performance load-balancing but control the balance ratio from a
cpufreq governor (via things like arch_scale_freq_power() or something
new). CPUfreq would not be concerned just with individual CPU
load/frequency but also making a decision on how tasks are balanced
between CPUs based on the overall load (e.g. four CPUs are enough for
the current load, I can shut the other four off by telling the
scheduler not to use them).
As for Ingo's preferred solution (b), a proposal forward could be to
factor the load balancing out of kernel/sched/fair.c and provide an
abstract interface (like load_class?) for easier extending or
different policies (e.g. small task packing). You may for example
implement a power saving load policy where idle_balance() does not
pull tasks from other CPUs but rather invoke cpuidle with a prediction
about how long it's going to be idle for. A load class could also give
hints to the cpufreq about the actual load needed using normalised
values and the cpufreq driver could set the best frequency to match
such load. Another hook for task wake-up could place it on the
appropriate run-queue (either for power or performance). And so on.
I don't say the above is the right solution, just a proposal. I think
an initial prototype for Ingo's approach could make a good topic for
the KS.
Best regards.
--
Catalin
On 6/6/2013 11:03 PM, Preeti U Murthy wrote:
> Hi,
>
> On 05/31/2013 04:22 PM, Ingo Molnar wrote:
>> PeterZ and me tried to point out the design requirements previously, but
>> it still does not appear to be clear enough to people, so let me spell it
>> out again, in a hopefully clearer fashion.
>>
>> The scheduler has valuable power saving information available:
>>
>> - when a CPU is busy: about how long the current task expects to run
>>
>> - when a CPU is idle: how long the current CPU expects _not_ to run
>>
>> - topology: it knows how the CPUs and caches interrelate and already
>> optimizes based on that
and I will argue we do too much of this already; various caches (and tlbs) get flushed
(on x86 at least) much much more than you'd think.
>>
>> so the scheduler is in an _ideal_ position to do a judgement call about
>> the near future
this part I will buy
>> and estimate how deep an idle state a CPU core should
>> enter into and what frequency it should run at.
this part I cannot buy.
First of all, we really need to stop thinking about choosing frequency (at least for x86).
that concept basically died for x86 6 years ago.
Second, the interactions between these two, and the "what does it mean if I chose something"
is highly hardware specific and complex nowadays, and going forward is going to be increasingly so.
If anything, we've been moving AWAY from centralized infrastructure there, going towards
CPU specific drivers/policies. And hardware rules are very different between platforms here.
On Intel, asking for different performance is just an MSR write, and going idle is usually just one instruction.
On some ARM, this might involve a long complex interaction calculations, or even *blocking* operation manipulating VRs and PLLs directly... depending
on the platform and the states you want to pick. (Hence the CPUFREQ design of requiring changes to be
done in a kernel thread)
Now, I would like the scheduler to give some notifications at certain events (like migrations,
starting realtime tasks)...but a few atomic notifier chains will do for that.
The policies will be very hardware specific, and thus will live outside the scheduler, no matter which way you
put it. Now, the scheduler can and should participate more in terms of sharing information in both directions...
that I think we can all agree on.
Hi Catalin,
On 06/07/2013 08:21 PM, Catalin Marinas wrote:
> I think you are missing Ingo's point. It's not about the scheduler
> complying with decisions made by various governors in the kernel
> (which may or may not have enough information) but rather the
> scheduler being in a better position for making such decisions.
My mail pointed out that I disagree with this design ("the scheduler
being in a better position for making such decisions").
I think it should be a 2 way co-operation. I have elaborated below.
> Take the cpuidle example, it uses the load average of the CPUs,
> however this load average is currently controlled by the scheduler
> (load balance). Rather than using a load average that degrades over
> time and gradually putting the CPU into deeper sleep states, the
> scheduler could predict more accurately that a run-queue won't have
> any work over the next x ms and ask for a deeper sleep state from the
> beginning.
How will the scheduler know that there will not be work in the near
future? How will the scheduler ask for a deeper sleep state?
My answer to the above two questions are, the scheduler cannot know how
much work will come up. All it knows is the current load of the
runqueues and the nature of the task (thanks to the PJT's metric). It
can then match the task load to the cpu capacity and schedule the tasks
on the appropriate cpus.
As a consequence, it leaves certain cpus idle. The load of these cpus
degrade. It is via this load that the scheduler asks for a deeper sleep
state. Right here we have scheduler talking to the cpuidle governor.
I don't see what the problem is with the cpuidle governor waiting for
the load to degrade before putting that cpu to sleep. In my opinion,
putting a cpu to deeper sleep states should happen gradually. This means
time will tell the governors what kinds of workloads are running on the
system. If the cpu is idle for long, it probably means that the system
is less loaded and it makes sense to put the cpus to deeper sleep
states. Of course there could be sporadic bursts or quieting down of
tasks, but these are corner cases.
>
> Of course, you could export more scheduler information to cpuidle,
> various hooks (task wakeup etc.) but then we have another framework,
> cpufreq. It also decides the CPU parameters (frequency) based on the
> load controlled by the scheduler. Can cpufreq decide whether it's
> better to keep the CPU at higher frequency so that it gets to idle
> quicker and therefore deeper sleep states? I don't think it has enough
> information because there are at least three deciding factors
> (cpufreq, cpuidle and scheduler's load balancing) which are not
> unified.
Why not? When the cpu load is high, cpu frequency governor knows it has
to boost the frequency of that CPU. The task gets over quickly, the CPU
goes idle. Then the cpuidle governor kicks in to put the CPU to deeper
sleep state gradually.
Meanwhile the scheduler should ensure that the tasks are retained on
that CPU,whose frequency is boosted and should not load balance it, so
that they can get over quickly. This I think is what is missing. Again
this comes down to the scheduler taking feedback from the CPU frequency
governors which is not currently happening.
>
> Some tasks could be known to the scheduler to require significant CPU
> cycles when waken up. The scheduler can make the decision to either
> boost the frequency of the non-idle CPU and place the task there or
> simply wake up the idle CPU. There are all sorts of power implications
> here like whether it's better to keep two CPUs at half speed or one at
> full speed and the other idle. Such parameters could be provided by
> per-platform hooks.
This is why the cpuidle and cpufrequency drivers are for. They are meant
to collect such parameters. It is just that the scheduler should be made
aware of them.
>
>> I would repeat here that today we interface cpuidle/cpufrequency
>> policies with scheduler but not the other way around. They do their bit
>> when a cpu is busy/idle. However scheduler does not see that somebody
>> else is taking instructions from it and comes back to give different
>> instructions!
>
> The key here is that cpuidle/cpufreq make their primary decision based
> on something controlled by the scheduler: the CPU load (via run-queue
> balancing). You would then like the scheduler take such decision back
> into account. It just looks like a closed loop, possibly 'unstable' .
Why? Why would you call a scheduler->cpuidle->cpufrequency interaction a
closed loop and not the new_scheduler = scheduler+cpuidle+cpufrequency a
closed loop? Here too the scheduler should be made well aware of the
decisions it took in the past right?
>
> So I think we either (a) come up with 'clearer' separation of
> responsibilities between scheduler and cpufreq/cpuidle
I agree with this. This is what I have been emphasizing, if we feel that
the cpufrequency/ cpuidle subsystems are suboptimal in terms of the
information that they use to make their decisions, let us improve them.
But this will not yield us any improvement if the scheduler does not
have enough information. And IMHO, the next fundamental information that
the scheduler needs should come from cpufreq and cpuidle.
Then we should move onto supplying scheduler information from the power
domain topology, thermal factors, user policies. This does not need a
re-write of the scheduler, this would need a good interface between the
scheduler and the rest of the ecosystem. This ecosystem includes the
cpuidle subsystem, cpu frequency subsystems and they are already in
place. Lets use them.
or (b) come up
> with a unified load-balancing/cpufreq/cpuidle implementation as per
> Ingo's request. The latter is harder but, with a good design, has
> potentially a lot more benefits.
>
> A possible implementation for (a) is to let the scheduler focus on
> performance load-balancing but control the balance ratio from a
> cpufreq governor (via things like arch_scale_freq_power() or something
> new). CPUfreq would not be concerned just with individual CPU
> load/frequency but also making a decision on how tasks are balanced
> between CPUs based on the overall load (e.g. four CPUs are enough for
> the current load, I can shut the other four off by telling the
> scheduler not to use them).
>
> As for Ingo's preferred solution (b), a proposal forward could be to
> factor the load balancing out of kernel/sched/fair.c and provide an
> abstract interface (like load_class?) for easier extending or
> different policies (e.g. small task packing).
Let me elaborate on the patches that have been posted so far on the
power awareness of the scheduler. When we say *power aware scheduler*
what exactly do we want it to do?
In my opinion, we want it to *avoid touching idle cpus*, so as to keep
them in that state longer and *keep more power domains idle*, so as to
yield power savings with them turned off. The patches released so far
are striving to do the latter. Correct me if I am wrong at this. Also
feel free to point out any other expectation from the power aware
scheduler if I am missing any.
If I have got Ingo's point right, the issues with them are that they are
not taking a holistic approach to meet the said goal. Keeping more power
domains idle (by packing tasks) would sound much better if the scheduler
has taken all aspects of doing such a thing into account, like
1. How idle are the cpus, on the domain that it is packing
2. Can they go to turbo mode, because if they do,then we cant pack
tasks. We would need certain cpus in that domain idle.
3. Are the domains in which we pack tasks power gated?
4. Will there be significant performance drop by packing? Meaning do the
tasks share cpu resources? If they do there will be severe contention.
The approach I suggest therefore would be to get the scheduler well in
sync with the eco system, then the patches posted so far will achieve
their goals more easily and with very few regressions because they are
well informed decisions.
Regards
Preeti U Murthy
> Best regards.
>
> --
> Catalin
>
On Fri, 7 Jun 2013, Preeti U Murthy wrote:
> Hi Catalin,
>
> On 06/07/2013 08:21 PM, Catalin Marinas wrote:
<SNIP>
>> Take the cpuidle example, it uses the load average of the CPUs,
>> however this load average is currently controlled by the scheduler
>> (load balance). Rather than using a load average that degrades over
>> time and gradually putting the CPU into deeper sleep states, the
>> scheduler could predict more accurately that a run-queue won't have
>> any work over the next x ms and ask for a deeper sleep state from the
>> beginning.
>
> How will the scheduler know that there will not be work in the near
> future? How will the scheduler ask for a deeper sleep state?
>
> My answer to the above two questions are, the scheduler cannot know how
> much work will come up. All it knows is the current load of the
> runqueues and the nature of the task (thanks to the PJT's metric). It
> can then match the task load to the cpu capacity and schedule the tasks
> on the appropriate cpus.
how will the cpuidle govenor know what will come up in the future?
the scheduler knows more than the current load on the runqueus, it tracks some
information about the past behavior of the process that it uses for it's
decisions. This is information that cpuidle doesn't have.
<SNIP>
> I don't see what the problem is with the cpuidle governor waiting for
> the load to degrade before putting that cpu to sleep. In my opinion,
> putting a cpu to deeper sleep states should happen gradually.
remember that it takes power and time to wake up a cpu to put it in a deeper
sleep state.
>> Of course, you could export more scheduler information to cpuidle,
>> various hooks (task wakeup etc.) but then we have another framework,
>> cpufreq. It also decides the CPU parameters (frequency) based on the
>> load controlled by the scheduler. Can cpufreq decide whether it's
>> better to keep the CPU at higher frequency so that it gets to idle
>> quicker and therefore deeper sleep states? I don't think it has enough
>> information because there are at least three deciding factors
>> (cpufreq, cpuidle and scheduler's load balancing) which are not
>> unified.
>
> Why not? When the cpu load is high, cpu frequency governor knows it has
> to boost the frequency of that CPU. The task gets over quickly, the CPU
> goes idle. Then the cpuidle governor kicks in to put the CPU to deeper
> sleep state gradually.
>
> Meanwhile the scheduler should ensure that the tasks are retained on
> that CPU,whose frequency is boosted and should not load balance it, so
> that they can get over quickly. This I think is what is missing. Again
> this comes down to the scheduler taking feedback from the CPU frequency
> governors which is not currently happening.
how should the scheduler know that the cpufreq governor decided to boost the
speed of one CPU to handle an important process as opposed to handling multiple
smaller processes?
the communication between the two is starting to sound really messy
David Lang
On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote:
> On 06/07/2013 08:21 PM, Catalin Marinas wrote:
> > I think you are missing Ingo's point. It's not about the scheduler
> > complying with decisions made by various governors in the kernel
> > (which may or may not have enough information) but rather the
> > scheduler being in a better position for making such decisions.
>
> My mail pointed out that I disagree with this design ("the scheduler
> being in a better position for making such decisions").
> I think it should be a 2 way co-operation. I have elaborated below.
>
> > Take the cpuidle example, it uses the load average of the CPUs,
> > however this load average is currently controlled by the scheduler
> > (load balance). Rather than using a load average that degrades over
> > time and gradually putting the CPU into deeper sleep states, the
> > scheduler could predict more accurately that a run-queue won't have
> > any work over the next x ms and ask for a deeper sleep state from the
> > beginning.
>
> How will the scheduler know that there will not be work in the near
> future? How will the scheduler ask for a deeper sleep state?
>
> My answer to the above two questions are, the scheduler cannot know how
> much work will come up. All it knows is the current load of the
> runqueues and the nature of the task (thanks to the PJT's metric). It
> can then match the task load to the cpu capacity and schedule the tasks
> on the appropriate cpus.
The scheduler can decide to load a single CPU or cluster and let the
others idle. If the total CPU load can fit into a smaller number of CPUs
it could as well tell cpuidle to go into deeper state from the
beginning as it moved all the tasks elsewhere.
Regarding future work, neither cpuidle nor the scheduler know this but
the scheduler would make a better prediction, for example by tracking
task periodicity.
> As a consequence, it leaves certain cpus idle. The load of these cpus
> degrade. It is via this load that the scheduler asks for a deeper sleep
> state. Right here we have scheduler talking to the cpuidle governor.
So we agree that the scheduler _tells_ the cpuidle governor when to go
idle (but not how deep). IOW, the scheduler drives the cpuidle
decisions. Two problems: (1) the cpuidle does not get enough information
from the scheduler (arguably this could be fixed) and (2) the scheduler
does not have any information about the idle states (power gating etc.)
to make any informed decision on which/when CPUs should go idle.
As you said, it is a non-optimal one-way communication but the solution
is not feedback loop from cpuidle into scheduler. It's like the
scheduler managed by chance to get the CPU into a deeper sleep state and
now you'd like the scheduler to get feedback form cpuidle and not
disturb that CPU anymore. That's the closed loop I disagree with. Could
the scheduler not make this informed decision before - it has this total
load, let's get this CPU into deeper sleep state?
> I don't see what the problem is with the cpuidle governor waiting for
> the load to degrade before putting that cpu to sleep. In my opinion,
> putting a cpu to deeper sleep states should happen gradually. This means
> time will tell the governors what kinds of workloads are running on the
> system. If the cpu is idle for long, it probably means that the system
> is less loaded and it makes sense to put the cpus to deeper sleep
> states. Of course there could be sporadic bursts or quieting down of
> tasks, but these are corner cases.
It's nothing wrong with degrading given the information that cpuidle
currently has. It's a heuristics that worked ok so far and may continue
to do so. But see my comments above on why the scheduler could make more
informed decisions.
We may not move all the power gating information to the scheduler but
maybe find a way to abstract this by giving more hints via the CPU and
cache topology. The cpuidle framework (it may not be much left of a
governor) would then take hints about estimated idle time and invoke the
low-level driver about the right C state.
> > Of course, you could export more scheduler information to cpuidle,
> > various hooks (task wakeup etc.) but then we have another framework,
> > cpufreq. It also decides the CPU parameters (frequency) based on the
> > load controlled by the scheduler. Can cpufreq decide whether it's
> > better to keep the CPU at higher frequency so that it gets to idle
> > quicker and therefore deeper sleep states? I don't think it has enough
> > information because there are at least three deciding factors
> > (cpufreq, cpuidle and scheduler's load balancing) which are not
> > unified.
>
> Why not? When the cpu load is high, cpu frequency governor knows it has
> to boost the frequency of that CPU. The task gets over quickly, the CPU
> goes idle. Then the cpuidle governor kicks in to put the CPU to deeper
> sleep state gradually.
The cpufreq governor boosts the frequency enough to cover the load,
which means reducing the idle time. It does not know whether it is
better to boost the frequency twice as high so that it gets to idle
quicker. You can change the governor's policy but does it have any
information from cpuidle?
> Meanwhile the scheduler should ensure that the tasks are retained on
> that CPU,whose frequency is boosted and should not load balance it, so
> that they can get over quickly. This I think is what is missing. Again
> this comes down to the scheduler taking feedback from the CPU frequency
> governors which is not currently happening.
Same loop again. The cpu load goes high because (a) there is more work,
possibly triggered by external events, and (b) the scheduler decided to
balance the CPUs in a certain way. As for cpuidle above, the scheduler
has direct influence on the cpufreq decisions. How would the scheduler
know which CPU not to balance against? Are CPUs in a cluster
synchronous? Is it better do let other CPU idle or more efficient to run
this cluster at half-speed?
Let's say there is an increase in the load, does the scheduler wait
until cpufreq figures this out or tries to take the other CPUs out of
idle? Who's making this decision? That's currently a potentially
unstable loop.
> >> I would repeat here that today we interface cpuidle/cpufrequency
> >> policies with scheduler but not the other way around. They do their bit
> >> when a cpu is busy/idle. However scheduler does not see that somebody
> >> else is taking instructions from it and comes back to give different
> >> instructions!
> >
> > The key here is that cpuidle/cpufreq make their primary decision based
> > on something controlled by the scheduler: the CPU load (via run-queue
> > balancing). You would then like the scheduler take such decision back
> > into account. It just looks like a closed loop, possibly 'unstable' .
>
> Why? Why would you call a scheduler->cpuidle->cpufrequency interaction a
> closed loop and not the new_scheduler = scheduler+cpuidle+cpufrequency a
> closed loop? Here too the scheduler should be made well aware of the
> decisions it took in the past right?
It's more like:
scheduler -> cpuidle/cpufreq -> hardware operating point
^ |
+--------------------------------------+
You can argue that you can make an adaptive loop that works fine but
there are so many parameters that I don't see how it would work. The
patches so far don't seem to address this. Small task packing, while
useful, it's some heuristics just at the scheduler level.
With a combined decision maker, you aim to reduce this separate decision
process and feedback loop. Probably impossible to eliminate the loop
completely because of hardware latencies, PLLs, CPU frequency not always
the main factor, but you can make the loop more tolerant to
instabilities.
> > So I think we either (a) come up with 'clearer' separation of
> > responsibilities between scheduler and cpufreq/cpuidle
>
> I agree with this. This is what I have been emphasizing, if we feel that
> the cpufrequency/ cpuidle subsystems are suboptimal in terms of the
> information that they use to make their decisions, let us improve them.
> But this will not yield us any improvement if the scheduler does not
> have enough information. And IMHO, the next fundamental information that
> the scheduler needs should come from cpufreq and cpuidle.
What kind of information? Your suggestion that the scheduler should
avoid loading a CPU because it went idle is wrong IMHO. It went idle
because the scheduler decided this in first instance.
> Then we should move onto supplying scheduler information from the power
> domain topology, thermal factors, user policies.
I agree with this but at this point you get the scheduler to make more
informed decisions about task placement. It can then give more precise
hints to cpufreq/cpuidle like the predicted load and those frameworks
could become dumber in time, just complying with the requested
performance level (trying to break the loop above).
> > or (b) come up
> > with a unified load-balancing/cpufreq/cpuidle implementation as per
> > Ingo's request. The latter is harder but, with a good design, has
> > potentially a lot more benefits.
> >
> > A possible implementation for (a) is to let the scheduler focus on
> > performance load-balancing but control the balance ratio from a
> > cpufreq governor (via things like arch_scale_freq_power() or something
> > new). CPUfreq would not be concerned just with individual CPU
> > load/frequency but also making a decision on how tasks are balanced
> > between CPUs based on the overall load (e.g. four CPUs are enough for
> > the current load, I can shut the other four off by telling the
> > scheduler not to use them).
> >
> > As for Ingo's preferred solution (b), a proposal forward could be to
> > factor the load balancing out of kernel/sched/fair.c and provide an
> > abstract interface (like load_class?) for easier extending or
> > different policies (e.g. small task packing).
>
> Let me elaborate on the patches that have been posted so far on the
> power awareness of the scheduler. When we say *power aware scheduler*
> what exactly do we want it to do?
>
> In my opinion, we want it to *avoid touching idle cpus*, so as to keep
> them in that state longer and *keep more power domains idle*, so as to
> yield power savings with them turned off. The patches released so far
> are striving to do the latter. Correct me if I am wrong at this.
Don't take me wrong, task packing to keep more power domains idle is
probably in the right direction but it may not address all issues. You
realised this is not enough since you are now asking for the scheduler
to take feedback from cpuidle. As I pointed out above, you try to create
a loop which may or may not work, especially given the wide variety of
hardware parameters.
> Also
> feel free to point out any other expectation from the power aware
> scheduler if I am missing any.
If the patches so far are enough and solved all the problems, you are
not missing any. Otherwise, please see my view above.
Please define clearly what the scheduler, cpufreq, cpuidle should be
doing and what communication should happen between them.
> If I have got Ingo's point right, the issues with them are that they are
> not taking a holistic approach to meet the said goal.
Probably because scheduler changes, cpufreq and cpuidle are all trying
to address the same thing but independent of each other and possibly
conflicting.
> Keeping more power
> domains idle (by packing tasks) would sound much better if the scheduler
> has taken all aspects of doing such a thing into account, like
>
> 1. How idle are the cpus, on the domain that it is packing
> 2. Can they go to turbo mode, because if they do,then we cant pack
> tasks. We would need certain cpus in that domain idle.
> 3. Are the domains in which we pack tasks power gated?
> 4. Will there be significant performance drop by packing? Meaning do the
> tasks share cpu resources? If they do there will be severe contention.
So by this you add a lot more information about the power configuration
into the scheduler, getting it to make more informed decisions about
task scheduling. You may eventually reach a point where cpuidle governor
doesn't have much to do (which may be a good thing) and reach Ingo's
goal.
That's why I suggested maybe starting to take the load balancing out of
fair.c and make it easily extensible (my opinion, the scheduler guys may
disagree). Then make it more aware of topology, power configuration so
that it makes the right task placement decision. You then get it to
tell cpufreq about the expected performance requirements (frequency
decided by cpufreq) and cpuidle about how long it could be idle for (you
detect a periodic task every 1ms, or you don't have any at all because
they were migrated, the right C state being decided by the governor).
Regards.
--
Catalin
On Saturday, June 08, 2013 12:28:04 PM Catalin Marinas wrote:
> On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote:
> > On 06/07/2013 08:21 PM, Catalin Marinas wrote:
> > > I think you are missing Ingo's point. It's not about the scheduler
> > > complying with decisions made by various governors in the kernel
> > > (which may or may not have enough information) but rather the
> > > scheduler being in a better position for making such decisions.
> >
> > My mail pointed out that I disagree with this design ("the scheduler
> > being in a better position for making such decisions").
> > I think it should be a 2 way co-operation. I have elaborated below.
I agree with that.
> > > Take the cpuidle example, it uses the load average of the CPUs,
> > > however this load average is currently controlled by the scheduler
> > > (load balance). Rather than using a load average that degrades over
> > > time and gradually putting the CPU into deeper sleep states, the
> > > scheduler could predict more accurately that a run-queue won't have
> > > any work over the next x ms and ask for a deeper sleep state from the
> > > beginning.
> >
> > How will the scheduler know that there will not be work in the near
> > future? How will the scheduler ask for a deeper sleep state?
> >
> > My answer to the above two questions are, the scheduler cannot know how
> > much work will come up. All it knows is the current load of the
> > runqueues and the nature of the task (thanks to the PJT's metric). It
> > can then match the task load to the cpu capacity and schedule the tasks
> > on the appropriate cpus.
>
> The scheduler can decide to load a single CPU or cluster and let the
> others idle. If the total CPU load can fit into a smaller number of CPUs
> it could as well tell cpuidle to go into deeper state from the
> beginning as it moved all the tasks elsewhere.
So why can't it do that today? What's the problem?
> Regarding future work, neither cpuidle nor the scheduler know this but
> the scheduler would make a better prediction, for example by tracking
> task periodicity.
Well, basically, two pieces of information are needed to make target idle
state selections: (1) when the CPU (core or package) is going to be used
next time and (2) how much latency for going back to the non-idle state
can be tolerated. While the scheduler knows (1) to some extent (arguably,
it generally cannot predict when hardware interrupts are going to occur),
I'm not really sure about (2).
> > As a consequence, it leaves certain cpus idle. The load of these cpus
> > degrade. It is via this load that the scheduler asks for a deeper sleep
> > state. Right here we have scheduler talking to the cpuidle governor.
>
> So we agree that the scheduler _tells_ the cpuidle governor when to go
> idle (but not how deep).
It does indicate to cpuidle how deep it can go, however, by providing it with
the information about when the CPU is going to be used next time (from the
scheduler's perspective).
> IOW, the scheduler drives the cpuidle decisions. Two problems: (1) the
> cpuidle does not get enough information from the scheduler (arguably this
> could be fixed)
OK, so what information is missing in your opinion?
> and (2) the scheduler does not have any information about the idle states
> (power gating etc.) to make any informed decision on which/when CPUs should
> go idle.
That's correct, which is a drawback. However, on some systems it may never
have that information (because hardware coordinates idle states in a way that
is opaque to the OS - e.g. by autopromoting deeper states when idle for
sufficiently long time) and on some systems that information may change over
time (i.e. the availablility of specific idle states may depend on factors
that aren't constant).
If you attempted to take all of the possible complications related to hardware
designs in that area in the scheduler, you'd end up with completely
unmaintainable piece of code.
> As you said, it is a non-optimal one-way communication but the solution
> is not feedback loop from cpuidle into scheduler. It's like the
> scheduler managed by chance to get the CPU into a deeper sleep state and
> now you'd like the scheduler to get feedback form cpuidle and not
> disturb that CPU anymore. That's the closed loop I disagree with. Could
> the scheduler not make this informed decision before - it has this total
> load, let's get this CPU into deeper sleep state?
No, it couldn't in general, for the above reasons.
> > I don't see what the problem is with the cpuidle governor waiting for
> > the load to degrade before putting that cpu to sleep. In my opinion,
> > putting a cpu to deeper sleep states should happen gradually.
If we know in advance that the CPU can be put into idle state Cn, there is no
reason to put it into anything shallower than that.
On the other hand, if the CPU is in Cn already and there is a possibility to
put it into a deeper low-power state (which we didn't know about before), it
may make sense to promote it into that state (if that's safe) or even wake it
up and idle it again.
> > This means time will tell the governors what kinds of workloads are running
> > on the system. If the cpu is idle for long, it probably means that the system
> > is less loaded and it makes sense to put the cpus to deeper sleep
> > states. Of course there could be sporadic bursts or quieting down of
> > tasks, but these are corner cases.
>
> It's nothing wrong with degrading given the information that cpuidle
> currently has. It's a heuristics that worked ok so far and may continue
> to do so. But see my comments above on why the scheduler could make more
> informed decisions.
>
> We may not move all the power gating information to the scheduler but
> maybe find a way to abstract this by giving more hints via the CPU and
> cache topology. The cpuidle framework (it may not be much left of a
> governor) would then take hints about estimated idle time and invoke the
> low-level driver about the right C state.
Overall, it looks like it'd be better to split the governor "layer" between the
scheduler and the idle driver with a well defined interface between them. That
interface needs to be general enough to be independent of the underlying
hardware.
We need to determine what kinds of information should be passed both ways and
how to represent it.
> > > Of course, you could export more scheduler information to cpuidle,
> > > various hooks (task wakeup etc.) but then we have another framework,
> > > cpufreq. It also decides the CPU parameters (frequency) based on the
> > > load controlled by the scheduler. Can cpufreq decide whether it's
> > > better to keep the CPU at higher frequency so that it gets to idle
> > > quicker and therefore deeper sleep states? I don't think it has enough
> > > information because there are at least three deciding factors
> > > (cpufreq, cpuidle and scheduler's load balancing) which are not
> > > unified.
> >
> > Why not? When the cpu load is high, cpu frequency governor knows it has
> > to boost the frequency of that CPU. The task gets over quickly, the CPU
> > goes idle. Then the cpuidle governor kicks in to put the CPU to deeper
> > sleep state gradually.
>
> The cpufreq governor boosts the frequency enough to cover the load,
> which means reducing the idle time. It does not know whether it is
> better to boost the frequency twice as high so that it gets to idle
> quicker. You can change the governor's policy but does it have any
> information from cpuidle?
Well, it may get that information directly from the hardware. Actually,
intel_pstate does that, but intel_pstate is the governor and the scaling
driver combined.
> > Meanwhile the scheduler should ensure that the tasks are retained on
> > that CPU,whose frequency is boosted and should not load balance it, so
> > that they can get over quickly. This I think is what is missing. Again
> > this comes down to the scheduler taking feedback from the CPU frequency
> > governors which is not currently happening.
>
> Same loop again. The cpu load goes high because (a) there is more work,
> possibly triggered by external events, and (b) the scheduler decided to
> balance the CPUs in a certain way. As for cpuidle above, the scheduler
> has direct influence on the cpufreq decisions. How would the scheduler
> know which CPU not to balance against? Are CPUs in a cluster
> synchronous? Is it better do let other CPU idle or more efficient to run
> this cluster at half-speed?
>
> Let's say there is an increase in the load, does the scheduler wait
> until cpufreq figures this out or tries to take the other CPUs out of
> idle? Who's making this decision? That's currently a potentially
> unstable loop.
Yes, it is and I don't think we currently have good answers here.
The results of many measurements seem to indicate that it generally is better
to do the work as quickly as possible and then go idle again, but there are
costs associated with going back and forth from idle to non-idle etc.
The main problem with cpufreq that I personally have is that the governors
carry out their own sampling with pretty much arbitrary resolution that may
lead to suboptimal decisions. It would be much better if the scheduler
indicated when to *consider* the changing of CPU performance parameters (that
may not be frequency alone and not even frequency at all in general), more or
less the same way it tells cpuidle about idle CPUs, but I'm not sure if it
should decide what performance points to run at.
> > >> I would repeat here that today we interface cpuidle/cpufrequency
> > >> policies with scheduler but not the other way around. They do their bit
> > >> when a cpu is busy/idle. However scheduler does not see that somebody
> > >> else is taking instructions from it and comes back to give different
> > >> instructions!
> > >
> > > The key here is that cpuidle/cpufreq make their primary decision based
> > > on something controlled by the scheduler: the CPU load (via run-queue
> > > balancing). You would then like the scheduler take such decision back
> > > into account. It just looks like a closed loop, possibly 'unstable' .
> >
> > Why? Why would you call a scheduler->cpuidle->cpufrequency interaction a
> > closed loop and not the new_scheduler = scheduler+cpuidle+cpufrequency a
> > closed loop? Here too the scheduler should be made well aware of the
> > decisions it took in the past right?
>
> It's more like:
>
> scheduler -> cpuidle/cpufreq -> hardware operating point
> ^ |
> +--------------------------------------+
>
> You can argue that you can make an adaptive loop that works fine but
> there are so many parameters that I don't see how it would work. The
> patches so far don't seem to address this. Small task packing, while
> useful, it's some heuristics just at the scheduler level.
I agree.
> With a combined decision maker, you aim to reduce this separate decision
> process and feedback loop. Probably impossible to eliminate the loop
> completely because of hardware latencies, PLLs, CPU frequency not always
> the main factor, but you can make the loop more tolerant to
> instabilities.
Well, in theory. :-)
Another question to ask is whether or not the structure of our software
reflects the underlying problem. I mean, on the one hand there is the
scheduler that needs to optimally assign work items to computational units
(hyperthreads, CPU cores, packages) and on the other hand there's hardware
with different capabilities (idle states, performance points etc.). Arguably,
the scheduler internals cannot cover all of the differences between all of the
existing types of hardware Linux can run on, so there needs to be a layer of
code providing an interface between the scheduler and the hardware. But that
layer of code needs to be just *one*, so why do we have *two* different
frameworks (cpuidle and cpufreq) that talk to the same hardware and kind of to
the scheduler, but not to each other?
To me, the reason is history, and more precisely the fact that cpufreq had been
there first, then came cpuidle and only then poeple started to realize that
some scheduler tweaks may allow us to save energy without sacrificing too
much performance. However, it looks like there's time to go back and see how
we can integrate all that. And there's more, because we may need to take power
budgets and thermal management into account as well (i.e. we may not be allowed
to use full performance of the processors all the time because of some
additional limitations) and the CPUs may be members of power domains, so what
we can do with them may depend on the states of other devices.
> > > So I think we either (a) come up with 'clearer' separation of
> > > responsibilities between scheduler and cpufreq/cpuidle
> >
> > I agree with this. This is what I have been emphasizing, if we feel that
> > the cpufrequency/ cpuidle subsystems are suboptimal in terms of the
> > information that they use to make their decisions, let us improve them.
> > But this will not yield us any improvement if the scheduler does not
> > have enough information. And IMHO, the next fundamental information that
> > the scheduler needs should come from cpufreq and cpuidle.
>
> What kind of information? Your suggestion that the scheduler should
> avoid loading a CPU because it went idle is wrong IMHO. It went idle
> because the scheduler decided this in first instance.
>
> > Then we should move onto supplying scheduler information from the power
> > domain topology, thermal factors, user policies.
>
> I agree with this but at this point you get the scheduler to make more
> informed decisions about task placement. It can then give more precise
> hints to cpufreq/cpuidle like the predicted load and those frameworks
> could become dumber in time, just complying with the requested
> performance level (trying to break the loop above).
Well, there's nothing like "predicted load". At best, we may be able to make
more or less educated guesses about it, so in my opinion it is better to use
the information about what happened in the past for making decisions regarding
the current settings and re-adjust them over time as we get more information.
So how much decision making regarding the idle state to put the given CPU into
should be there in the scheduler? I believe the only information coming out
of the scheduler regarding that should be "OK, this CPU is now idle and I'll
need it in X nanoseconds from now" plus possibly a hint about the wakeup
latency tolerance (but those hints may come from other places too). That said
the decision *which* CPU should become idle at the moment very well may require
some information about what options are available from the layer below (for
example, "putting core X into idle for Y of time will save us Z energy" or
something like that).
And what about performance scaling? Quite frankly, in my opinion that
requires some more investigation, because there still are some open questions
in that area. To start with we can just continue using the current heuristics,
but perhaps with the scheduler calling the scaling "governor" when it sees fit
instead of that "governor" running kind of in parallel with it.
> > > or (b) come up
> > > with a unified load-balancing/cpufreq/cpuidle implementation as per
> > > Ingo's request. The latter is harder but, with a good design, has
> > > potentially a lot more benefits.
> > >
> > > A possible implementation for (a) is to let the scheduler focus on
> > > performance load-balancing but control the balance ratio from a
> > > cpufreq governor (via things like arch_scale_freq_power() or something
> > > new). CPUfreq would not be concerned just with individual CPU
> > > load/frequency but also making a decision on how tasks are balanced
> > > between CPUs based on the overall load (e.g. four CPUs are enough for
> > > the current load, I can shut the other four off by telling the
> > > scheduler not to use them).
> > >
> > > As for Ingo's preferred solution (b), a proposal forward could be to
> > > factor the load balancing out of kernel/sched/fair.c and provide an
> > > abstract interface (like load_class?) for easier extending or
> > > different policies (e.g. small task packing).
> >
> > Let me elaborate on the patches that have been posted so far on the
> > power awareness of the scheduler. When we say *power aware scheduler*
> > what exactly do we want it to do?
> >
> > In my opinion, we want it to *avoid touching idle cpus*, so as to keep
> > them in that state longer and *keep more power domains idle*, so as to
> > yield power savings with them turned off. The patches released so far
> > are striving to do the latter. Correct me if I am wrong at this.
>
> Don't take me wrong, task packing to keep more power domains idle is
> probably in the right direction but it may not address all issues. You
> realised this is not enough since you are now asking for the scheduler
> to take feedback from cpuidle. As I pointed out above, you try to create
> a loop which may or may not work, especially given the wide variety of
> hardware parameters.
>
> > Also
> > feel free to point out any other expectation from the power aware
> > scheduler if I am missing any.
>
> If the patches so far are enough and solved all the problems, you are
> not missing any. Otherwise, please see my view above.
>
> Please define clearly what the scheduler, cpufreq, cpuidle should be
> doing and what communication should happen between them.
>
> > If I have got Ingo's point right, the issues with them are that they are
> > not taking a holistic approach to meet the said goal.
>
> Probably because scheduler changes, cpufreq and cpuidle are all trying
> to address the same thing but independent of each other and possibly
> conflicting.
>
> > Keeping more power
> > domains idle (by packing tasks) would sound much better if the scheduler
> > has taken all aspects of doing such a thing into account, like
> >
> > 1. How idle are the cpus, on the domain that it is packing
> > 2. Can they go to turbo mode, because if they do,then we cant pack
> > tasks. We would need certain cpus in that domain idle.
> > 3. Are the domains in which we pack tasks power gated?
> > 4. Will there be significant performance drop by packing? Meaning do the
> > tasks share cpu resources? If they do there will be severe contention.
>
> So by this you add a lot more information about the power configuration
> into the scheduler, getting it to make more informed decisions about
> task scheduling. You may eventually reach a point where cpuidle governor
> doesn't have much to do (which may be a good thing) and reach Ingo's
> goal.
>
> That's why I suggested maybe starting to take the load balancing out of
> fair.c and make it easily extensible (my opinion, the scheduler guys may
> disagree). Then make it more aware of topology, power configuration so
> that it makes the right task placement decision. You then get it to
> tell cpufreq about the expected performance requirements (frequency
> decided by cpufreq) and cpuidle about how long it could be idle for (you
> detect a periodic task every 1ms, or you don't have any at all because
> they were migrated, the right C state being decided by the governor).
There is another angle to look at that as I said somewhere above.
What if we could integrate cpuidle with cpufreq so that there is one code
layer representing what the hardware can do to the scheduler? What benefits
can we get from that, if any?
Rafael
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
Hi Rafael,
On 06/08/2013 07:32 PM, Rafael J. Wysocki wrote:
> On Saturday, June 08, 2013 12:28:04 PM Catalin Marinas wrote:
>> On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote:
>>> On 06/07/2013 08:21 PM, Catalin Marinas wrote:
>>>> I think you are missing Ingo's point. It's not about the scheduler
>>>> complying with decisions made by various governors in the kernel
>>>> (which may or may not have enough information) but rather the
>>>> scheduler being in a better position for making such decisions.
>>>
>>> My mail pointed out that I disagree with this design ("the scheduler
>>> being in a better position for making such decisions").
>>> I think it should be a 2 way co-operation. I have elaborated below.
>
> I agree with that.
>
>>>> Take the cpuidle example, it uses the load average of the CPUs,
>>>> however this load average is currently controlled by the scheduler
>>>> (load balance). Rather than using a load average that degrades over
>>>> time and gradually putting the CPU into deeper sleep states, the
>>>> scheduler could predict more accurately that a run-queue won't have
>>>> any work over the next x ms and ask for a deeper sleep state from the
>>>> beginning.
>>>
>>> How will the scheduler know that there will not be work in the near
>>> future? How will the scheduler ask for a deeper sleep state?
>>>
>>> My answer to the above two questions are, the scheduler cannot know how
>>> much work will come up. All it knows is the current load of the
>>> runqueues and the nature of the task (thanks to the PJT's metric). It
>>> can then match the task load to the cpu capacity and schedule the tasks
>>> on the appropriate cpus.
>>
>> The scheduler can decide to load a single CPU or cluster and let the
>> others idle. If the total CPU load can fit into a smaller number of CPUs
>> it could as well tell cpuidle to go into deeper state from the
>> beginning as it moved all the tasks elsewhere.
>
> So why can't it do that today? What's the problem?
The reason that scheduler does not do it today is due to the
prefer_sibling logic. The tasks within a core get distributed across
cores if they are more than 1, since the cpu power of a core is not high
enough to handle more than one task.
However at a socket level/ MC level (cluster at a low level), there can
be as many tasks as there are cores because the socket has enough CPU
capacity to handle them. But the prefer_sibling logic moves tasks across
socket/MC level domains even when load<=domain_capacity.
I think the reason why the prefer_sibling logic was introduced, is that
scheduler looks at spreading tasks across all the resources it has. It
believes keeping tasks within a cluster/socket level domain would mean
tasks are being throttled by having access to only the cluster/socket
level resources. Which is why it spreads.
The prefer_sibling logic is nothing but a flag set at domain level to
communicate to the scheduler that load should be spread across the
groups of this domain. In the above example across sockets/clusters.
But I think it is time we take another look at the prefer_sibling logic
and decide on its worthiness.
>
>> Regarding future work, neither cpuidle nor the scheduler know this but
>> the scheduler would make a better prediction, for example by tracking
>> task periodicity.
>
> Well, basically, two pieces of information are needed to make target idle
> state selections: (1) when the CPU (core or package) is going to be used
> next time and (2) how much latency for going back to the non-idle state
> can be tolerated. While the scheduler knows (1) to some extent (arguably,
> it generally cannot predict when hardware interrupts are going to occur),
> I'm not really sure about (2).
>
>>> As a consequence, it leaves certain cpus idle. The load of these cpus
>>> degrade. It is via this load that the scheduler asks for a deeper sleep
>>> state. Right here we have scheduler talking to the cpuidle governor.
>>
>> So we agree that the scheduler _tells_ the cpuidle governor when to go
>> idle (but not how deep).
>
> It does indicate to cpuidle how deep it can go, however, by providing it with
> the information about when the CPU is going to be used next time (from the
> scheduler's perspective).
>
>> IOW, the scheduler drives the cpuidle decisions. Two problems: (1) the
>> cpuidle does not get enough information from the scheduler (arguably this
>> could be fixed)
>
> OK, so what information is missing in your opinion?
>
>> and (2) the scheduler does not have any information about the idle states
>> (power gating etc.) to make any informed decision on which/when CPUs should
>> go idle.
>
> That's correct, which is a drawback. However, on some systems it may never
> have that information (because hardware coordinates idle states in a way that
> is opaque to the OS - e.g. by autopromoting deeper states when idle for
> sufficiently long time) and on some systems that information may change over
> time (i.e. the availablility of specific idle states may depend on factors
> that aren't constant).
>
> If you attempted to take all of the possible complications related to hardware
> designs in that area in the scheduler, you'd end up with completely
> unmaintainable piece of code.
>
>> As you said, it is a non-optimal one-way communication but the solution
>> is not feedback loop from cpuidle into scheduler. It's like the
>> scheduler managed by chance to get the CPU into a deeper sleep state and
>> now you'd like the scheduler to get feedback form cpuidle and not
>> disturb that CPU anymore. That's the closed loop I disagree with. Could
>> the scheduler not make this informed decision before - it has this total
>> load, let's get this CPU into deeper sleep state?
>
> No, it couldn't in general, for the above reasons.
>
>>> I don't see what the problem is with the cpuidle governor waiting for
>>> the load to degrade before putting that cpu to sleep. In my opinion,
>>> putting a cpu to deeper sleep states should happen gradually.
>
> If we know in advance that the CPU can be put into idle state Cn, there is no
> reason to put it into anything shallower than that.
>
> On the other hand, if the CPU is in Cn already and there is a possibility to
> put it into a deeper low-power state (which we didn't know about before), it
> may make sense to promote it into that state (if that's safe) or even wake it
> up and idle it again.
Yes, sorry I said it wrong in the previous mail. Today the cpuidle
governor is capable of putting a CPU in idle state Cn directly, by
looking at various factors like the current load, next timer, history of
interrupts, exit latency of states. At the end of this evaluation it
puts it into idle state Cn.
Also it cares to check if its decision is right. This is with respect to
your statement "if there is a possibility to put it into deeper low
power state". It queues a timer at a time just after its predicted wake
up time before putting the cpu to idle state. If this time of wakeup
prediction is wrong, this timer triggers to wake up the cpu and the cpu
is hence put into a deeper sleep state.
>
>>> This means time will tell the governors what kinds of workloads are running
>>> on the system. If the cpu is idle for long, it probably means that the system
>>> is less loaded and it makes sense to put the cpus to deeper sleep
>>> states. Of course there could be sporadic bursts or quieting down of
>>> tasks, but these are corner cases.
>>
>> It's nothing wrong with degrading given the information that cpuidle
>> currently has. It's a heuristics that worked ok so far and may continue
>> to do so. But see my comments above on why the scheduler could make more
>> informed decisions.
>>
>> We may not move all the power gating information to the scheduler but
>> maybe find a way to abstract this by giving more hints via the CPU and
>> cache topology. The cpuidle framework (it may not be much left of a
>> governor) would then take hints about estimated idle time and invoke the
>> low-level driver about the right C state.
>
> Overall, it looks like it'd be better to split the governor "layer" between the
> scheduler and the idle driver with a well defined interface between them. That
> interface needs to be general enough to be independent of the underlying
> hardware.
>
> We need to determine what kinds of information should be passed both ways and
> how to represent it.
I agree with this design decision.
>>>> Of course, you could export more scheduler information to cpuidle,
>>>> various hooks (task wakeup etc.) but then we have another framework,
>>>> cpufreq. It also decides the CPU parameters (frequency) based on the
>>>> load controlled by the scheduler. Can cpufreq decide whether it's
>>>> better to keep the CPU at higher frequency so that it gets to idle
>>>> quicker and therefore deeper sleep states? I don't think it has enough
>>>> information because there are at least three deciding factors
>>>> (cpufreq, cpuidle and scheduler's load balancing) which are not
>>>> unified.
>>>
>>> Why not? When the cpu load is high, cpu frequency governor knows it has
>>> to boost the frequency of that CPU. The task gets over quickly, the CPU
>>> goes idle. Then the cpuidle governor kicks in to put the CPU to deeper
>>> sleep state gradually.
>>
>> The cpufreq governor boosts the frequency enough to cover the load,
>> which means reducing the idle time. It does not know whether it is
>> better to boost the frequency twice as high so that it gets to idle
>> quicker. You can change the governor's policy but does it have any
>> information from cpuidle?
>
> Well, it may get that information directly from the hardware. Actually,
> intel_pstate does that, but intel_pstate is the governor and the scaling
> driver combined.
To add to this, cpufreq currently functions in the below fashion. I am
talking of the on demand governor, since it is more relevant to our
discussion.
----stepped up frequency------
----threshold--------
-----stepped down freq level1---
-----stepped down freq level2---
---stepped down freq level3----
If the cpu idle time is below a threshold , it boosts the frequency to
one level above straight away and does not vary it any further. If the
cpu idle time is below a threshold there is a step down in frequency
levels by 5% of the current frequency at every sampling period, provided
the cpu behavior is constant.
I think we can improve this implementation by better interaction with
cpuidle and scheduler.
When it is stepping up frequency, it should do it in steps of frequency
being a *function of the current cpu load* also, or function of idle
time will also do.
When it is stepping down frequency, it should interact with cpuidle. It
should get from cpuidle information regarding the idle state that the
cpu is in.The reason is cpufrequency governor is aware of only the idle
time of the cpu, not the idle state it is in. If it gets to know that
the cpu is in a deep idle state, it could step down frequency levels to
leveln straight away. Just like cpuidle does to put cpus into state Cn.
Or an alternate option could be just like stepping up, make the stepping
down also a function of idle time. Perhaps
fn(|threshold-idle_time|).
Also one more point to note is that if cpuidle puts cpus into such idle
states that clock gate the cpus, then there is no need for cpufrequency
governor for that cpu. cpufreq can check with cpuidle on this front
before it queries a cpu.
>
>>> Meanwhile the scheduler should ensure that the tasks are retained on
>>> that CPU,whose frequency is boosted and should not load balance it, so
>>> that they can get over quickly. This I think is what is missing. Again
>>> this comes down to the scheduler taking feedback from the CPU frequency
>>> governors which is not currently happening.
>>
>> Same loop again. The cpu load goes high because (a) there is more work,
>> possibly triggered by external events, and (b) the scheduler decided to
>> balance the CPUs in a certain way. As for cpuidle above, the scheduler
>> has direct influence on the cpufreq decisions. How would the scheduler
>> know which CPU not to balance against? Are CPUs in a cluster
>> synchronous? Is it better do let other CPU idle or more efficient to run
>> this cluster at half-speed?
>>
>> Let's say there is an increase in the load, does the scheduler wait
>> until cpufreq figures this out or tries to take the other CPUs out of
>> idle? Who's making this decision? That's currently a potentially
>> unstable loop.
>
> Yes, it is and I don't think we currently have good answers here.
My answer to the above question is scheduler does not wait until cpufreq
figures it out. All that the scheduler cares about today is load
balancing. Spread the load and hope it finishes soon. There is a
possibility today that even before cpu frequency governor can boost the
frequency of cpu, the scheduler can spread the load.
As for the second question it will wakeup idle cpus if it must to load
balance.
It is a good question asked: "does the scheduler wait until cpufreq
figures it out." Currently the answer is no, it does not communicate
with cpu frequency at all (except through cpu power, but that is the
good part of the story, so I will not get there now). But maybe we
should change this. I think we can do so the following way.
When can a scheduler talk to cpu frequency? It can do so under the below
circumstances:
1. Load is too high across the systems, all cpus are loaded, no chance
of load balancing. Therefore ask cpu frequency governor to step up
frequency to get improve performance.
2. The scheduler finds out that if it has to load balance, it has to do
so on cpus which are in deep idle state( Currently this logic is not
present, but worth getting it in). It then decides to increase the
frequency of the already loaded cpus to improve performance. It calls
cpu freq governor.
3. The scheduler finds out that if it has to load balance, it has to do
so on a different power domain which is idle currently(shallow/deep). It
thinks the better of it and calls cpu frequency governor to boost the
frequency of the cpus in the current domain.
While 2 and 3 depend on scheduler having knowledge about idle states and
power domains, which it currently does not have, 1 can be achieved with
the current code. scheduler keeps track of failed ld balancing efforts
with lb_failed. If it finds that while load balancing from busy group
failed (lb_failed > 0), it can call cpu freq governor to step up the cpu
frequency of this busy cpu group, with gov_check_cpu() in cpufrequency
governor code.
>
> The results of many measurements seem to indicate that it generally is better
> to do the work as quickly as possible and then go idle again, but there are
> costs associated with going back and forth from idle to non-idle etc.
I think we can even out the cost benefit of race to idle, by choosing to
do it wisely. Like for example if points 2 and 3 above are true (idle
cpus are in deep sleep states or need to ld balance on a different power
domain), then step up the frequency of the current working cpus and reap
its benefit.
>
> The main problem with cpufreq that I personally have is that the governors
> carry out their own sampling with pretty much arbitrary resolution that may
> lead to suboptimal decisions. It would be much better if the scheduler
> indicated when to *consider* the changing of CPU performance parameters (that
> may not be frequency alone and not even frequency at all in general), more or
> less the same way it tells cpuidle about idle CPUs, but I'm not sure if it
> should decide what performance points to run at.
Very true. See the points 1,2 and 3 above where I list out when
scheduler can call cpu frequency. Also an idea about how cpu frequency
governor can decide on the scaling frequency is stated above.
>
>>>>> I would repeat here that today we interface cpuidle/cpufrequency
>>>>> policies with scheduler but not the other way around. They do their bit
>>>>> when a cpu is busy/idle. However scheduler does not see that somebody
>>>>> else is taking instructions from it and comes back to give different
>>>>> instructions!
>>>>
>>>> The key here is that cpuidle/cpufreq make their primary decision based
>>>> on something controlled by the scheduler: the CPU load (via run-queue
>>>> balancing). You would then like the scheduler take such decision back
>>>> into account. It just looks like a closed loop, possibly 'unstable' .
>>>
>>> Why? Why would you call a scheduler->cpuidle->cpufrequency interaction a
>>> closed loop and not the new_scheduler = scheduler+cpuidle+cpufrequency a
>>> closed loop? Here too the scheduler should be made well aware of the
>>> decisions it took in the past right?
>>
>> It's more like:
>>
>> scheduler -> cpuidle/cpufreq -> hardware operating point
>> ^ |
>> +--------------------------------------+
>>
>> You can argue that you can make an adaptive loop that works fine but
>> there are so many parameters that I don't see how it would work. The
>> patches so far don't seem to address this. Small task packing, while
>> useful, it's some heuristics just at the scheduler level.
>
> I agree.
>
>> With a combined decision maker, you aim to reduce this separate decision
>> process and feedback loop. Probably impossible to eliminate the loop
>> completely because of hardware latencies, PLLs, CPU frequency not always
>> the main factor, but you can make the loop more tolerant to
>> instabilities.
>
> Well, in theory. :-)
>
> Another question to ask is whether or not the structure of our software
> reflects the underlying problem. I mean, on the one hand there is the
> scheduler that needs to optimally assign work items to computational units
> (hyperthreads, CPU cores, packages) and on the other hand there's hardware
> with different capabilities (idle states, performance points etc.). Arguably,
> the scheduler internals cannot cover all of the differences between all of the
> existing types of hardware Linux can run on, so there needs to be a layer of
> code providing an interface between the scheduler and the hardware. But that
> layer of code needs to be just *one*, so why do we have *two* different
> frameworks (cpuidle and cpufreq) that talk to the same hardware and kind of to
> the scheduler, but not to each other?
>
> To me, the reason is history, and more precisely the fact that cpufreq had been
> there first, then came cpuidle and only then poeple started to realize that
> some scheduler tweaks may allow us to save energy without sacrificing too
> much performance. However, it looks like there's time to go back and see how
> we can integrate all that. And there's more, because we may need to take power
> budgets and thermal management into account as well (i.e. we may not be allowed
> to use full performance of the processors all the time because of some
> additional limitations) and the CPUs may be members of power domains, so what
> we can do with them may depend on the states of other devices.
>
>>>> So I think we either (a) come up with 'clearer' separation of
>>>> responsibilities between scheduler and cpufreq/cpuidle
>>>
>>> I agree with this. This is what I have been emphasizing, if we feel that
>>> the cpufrequency/ cpuidle subsystems are suboptimal in terms of the
>>> information that they use to make their decisions, let us improve them.
>>> But this will not yield us any improvement if the scheduler does not
>>> have enough information. And IMHO, the next fundamental information that
>>> the scheduler needs should come from cpufreq and cpuidle.
>>
>> What kind of information? Your suggestion that the scheduler should
>> avoid loading a CPU because it went idle is wrong IMHO. It went idle
>> because the scheduler decided this in first instance.
>>
>>> Then we should move onto supplying scheduler information from the power
>>> domain topology, thermal factors, user policies.
>>
>> I agree with this but at this point you get the scheduler to make more
>> informed decisions about task placement. It can then give more precise
>> hints to cpufreq/cpuidle like the predicted load and those frameworks
>> could become dumber in time, just complying with the requested
>> performance level (trying to break the loop above).
>
> Well, there's nothing like "predicted load". At best, we may be able to make
> more or less educated guesses about it, so in my opinion it is better to use
> the information about what happened in the past for making decisions regarding
> the current settings and re-adjust them over time as we get more information.
Agree with this as well. scheduler can at best supply information
regarding the historic load and hope that it is what defines the future
as well. Apart from this I dont know what other information scheduler
can supply cpuidle governor with.
>
> So how much decision making regarding the idle state to put the given CPU into
> should be there in the scheduler? I believe the only information coming out
> of the scheduler regarding that should be "OK, this CPU is now idle and I'll
> need it in X nanoseconds from now" plus possibly a hint about the wakeup
> latency tolerance (but those hints may come from other places too). That said
> the decision *which* CPU should become idle at the moment very well may require
> some information about what options are available from the layer below (for
> example, "putting core X into idle for Y of time will save us Z energy" or
> something like that).
Agree. Except that the information should be "Ok , this CPU is now idle
and it has not done much work in the recent past,it is a 10% loaded CPU".
This can be said today using PJT's metric. It is now for the cpuidle
governor to decide the idle state to go to. Thats what happens today too.
>
> And what about performance scaling? Quite frankly, in my opinion that
> requires some more investigation, because there still are some open questions
> in that area. To start with we can just continue using the current heuristics,
> but perhaps with the scheduler calling the scaling "governor" when it sees fit
> instead of that "governor" running kind of in parallel with it.
Exactly. How this can be done is elaborated above. This is one of the
key things we need today,IMHO.
>
>>>> or (b) come up
>>>> with a unified load-balancing/cpufreq/cpuidle implementation as per
>>>> Ingo's request. The latter is harder but, with a good design, has
>>>> potentially a lot more benefits.
>>>>
>>>> A possible implementation for (a) is to let the scheduler focus on
>>>> performance load-balancing but control the balance ratio from a
>>>> cpufreq governor (via things like arch_scale_freq_power() or something
>>>> new). CPUfreq would not be concerned just with individual CPU
>>>> load/frequency but also making a decision on how tasks are balanced
>>>> between CPUs based on the overall load (e.g. four CPUs are enough for
>>>> the current load, I can shut the other four off by telling the
>>>> scheduler not to use them).
>>>>
>>>> As for Ingo's preferred solution (b), a proposal forward could be to
>>>> factor the load balancing out of kernel/sched/fair.c and provide an
>>>> abstract interface (like load_class?) for easier extending or
>>>> different policies (e.g. small task packing).
>>>
>>> Let me elaborate on the patches that have been posted so far on the
>>> power awareness of the scheduler. When we say *power aware scheduler*
>>> what exactly do we want it to do?
>>>
>>> In my opinion, we want it to *avoid touching idle cpus*, so as to keep
>>> them in that state longer and *keep more power domains idle*, so as to
>>> yield power savings with them turned off. The patches released so far
>>> are striving to do the latter. Correct me if I am wrong at this.
>>
>> Don't take me wrong, task packing to keep more power domains idle is
>> probably in the right direction but it may not address all issues. You
>> realised this is not enough since you are now asking for the scheduler
>> to take feedback from cpuidle. As I pointed out above, you try to create
>> a loop which may or may not work, especially given the wide variety of
>> hardware parameters.
>>
>>> Also
>>> feel free to point out any other expectation from the power aware
>>> scheduler if I am missing any.
>>
>> If the patches so far are enough and solved all the problems, you are
>> not missing any. Otherwise, please see my view above.
>>
>> Please define clearly what the scheduler, cpufreq, cpuidle should be
>> doing and what communication should happen between them.
>>
>>> If I have got Ingo's point right, the issues with them are that they are
>>> not taking a holistic approach to meet the said goal.
>>
>> Probably because scheduler changes, cpufreq and cpuidle are all trying
>> to address the same thing but independent of each other and possibly
>> conflicting.
>>
>>> Keeping more power
>>> domains idle (by packing tasks) would sound much better if the scheduler
>>> has taken all aspects of doing such a thing into account, like
>>>
>>> 1. How idle are the cpus, on the domain that it is packing
>>> 2. Can they go to turbo mode, because if they do,then we cant pack
>>> tasks. We would need certain cpus in that domain idle.
>>> 3. Are the domains in which we pack tasks power gated?
>>> 4. Will there be significant performance drop by packing? Meaning do the
>>> tasks share cpu resources? If they do there will be severe contention.
>>
>> So by this you add a lot more information about the power configuration
>> into the scheduler, getting it to make more informed decisions about
>> task scheduling. You may eventually reach a point where cpuidle governor
>> doesn't have much to do (which may be a good thing) and reach Ingo's
>> goal.
>>
>> That's why I suggested maybe starting to take the load balancing out of
>> fair.c and make it easily extensible (my opinion, the scheduler guys may
>> disagree). Then make it more aware of topology, power configuration so
>> that it makes the right task placement decision. You then get it to
>> tell cpufreq about the expected performance requirements (frequency
>> decided by cpufreq) and cpuidle about how long it could be idle for (you
>> detect a periodic task every 1ms, or you don't have any at all because
>> they were migrated, the right C state being decided by the governor).
>
> There is another angle to look at that as I said somewhere above.
>
> What if we could integrate cpuidle with cpufreq so that there is one code
> layer representing what the hardware can do to the scheduler? What benefits
> can we get from that, if any?
We could debate on this point. I am a bit confused about this. As I see
it, there is no problem with keeping them separately. One, because of
code readability; it is easy to understand what are the different
parameters that the performance of CPU depends on, without needing to
dig through the code. Two, because cpu frequency kicks in during runtime
primarily and cpuidle during idle time of the cpu.
But this would also mean creating well defined interfaces between them.
Integrating cpufreq and cpuidle seems like a better argument to make due
to their common functionality at a higher level of talking to hardware
and tuning the performance parameters of cpu. But I disagree that
scheduler should be put into this common framework as well as it has
functionalities which are totally disjoint from what subsystems such as
cpuidle and cpufreq are intended to do.
>
> Rafael
>
>
Regards
Preeti U Murthy
Hi Catalin,
On 06/08/2013 04:58 PM, Catalin Marinas wrote:
> On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote:
>> On 06/07/2013 08:21 PM, Catalin Marinas wrote:
>>> I think you are missing Ingo's point. It's not about the scheduler
>>> complying with decisions made by various governors in the kernel
>>> (which may or may not have enough information) but rather the
>>> scheduler being in a better position for making such decisions.
>>
>> My mail pointed out that I disagree with this design ("the scheduler
>> being in a better position for making such decisions").
>> I think it should be a 2 way co-operation. I have elaborated below.
>>
>>> Take the cpuidle example, it uses the load average of the CPUs,
>>> however this load average is currently controlled by the scheduler
>>> (load balance). Rather than using a load average that degrades over
>>> time and gradually putting the CPU into deeper sleep states, the
>>> scheduler could predict more accurately that a run-queue won't have
>>> any work over the next x ms and ask for a deeper sleep state from the
>>> beginning.
>>
>> How will the scheduler know that there will not be work in the near
>> future? How will the scheduler ask for a deeper sleep state?
>>
>> My answer to the above two questions are, the scheduler cannot know how
>> much work will come up. All it knows is the current load of the
>> runqueues and the nature of the task (thanks to the PJT's metric). It
>> can then match the task load to the cpu capacity and schedule the tasks
>> on the appropriate cpus.
>
> The scheduler can decide to load a single CPU or cluster and let the
> others idle. If the total CPU load can fit into a smaller number of CPUs
> it could as well tell cpuidle to go into deeper state from the
> beginning as it moved all the tasks elsewhere.
This currently does not happen. I have elaborated in the response to
Rafael's mail. Sorry I should have put you on the 'To' list, missed
that. Do take a look at that mail since many of the replies to your
current mail are in it.
What do you mean "from the beginning"? As soon as those cpus go idle,
cpuidle will kick in anyway. If you are saying that scheduler should
tell cpuidle that "this cpu can go into deep sleep state x, since I am
not going to use it for the next y seconds", that is not possible.
Firstly, because scheduler can't "predict" this 'y' parameter. Secondly
because hardware could change the idle state availibility or details
dynamically as Rafael pointed out and hence this 'x' is best not to be
told by the scheduler, but be queried by cpuidle governor by itself.
>
> Regarding future work, neither cpuidle nor the scheduler know this but
> the scheduler would make a better prediction, for example by tracking
> task periodicity.
This prediction that you mention scheduler already exports it to
cpuidle. load_avg does precisely that, it tracks history and predicts
the future based on this. load_avg being tracked by scheduler
periodically is already seen by cpuidle governor.
>
>> As a consequence, it leaves certain cpus idle. The load of these cpus
>> degrade. It is via this load that the scheduler asks for a deeper sleep
>> state. Right here we have scheduler talking to the cpuidle governor.
>
> So we agree that the scheduler _tells_ the cpuidle governor when to go
> idle (but not how deep). IOW, the scheduler drives the cpuidle
> decisions. Two problems: (1) the cpuidle does not get enough information
> from the scheduler (arguably this could be fixed) and (2) the scheduler
> does not have any information about the idle states (power gating etc.)
> to make any informed decision on which/when CPUs should go idle.
>
> As you said, it is a non-optimal one-way communication but the solution
> is not feedback loop from cpuidle into scheduler. It's like the
> scheduler managed by chance to get the CPU into a deeper sleep state and
> now you'd like the scheduler to get feedback form cpuidle and not
> disturb that CPU anymore. That's the closed loop I disagree with. Could
> the scheduler not make this informed decision before - it has this total
> load, let's get this CPU into deeper sleep state?
Lets say the scheduler does make an informed decision before, with lets
get this cpu into idle state. Then what? Say the load begins to increase
on the system. The scheduler has to wake up cpus. Which cpus to wake up
best? Who tells scheduler this? One, the power gating information which
is yet to be exported to the scheduler can tell scheduler this to an
extent. As far as I can see the next person to guide the scheduler here
is cpuidle, isnt it?
>
>> I don't see what the problem is with the cpuidle governor waiting for
>> the load to degrade before putting that cpu to sleep. In my opinion,
>> putting a cpu to deeper sleep states should happen gradually. This means
>> time will tell the governors what kinds of workloads are running on the
>> system. If the cpu is idle for long, it probably means that the system
>> is less loaded and it makes sense to put the cpus to deeper sleep
>> states. Of course there could be sporadic bursts or quieting down of
>> tasks, but these are corner cases.
>
> It's nothing wrong with degrading given the information that cpuidle
> currently has. It's a heuristics that worked ok so far and may continue
> to do so. But see my comments above on why the scheduler could make more
> informed decisions.
scheduler can certainly make more informed decisions like:
1. Dont wakup idle cpus
2. Dont wake up cpus in a different power domain
3. Do not move task away from cpus in turbo mode.
These are a few. See how all of them require scheduler to talk to
cpufreq and cpuidle to find out? Can you list how scheduler can make
informed decision without getting information from them?
For this you may say that which is why we need to get all the decision
making into the scheduler. But I disagree because integrating cpuidle
and cpufreq governing seems fine, because at a high level their
functionality is the same; that being querying the hardware and deciding
what is best for cpus. But thats not the case with scheduler. Its
primary aim is to make sure there are enough resources for the tasks,
that it is able to see the topology of cpus and load balance bottom up,
do fair scheduling within a cpu and so on. Why would you want to add
more complexity to it?
>
> We may not move all the power gating information to the scheduler but
> maybe find a way to abstract this by giving more hints via the CPU and
> cache topology.
Correct.Power gating and topology information should best be in
scheduler primarily because this information is no where else and
secondly because scheduling domains and groups topology were created
specifically for the scheduler.
> The cpuidle framework (it may not be much left of a
> governor) would then take hints about estimated idle time and invoke the
> low-level driver about the right C state.
This happens today.
>
>>> Of course, you could export more scheduler information to cpuidle,
>>> various hooks (task wakeup etc.) but then we have another framework,
>>> cpufreq. It also decides the CPU parameters (frequency) based on the
>>> load controlled by the scheduler. Can cpufreq decide whether it's
>>> better to keep the CPU at higher frequency so that it gets to idle
>>> quicker and therefore deeper sleep states? I don't think it has enough
>>> information because there are at least three deciding factors
>>> (cpufreq, cpuidle and scheduler's load balancing) which are not
>>> unified.
>>
>> Why not? When the cpu load is high, cpu frequency governor knows it has
>> to boost the frequency of that CPU. The task gets over quickly, the CPU
>> goes idle. Then the cpuidle governor kicks in to put the CPU to deeper
>> sleep state gradually.
>
> The cpufreq governor boosts the frequency enough to cover the load,
> which means reducing the idle time. It does not know whether it is
> better to boost the frequency twice as high so that it gets to idle
> quicker. You can change the governor's policy but does it have any
> information from cpuidle?
This I have elaborated in the response to Rafael's mail.
>
>> Meanwhile the scheduler should ensure that the tasks are retained on
>> that CPU,whose frequency is boosted and should not load balance it, so
>> that they can get over quickly. This I think is what is missing. Again
>> this comes down to the scheduler taking feedback from the CPU frequency
>> governors which is not currently happening.
>
> Same loop again. The cpu load goes high because (a) there is more work,
> possibly triggered by external events, and (b) the scheduler decided to
> balance the CPUs in a certain way. As for cpuidle above, the scheduler
> has direct influence on the cpufreq decisions. How would the scheduler
> know which CPU not to balance against? Are CPUs in a cluster
> synchronous? Is it better do let other CPU idle or more efficient to run
> this cluster at half-speed?
>
> Let's say there is an increase in the load, does the scheduler wait
> until cpufreq figures this out or tries to take the other CPUs out of
> idle? Who's making this decision? That's currently a potentially
> unstable loop.
>
The answers to the above as I see it are in my response to Rafael's
mail. I don't intend to duplicate the replies, hence I would be glad if
you could read through that mail and give your feedback on the same.
>>>> I would repeat here that today we interface cpuidle/cpufrequency
>>>> policies with scheduler but not the other way around. They do their bit
>>>> when a cpu is busy/idle. However scheduler does not see that somebody
>>>> else is taking instructions from it and comes back to give different
>>>> instructions!
>>>
>>> The key here is that cpuidle/cpufreq make their primary decision based
>>> on something controlled by the scheduler: the CPU load (via run-queue
>>> balancing). You would then like the scheduler take such decision back
>>> into account. It just looks like a closed loop, possibly 'unstable' .
>>
>> Why? Why would you call a scheduler->cpuidle->cpufrequency interaction a
>> closed loop and not the new_scheduler = scheduler+cpuidle+cpufrequency a
>> closed loop? Here too the scheduler should be made well aware of the
>> decisions it took in the past right?
>
> It's more like:
>
> scheduler -> cpuidle/cpufreq -> hardware operating point
> ^ |
> +--------------------------------------+
>
> You can argue that you can make an adaptive loop that works fine but
> there are so many parameters that I don't see how it would work. The
> patches so far don't seem to address this. Small task packing, while
> useful, it's some heuristics just at the scheduler level.
Correct. That is the issue with them and we need to rectify that.
>
> With a combined decision maker, you aim to reduce this separate decision
> process and feedback loop. Probably impossible to eliminate the loop
> completely because of hardware latencies, PLLs, CPU frequency not always
> the main factor, but you can make the loop more tolerant to
> instabilities.
I dont see how we can break the above loop that you have drawn and I
dont think it is a good idea to merge scheduler and cpuidle/cpufreq into
one for reasons mentioned above.
>
>>> So I think we either (a) come up with 'clearer' separation of
>>> responsibilities between scheduler and cpufreq/cpuidle
>>
>> I agree with this. This is what I have been emphasizing, if we feel that
>> the cpufrequency/ cpuidle subsystems are suboptimal in terms of the
>> information that they use to make their decisions, let us improve them.
>> But this will not yield us any improvement if the scheduler does not
>> have enough information. And IMHO, the next fundamental information that
>> the scheduler needs should come from cpufreq and cpuidle.
>
> What kind of information? Your suggestion that the scheduler should
> avoid loading a CPU because it went idle is wrong IMHO. It went idle
> because the scheduler decided this in first instance.
With regard to cpu idle, which idle state a CPU is in and with regard to
cpu freq, when to call it. The former is detailed above and latter is
detailed in my response to Rafael's mail.
>
>> Then we should move onto supplying scheduler information from the power
>> domain topology, thermal factors, user policies.
>
> I agree with this but at this point you get the scheduler to make more
> informed decisions about task placement. It can then give more precise
> hints to cpufreq/cpuidle like the predicted load and those frameworks
> could become dumber in time, just complying with the requested
> performance level (trying to break the loop above).
>
>>> or (b) come up
>>> with a unified load-balancing/cpufreq/cpuidle implementation as per
>>> Ingo's request. The latter is harder but, with a good design, has
>>> potentially a lot more benefits.
>>>
>>> A possible implementation for (a) is to let the scheduler focus on
>>> performance load-balancing but control the balance ratio from a
>>> cpufreq governor (via things like arch_scale_freq_power() or something
>>> new). CPUfreq would not be concerned just with individual CPU
>>> load/frequency but also making a decision on how tasks are balanced
>>> between CPUs based on the overall load (e.g. four CPUs are enough for
>>> the current load, I can shut the other four off by telling the
>>> scheduler not to use them).
>>>
>>> As for Ingo's preferred solution (b), a proposal forward could be to
>>> factor the load balancing out of kernel/sched/fair.c and provide an
>>> abstract interface (like load_class?) for easier extending or
>>> different policies (e.g. small task packing).
>>
>> Let me elaborate on the patches that have been posted so far on the
>> power awareness of the scheduler. When we say *power aware scheduler*
>> what exactly do we want it to do?
>>
>> In my opinion, we want it to *avoid touching idle cpus*, so as to keep
>> them in that state longer and *keep more power domains idle*, so as to
>> yield power savings with them turned off. The patches released so far
>> are striving to do the latter. Correct me if I am wrong at this.
>
> Don't take me wrong, task packing to keep more power domains idle is
> probably in the right direction but it may not address all issues. You
> realised this is not enough since you are now asking for the scheduler
> to take feedback from cpuidle. As I pointed out above, you try to create
> a loop which may or may not work, especially given the wide variety of
> hardware parameters.
>
>> Also
>> feel free to point out any other expectation from the power aware
>> scheduler if I am missing any.
>
> If the patches so far are enough and solved all the problems, you are
> not missing any. Otherwise, please see my view above.
>
> Please define clearly what the scheduler, cpufreq, cpuidle should be
> doing and what communication should happen between them.
This I have to an extent elaborated in this mail and in the response to
Rafael's.
>
>> If I have got Ingo's point right, the issues with them are that they are
>> not taking a holistic approach to meet the said goal.
>
> Probably because scheduler changes, cpufreq and cpuidle are all trying
> to address the same thing but independent of each other and possibly
> conflicting.
>
>> Keeping more power
>> domains idle (by packing tasks) would sound much better if the scheduler
>> has taken all aspects of doing such a thing into account, like
>>
>> 1. How idle are the cpus, on the domain that it is packing
>> 2. Can they go to turbo mode, because if they do,then we cant pack
>> tasks. We would need certain cpus in that domain idle.
>> 3. Are the domains in which we pack tasks power gated?
>> 4. Will there be significant performance drop by packing? Meaning do the
>> tasks share cpu resources? If they do there will be severe contention.
>
> So by this you add a lot more information about the power configuration
> into the scheduler, getting it to make more informed decisions about
> task scheduling. You may eventually reach a point where cpuidle governor
> doesn't have much to do (which may be a good thing) and reach Ingo's
> goal.
>
> That's why I suggested maybe starting to take the load balancing out of
> fair.c and make it easily extensible (my opinion, the scheduler guys may
> disagree). Then make it more aware of topology, power configuration so
> that it makes the right task placement decision. You then get it to
> tell cpufreq about the expected performance requirements (frequency
> decided by cpufreq) and cpuidle about how long it could be idle for (you
> detect a periodic task every 1ms, or you don't have any at all because
> they were migrated, the right C state being decided by the governor).
>
All the above questions have been addressed above.
> Regards.
>
Regards
Preeti U Murthy
Hi David,
On 06/07/2013 11:06 PM, David Lang wrote:
> On Fri, 7 Jun 2013, Preeti U Murthy wrote:
>
>> Hi Catalin,
>>
>> On 06/07/2013 08:21 PM, Catalin Marinas wrote:
> <SNIP>
>>> Take the cpuidle example, it uses the load average of the CPUs,
>>> however this load average is currently controlled by the scheduler
>>> (load balance). Rather than using a load average that degrades over
>>> time and gradually putting the CPU into deeper sleep states, the
>>> scheduler could predict more accurately that a run-queue won't have
>>> any work over the next x ms and ask for a deeper sleep state from the
>>> beginning.
>>
>> How will the scheduler know that there will not be work in the near
>> future? How will the scheduler ask for a deeper sleep state?
>>
>> My answer to the above two questions are, the scheduler cannot know how
>> much work will come up. All it knows is the current load of the
>> runqueues and the nature of the task (thanks to the PJT's metric). It
>> can then match the task load to the cpu capacity and schedule the tasks
>> on the appropriate cpus.
>
> how will the cpuidle govenor know what will come up in the future?
>
> the scheduler knows more than the current load on the runqueus, it
> tracks some information about the past behavior of the process that it
> uses for it's decisions. This is information that cpuidle doesn't have.
This is incorrect. The scheduler knows the possible future load on a cpu
due to past behavior, thats right, and so does cpuidle today. It queries
the load average for predicted idle time and compares this with exit
latencies of the idle states.
>
> <SNIP>
>> I don't see what the problem is with the cpuidle governor waiting for
>> the load to degrade before putting that cpu to sleep. In my opinion,
>> putting a cpu to deeper sleep states should happen gradually.
>
> remember that it takes power and time to wake up a cpu to put it in a
> deeper sleep state.
Correct. I apologise in saying that it does it gradually. This is not
entirely right. cpuidle governor can decide on the state the cpu is best
put into directly without going through the shallow idle states.
It also takes care to rectify any incorrect prediction. So there is no
exit-enter-exit-enter sub optimal implementation.
>
>>> Of course, you could export more scheduler information to cpuidle,
>>> various hooks (task wakeup etc.) but then we have another framework,
>>> cpufreq. It also decides the CPU parameters (frequency) based on the
>>> load controlled by the scheduler. Can cpufreq decide whether it's
>>> better to keep the CPU at higher frequency so that it gets to idle
>>> quicker and therefore deeper sleep states? I don't think it has enough
>>> information because there are at least three deciding factors
>>> (cpufreq, cpuidle and scheduler's load balancing) which are not
>>> unified.
>>
>> Why not? When the cpu load is high, cpu frequency governor knows it has
>> to boost the frequency of that CPU. The task gets over quickly, the CPU
>> goes idle. Then the cpuidle governor kicks in to put the CPU to deeper
>> sleep state gradually.
>>
>> Meanwhile the scheduler should ensure that the tasks are retained on
>> that CPU,whose frequency is boosted and should not load balance it, so
>> that they can get over quickly. This I think is what is missing. Again
>> this comes down to the scheduler taking feedback from the CPU frequency
>> governors which is not currently happening.
>
> how should the scheduler know that the cpufreq governor decided to boost
> the speed of one CPU to handle an important process as opposed to
> handling multiple smaller processes?
This has been elaborated in my response to Rafael's mail. Scheduler
decides to call cpu frequency governor when it sees fit. Then cpu
frequency governor boosts the frequency of that cpu. cpu_power will now
match the task load. So scheduler will not move the task away from that
cpu since load does not exceed cpu capacity. So scheduler knows in this way.
> the communication between the two is starting to sound really messy
>
Not really. More is elaborated in responses to Catalin and Rafael's mails.
Regards
Preeti U Murthy
> David Lang
>
Hi Preeti,
(trimming lots of text, hopefully to make it easier to follow)
On Sun, Jun 09, 2013 at 04:42:18AM +0100, Preeti U Murthy wrote:
> On 06/08/2013 07:32 PM, Rafael J. Wysocki wrote:
> > On Saturday, June 08, 2013 12:28:04 PM Catalin Marinas wrote:
> >> On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote:
> >>> Meanwhile the scheduler should ensure that the tasks are retained on
> >>> that CPU,whose frequency is boosted and should not load balance it, so
> >>> that they can get over quickly. This I think is what is missing. Again
> >>> this comes down to the scheduler taking feedback from the CPU frequency
> >>> governors which is not currently happening.
> >>
> >> Same loop again. The cpu load goes high because (a) there is more work,
> >> possibly triggered by external events, and (b) the scheduler decided to
> >> balance the CPUs in a certain way. As for cpuidle above, the scheduler
> >> has direct influence on the cpufreq decisions. How would the scheduler
> >> know which CPU not to balance against? Are CPUs in a cluster
> >> synchronous? Is it better do let other CPU idle or more efficient to run
> >> this cluster at half-speed?
> >>
> >> Let's say there is an increase in the load, does the scheduler wait
> >> until cpufreq figures this out or tries to take the other CPUs out of
> >> idle? Who's making this decision? That's currently a potentially
> >> unstable loop.
> >
> > Yes, it is and I don't think we currently have good answers here.
>
> My answer to the above question is scheduler does not wait until cpufreq
> figures it out. All that the scheduler cares about today is load
> balancing. Spread the load and hope it finishes soon. There is a
> possibility today that even before cpu frequency governor can boost the
> frequency of cpu, the scheduler can spread the load.
>
> As for the second question it will wakeup idle cpus if it must to load
> balance.
That's exactly my point. Such behaviour can become unstable (it probably
won't oscillate but it affects the power or performance).
> It is a good question asked: "does the scheduler wait until cpufreq
> figures it out." Currently the answer is no, it does not communicate
> with cpu frequency at all (except through cpu power, but that is the
> good part of the story, so I will not get there now). But maybe we
> should change this. I think we can do so the following way.
>
> When can a scheduler talk to cpu frequency? It can do so under the below
> circumstances:
>
> 1. Load is too high across the systems, all cpus are loaded, no chance
> of load balancing. Therefore ask cpu frequency governor to step up
> frequency to get improve performance.
Too high or too low loads across the whole system are relatively simple
scenarios: for the former boost the frequency (cpufreq can do this on
its own, the scheduler has nowhere to balance anyway), for the latter
pack small tasks (or other heuristics).
But the bigger issue is where some CPUs are idle while others are
running at a smaller frequency. With the current implementation it is
even hard to get into this asymmetric state (some cluster loaded while
the other in deep sleep) unless the load is low and you apply some small
task packing patch.
> 2. The scheduler finds out that if it has to load balance, it has to do
> so on cpus which are in deep idle state( Currently this logic is not
> present, but worth getting it in). It then decides to increase the
> frequency of the already loaded cpus to improve performance. It calls
> cpu freq governor.
So you say that the scheduler decides to increase the frequency of the
already loaded cpus to improve performance. Doesn't this mean that the
scheduler takes on some of the responsibilities of cpufreq? You now add
logic about boosting CPU frequency to the scheduler.
What's even more problematic is that cpufreq has policies decided by the
user (or pre-configured OS policies) but the scheduler is not aware of
them. Let's say the user wants a more conservative cpufreq policy, how
long should the scheduler wait for cpufreq to boost the frequency before
waking idle CPUs?
There are many questions like above. I'm not looking for specific
answers but rather trying get a higher level clear view of the
responsibilities of the three main factors contributing to
power/performance: load balancing (scheduler), cpufreq and cpuidle.
> 3. The scheduler finds out that if it has to load balance, it has to do
> so on a different power domain which is idle currently(shallow/deep). It
> thinks the better of it and calls cpu frequency governor to boost the
> frequency of the cpus in the current domain.
As for 2, the scheduler would make power decisions. Then why don't make
a unified implementation? Or remove such decisions from the scheduler.
> > The results of many measurements seem to indicate that it generally is better
> > to do the work as quickly as possible and then go idle again, but there are
> > costs associated with going back and forth from idle to non-idle etc.
>
> I think we can even out the cost benefit of race to idle, by choosing to
> do it wisely. Like for example if points 2 and 3 above are true (idle
> cpus are in deep sleep states or need to ld balance on a different power
> domain), then step up the frequency of the current working cpus and reap
> its benefit.
And such decision would be made by ...? I guess the scheduler again.
> > And what about performance scaling? Quite frankly, in my opinion that
> > requires some more investigation, because there still are some open questions
> > in that area. To start with we can just continue using the current heuristics,
> > but perhaps with the scheduler calling the scaling "governor" when it sees fit
> > instead of that "governor" running kind of in parallel with it.
>
> Exactly. How this can be done is elaborated above. This is one of the
> key things we need today,IMHO.
The scheduler asking the cpufreq governor of what it needs is a too
simplistic view IMHO. What if the governor is conservative? How much
does the scheduler wait until the feedback loop reacts (CPU frequency
raised increasing the idle time so that the scheduler eventually
measures a smaller load)?
The scheduler could get more direct feedback from cpufreq like "I'll get
to this frequency in x ms" or not at all but then the scheduler needs to
make another power-related decision on whether to wait (be conservative)
or wake up an idle CPU. Do you want to add various power policies at the
scheduler level just to match the cpufreq ones?
> >> That's why I suggested maybe starting to take the load balancing out of
> >> fair.c and make it easily extensible (my opinion, the scheduler guys may
> >> disagree). Then make it more aware of topology, power configuration so
> >> that it makes the right task placement decision. You then get it to
> >> tell cpufreq about the expected performance requirements (frequency
> >> decided by cpufreq) and cpuidle about how long it could be idle for (you
> >> detect a periodic task every 1ms, or you don't have any at all because
> >> they were migrated, the right C state being decided by the governor).
> >
> > There is another angle to look at that as I said somewhere above.
> >
> > What if we could integrate cpuidle with cpufreq so that there is one code
> > layer representing what the hardware can do to the scheduler? What benefits
> > can we get from that, if any?
>
> We could debate on this point. I am a bit confused about this. As I see
> it, there is no problem with keeping them separately. One, because of
> code readability; it is easy to understand what are the different
> parameters that the performance of CPU depends on, without needing to
> dig through the code. Two, because cpu frequency kicks in during runtime
> primarily and cpuidle during idle time of the cpu.
>
> But this would also mean creating well defined interfaces between them.
> Integrating cpufreq and cpuidle seems like a better argument to make due
> to their common functionality at a higher level of talking to hardware
> and tuning the performance parameters of cpu. But I disagree that
> scheduler should be put into this common framework as well as it has
> functionalities which are totally disjoint from what subsystems such as
> cpuidle and cpufreq are intended to do.
It's not about the whole scheduler but rather the load balancing, task
placement. You can try to create well defined interfaces between them
but first of all let's define clearly what responsibilities each of the
three frameworks have.
As I said in my first email on this subject, we could:
a) let the scheduler focus on performance only but control (restrict)
the load balancing from cpufreq. For example via cpu_power, a value
of 0 meaning don't balance against it. Cpufreq changes the frequency
based on the load and may allow the scheduler to use idle CPUs. Such
approach requires closer collaboration between cpufreq and cpuidle
(possibly even merging them) and cpufreq needs to become even more
aware of CPU topology.
or:
b) Merge the load balancer and cpufreq together (could leave cpuidle
out initially) with a new design.
Any other proposals are welcome. So far they were either tweaks in
various places (small task packing) or are relatively vague (like we
need two-way communication between cpuidle and scheduler).
Best regards.
--
Catalin
On 06/09/2013 05:42 AM, Preeti U Murthy wrote:
> Hi Rafael,
>
> On 06/08/2013 07:32 PM, Rafael J. Wysocki wrote:
>> On Saturday, June 08, 2013 12:28:04 PM Catalin Marinas wrote:
>>> On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote:
>>>> On 06/07/2013 08:21 PM, Catalin Marinas wrote:
>>>>> I think you are missing Ingo's point. It's not about the scheduler
>>>>> complying with decisions made by various governors in the kernel
>>>>> (which may or may not have enough information) but rather the
>>>>> scheduler being in a better position for making such decisions.
>>>>
>>>> My mail pointed out that I disagree with this design ("the scheduler
>>>> being in a better position for making such decisions").
>>>> I think it should be a 2 way co-operation. I have elaborated below.
>>
>> I agree with that.
>>
>>>>> Take the cpuidle example, it uses the load average of the CPUs,
>>>>> however this load average is currently controlled by the scheduler
>>>>> (load balance). Rather than using a load average that degrades over
>>>>> time and gradually putting the CPU into deeper sleep states, the
>>>>> scheduler could predict more accurately that a run-queue won't have
>>>>> any work over the next x ms and ask for a deeper sleep state from the
>>>>> beginning.
>>>>
>>>> How will the scheduler know that there will not be work in the near
>>>> future? How will the scheduler ask for a deeper sleep state?
>>>>
>>>> My answer to the above two questions are, the scheduler cannot know how
>>>> much work will come up. All it knows is the current load of the
>>>> runqueues and the nature of the task (thanks to the PJT's metric). It
>>>> can then match the task load to the cpu capacity and schedule the tasks
>>>> on the appropriate cpus.
>>>
>>> The scheduler can decide to load a single CPU or cluster and let the
>>> others idle. If the total CPU load can fit into a smaller number of CPUs
>>> it could as well tell cpuidle to go into deeper state from the
>>> beginning as it moved all the tasks elsewhere.
>>
>> So why can't it do that today? What's the problem?
>
> The reason that scheduler does not do it today is due to the
> prefer_sibling logic. The tasks within a core get distributed across
> cores if they are more than 1, since the cpu power of a core is not high
> enough to handle more than one task.
>
> However at a socket level/ MC level (cluster at a low level), there can
> be as many tasks as there are cores because the socket has enough CPU
> capacity to handle them. But the prefer_sibling logic moves tasks across
> socket/MC level domains even when load<=domain_capacity.
>
> I think the reason why the prefer_sibling logic was introduced, is that
> scheduler looks at spreading tasks across all the resources it has. It
> believes keeping tasks within a cluster/socket level domain would mean
> tasks are being throttled by having access to only the cluster/socket
> level resources. Which is why it spreads.
>
> The prefer_sibling logic is nothing but a flag set at domain level to
> communicate to the scheduler that load should be spread across the
> groups of this domain. In the above example across sockets/clusters.
>
> But I think it is time we take another look at the prefer_sibling logic
> and decide on its worthiness.
>
>>
>>> Regarding future work, neither cpuidle nor the scheduler know this but
>>> the scheduler would make a better prediction, for example by tracking
>>> task periodicity.
>>
>> Well, basically, two pieces of information are needed to make target idle
>> state selections: (1) when the CPU (core or package) is going to be used
>> next time and (2) how much latency for going back to the non-idle state
>> can be tolerated. While the scheduler knows (1) to some extent (arguably,
>> it generally cannot predict when hardware interrupts are going to occur),
>> I'm not really sure about (2).
>>
>>>> As a consequence, it leaves certain cpus idle. The load of these cpus
>>>> degrade. It is via this load that the scheduler asks for a deeper sleep
>>>> state. Right here we have scheduler talking to the cpuidle governor.
>>>
>>> So we agree that the scheduler _tells_ the cpuidle governor when to go
>>> idle (but not how deep).
>>
>> It does indicate to cpuidle how deep it can go, however, by providing it with
>> the information about when the CPU is going to be used next time (from the
>> scheduler's perspective).
>>
>>> IOW, the scheduler drives the cpuidle decisions. Two problems: (1) the
>>> cpuidle does not get enough information from the scheduler (arguably this
>>> could be fixed)
>>
>> OK, so what information is missing in your opinion?
>>
>>> and (2) the scheduler does not have any information about the idle states
>>> (power gating etc.) to make any informed decision on which/when CPUs should
>>> go idle.
>>
>> That's correct, which is a drawback. However, on some systems it may never
>> have that information (because hardware coordinates idle states in a way that
>> is opaque to the OS - e.g. by autopromoting deeper states when idle for
>> sufficiently long time) and on some systems that information may change over
>> time (i.e. the availablility of specific idle states may depend on factors
>> that aren't constant).
>>
>> If you attempted to take all of the possible complications related to hardware
>> designs in that area in the scheduler, you'd end up with completely
>> unmaintainable piece of code.
>>
>>> As you said, it is a non-optimal one-way communication but the solution
>>> is not feedback loop from cpuidle into scheduler. It's like the
>>> scheduler managed by chance to get the CPU into a deeper sleep state and
>>> now you'd like the scheduler to get feedback form cpuidle and not
>>> disturb that CPU anymore. That's the closed loop I disagree with. Could
>>> the scheduler not make this informed decision before - it has this total
>>> load, let's get this CPU into deeper sleep state?
>>
>> No, it couldn't in general, for the above reasons.
>>
>>>> I don't see what the problem is with the cpuidle governor waiting for
>>>> the load to degrade before putting that cpu to sleep. In my opinion,
>>>> putting a cpu to deeper sleep states should happen gradually.
>>
>> If we know in advance that the CPU can be put into idle state Cn, there is no
>> reason to put it into anything shallower than that.
>>
>> On the other hand, if the CPU is in Cn already and there is a possibility to
>> put it into a deeper low-power state (which we didn't know about before), it
>> may make sense to promote it into that state (if that's safe) or even wake it
>> up and idle it again.
>
> Yes, sorry I said it wrong in the previous mail. Today the cpuidle
> governor is capable of putting a CPU in idle state Cn directly, by
> looking at various factors like the current load, next timer, history of
> interrupts, exit latency of states. At the end of this evaluation it
> puts it into idle state Cn.
>
> Also it cares to check if its decision is right. This is with respect to
> your statement "if there is a possibility to put it into deeper low
> power state". It queues a timer at a time just after its predicted wake
> up time before putting the cpu to idle state. If this time of wakeup
> prediction is wrong, this timer triggers to wake up the cpu and the cpu
> is hence put into a deeper sleep state.
Some SoC can have a cluster of cpus sharing some resources, eg cache, so
they must enter the same state at the same moment. Beside the
synchronization mechanisms, that adds a dependency with the next event.
For example, the u8500 board has a couple of cpus. In order to make them
to enter in retention, both must enter the same state, but not necessary
at the same moment. The first cpu will wait in WFI and the second one
will initiate the retention mode when entering to this state.
Unfortunately, some time could have passed while the second cpu entered
this state and the next event for the first cpu could be too close, thus
violating the criteria of the governor when it choose this state for the
second cpu.
Also the latencies could change with the frequencies, so there is a
dependency with cpufreq, the lesser the frequency is, the higher the
latency is. If the scheduler takes the decision to go to a specific
state assuming the exit latency is a given duration, if the frequency
decrease, this exit latency could increase also and lead the system to
be less responsive.
I don't know, how were made the latencies computation (eg. worst case,
taken with the lower frequency or not) but we have just one set of
values. That should happen with the current code.
Another point is the timer allowing to detect bad decision and go to a
deep idle state. With the cluster dependency described above, we may
wake up a particular cpu, which turns on the cluster and make the entire
cluster to wake up in order to enter a deeper state, which could fail
because of the other cpu may not fulfill the constraint at this moment.
>>>> This means time will tell the governors what kinds of workloads are running
>>>> on the system. If the cpu is idle for long, it probably means that the system
>>>> is less loaded and it makes sense to put the cpus to deeper sleep
>>>> states. Of course there could be sporadic bursts or quieting down of
>>>> tasks, but these are corner cases.
>>>
>>> It's nothing wrong with degrading given the information that cpuidle
>>> currently has. It's a heuristics that worked ok so far and may continue
>>> to do so. But see my comments above on why the scheduler could make more
>>> informed decisions.
>>>
>>> We may not move all the power gating information to the scheduler but
>>> maybe find a way to abstract this by giving more hints via the CPU and
>>> cache topology. The cpuidle framework (it may not be much left of a
>>> governor) would then take hints about estimated idle time and invoke the
>>> low-level driver about the right C state.
>>
>> Overall, it looks like it'd be better to split the governor "layer" between the
>> scheduler and the idle driver with a well defined interface between them. That
>> interface needs to be general enough to be independent of the underlying
>> hardware.
>>
>> We need to determine what kinds of information should be passed both ways and
>> how to represent it.
>
> I agree with this design decision.
>
>>>>> Of course, you could export more scheduler information to cpuidle,
>>>>> various hooks (task wakeup etc.) but then we have another framework,
>>>>> cpufreq. It also decides the CPU parameters (frequency) based on the
>>>>> load controlled by the scheduler. Can cpufreq decide whether it's
>>>>> better to keep the CPU at higher frequency so that it gets to idle
>>>>> quicker and therefore deeper sleep states? I don't think it has enough
>>>>> information because there are at least three deciding factors
>>>>> (cpufreq, cpuidle and scheduler's load balancing) which are not
>>>>> unified.
>>>>
>>>> Why not? When the cpu load is high, cpu frequency governor knows it has
>>>> to boost the frequency of that CPU. The task gets over quickly, the CPU
>>>> goes idle. Then the cpuidle governor kicks in to put the CPU to deeper
>>>> sleep state gradually.
>>>
>>> The cpufreq governor boosts the frequency enough to cover the load,
>>> which means reducing the idle time. It does not know whether it is
>>> better to boost the frequency twice as high so that it gets to idle
>>> quicker. You can change the governor's policy but does it have any
>>> information from cpuidle?
>>
>> Well, it may get that information directly from the hardware. Actually,
>> intel_pstate does that, but intel_pstate is the governor and the scaling
>> driver combined.
>
> To add to this, cpufreq currently functions in the below fashion. I am
> talking of the on demand governor, since it is more relevant to our
> discussion.
>
> ----stepped up frequency------
> ----threshold--------
> -----stepped down freq level1---
> -----stepped down freq level2---
> ---stepped down freq level3----
>
> If the cpu idle time is below a threshold , it boosts the frequency to
> one level above straight away and does not vary it any further. If the
> cpu idle time is below a threshold there is a step down in frequency
> levels by 5% of the current frequency at every sampling period, provided
> the cpu behavior is constant.
>
> I think we can improve this implementation by better interaction with
> cpuidle and scheduler.
>
> When it is stepping up frequency, it should do it in steps of frequency
> being a *function of the current cpu load* also, or function of idle
> time will also do.
>
> When it is stepping down frequency, it should interact with cpuidle. It
> should get from cpuidle information regarding the idle state that the
> cpu is in.The reason is cpufrequency governor is aware of only the idle
> time of the cpu, not the idle state it is in. If it gets to know that
> the cpu is in a deep idle state, it could step down frequency levels to
> leveln straight away. Just like cpuidle does to put cpus into state Cn.
>
> Or an alternate option could be just like stepping up, make the stepping
> down also a function of idle time. Perhaps
> fn(|threshold-idle_time|).
>
> Also one more point to note is that if cpuidle puts cpus into such idle
> states that clock gate the cpus, then there is no need for cpufrequency
> governor for that cpu. cpufreq can check with cpuidle on this front
> before it queries a cpu.
>
>>
>>>> Meanwhile the scheduler should ensure that the tasks are retained on
>>>> that CPU,whose frequency is boosted and should not load balance it, so
>>>> that they can get over quickly. This I think is what is missing. Again
>>>> this comes down to the scheduler taking feedback from the CPU frequency
>>>> governors which is not currently happening.
>>>
>>> Same loop again. The cpu load goes high because (a) there is more work,
>>> possibly triggered by external events, and (b) the scheduler decided to
>>> balance the CPUs in a certain way. As for cpuidle above, the scheduler
>>> has direct influence on the cpufreq decisions. How would the scheduler
>>> know which CPU not to balance against? Are CPUs in a cluster
>>> synchronous? Is it better do let other CPU idle or more efficient to run
>>> this cluster at half-speed?
>>>
>>> Let's say there is an increase in the load, does the scheduler wait
>>> until cpufreq figures this out or tries to take the other CPUs out of
>>> idle? Who's making this decision? That's currently a potentially
>>> unstable loop.
>>
>> Yes, it is and I don't think we currently have good answers here.
>
> My answer to the above question is scheduler does not wait until cpufreq
> figures it out. All that the scheduler cares about today is load
> balancing. Spread the load and hope it finishes soon. There is a
> possibility today that even before cpu frequency governor can boost the
> frequency of cpu, the scheduler can spread the load.
>
> As for the second question it will wakeup idle cpus if it must to load
> balance.
>
> It is a good question asked: "does the scheduler wait until cpufreq
> figures it out." Currently the answer is no, it does not communicate
> with cpu frequency at all (except through cpu power, but that is the
> good part of the story, so I will not get there now). But maybe we
> should change this. I think we can do so the following way.
>
> When can a scheduler talk to cpu frequency? It can do so under the below
> circumstances:
>
> 1. Load is too high across the systems, all cpus are loaded, no chance
> of load balancing. Therefore ask cpu frequency governor to step up
> frequency to get improve performance.
>
> 2. The scheduler finds out that if it has to load balance, it has to do
> so on cpus which are in deep idle state( Currently this logic is not
> present, but worth getting it in). It then decides to increase the
> frequency of the already loaded cpus to improve performance. It calls
> cpu freq governor.
>
> 3. The scheduler finds out that if it has to load balance, it has to do
> so on a different power domain which is idle currently(shallow/deep). It
> thinks the better of it and calls cpu frequency governor to boost the
> frequency of the cpus in the current domain.
>
> While 2 and 3 depend on scheduler having knowledge about idle states and
> power domains, which it currently does not have, 1 can be achieved with
> the current code. scheduler keeps track of failed ld balancing efforts
> with lb_failed. If it finds that while load balancing from busy group
> failed (lb_failed > 0), it can call cpu freq governor to step up the cpu
> frequency of this busy cpu group, with gov_check_cpu() in cpufrequency
> governor code.
>
>>
>> The results of many measurements seem to indicate that it generally is better
>> to do the work as quickly as possible and then go idle again, but there are
>> costs associated with going back and forth from idle to non-idle etc.
>
> I think we can even out the cost benefit of race to idle, by choosing to
> do it wisely. Like for example if points 2 and 3 above are true (idle
> cpus are in deep sleep states or need to ld balance on a different power
> domain), then step up the frequency of the current working cpus and reap
> its benefit.
>
>>
>> The main problem with cpufreq that I personally have is that the governors
>> carry out their own sampling with pretty much arbitrary resolution that may
>> lead to suboptimal decisions. It would be much better if the scheduler
>> indicated when to *consider* the changing of CPU performance parameters (that
>> may not be frequency alone and not even frequency at all in general), more or
>> less the same way it tells cpuidle about idle CPUs, but I'm not sure if it
>> should decide what performance points to run at.
>
> Very true. See the points 1,2 and 3 above where I list out when
> scheduler can call cpu frequency. Also an idea about how cpu frequency
> governor can decide on the scaling frequency is stated above.
>
>>
>>>>>> I would repeat here that today we interface cpuidle/cpufrequency
>>>>>> policies with scheduler but not the other way around. They do their bit
>>>>>> when a cpu is busy/idle. However scheduler does not see that somebody
>>>>>> else is taking instructions from it and comes back to give different
>>>>>> instructions!
>>>>>
>>>>> The key here is that cpuidle/cpufreq make their primary decision based
>>>>> on something controlled by the scheduler: the CPU load (via run-queue
>>>>> balancing). You would then like the scheduler take such decision back
>>>>> into account. It just looks like a closed loop, possibly 'unstable' .
>>>>
>>>> Why? Why would you call a scheduler->cpuidle->cpufrequency interaction a
>>>> closed loop and not the new_scheduler = scheduler+cpuidle+cpufrequency a
>>>> closed loop? Here too the scheduler should be made well aware of the
>>>> decisions it took in the past right?
>>>
>>> It's more like:
>>>
>>> scheduler -> cpuidle/cpufreq -> hardware operating point
>>> ^ |
>>> +--------------------------------------+
>>>
>>> You can argue that you can make an adaptive loop that works fine but
>>> there are so many parameters that I don't see how it would work. The
>>> patches so far don't seem to address this. Small task packing, while
>>> useful, it's some heuristics just at the scheduler level.
>>
>> I agree.
>>
>>> With a combined decision maker, you aim to reduce this separate decision
>>> process and feedback loop. Probably impossible to eliminate the loop
>>> completely because of hardware latencies, PLLs, CPU frequency not always
>>> the main factor, but you can make the loop more tolerant to
>>> instabilities.
>>
>> Well, in theory. :-)
>>
>> Another question to ask is whether or not the structure of our software
>> reflects the underlying problem. I mean, on the one hand there is the
>> scheduler that needs to optimally assign work items to computational units
>> (hyperthreads, CPU cores, packages) and on the other hand there's hardware
>> with different capabilities (idle states, performance points etc.). Arguably,
>> the scheduler internals cannot cover all of the differences between all of the
>> existing types of hardware Linux can run on, so there needs to be a layer of
>> code providing an interface between the scheduler and the hardware. But that
>> layer of code needs to be just *one*, so why do we have *two* different
>> frameworks (cpuidle and cpufreq) that talk to the same hardware and kind of to
>> the scheduler, but not to each other?
>>
>> To me, the reason is history, and more precisely the fact that cpufreq had been
>> there first, then came cpuidle and only then poeple started to realize that
>> some scheduler tweaks may allow us to save energy without sacrificing too
>> much performance. However, it looks like there's time to go back and see how
>> we can integrate all that. And there's more, because we may need to take power
>> budgets and thermal management into account as well (i.e. we may not be allowed
>> to use full performance of the processors all the time because of some
>> additional limitations) and the CPUs may be members of power domains, so what
>> we can do with them may depend on the states of other devices.
>>
>>>>> So I think we either (a) come up with 'clearer' separation of
>>>>> responsibilities between scheduler and cpufreq/cpuidle
>>>>
>>>> I agree with this. This is what I have been emphasizing, if we feel that
>>>> the cpufrequency/ cpuidle subsystems are suboptimal in terms of the
>>>> information that they use to make their decisions, let us improve them.
>>>> But this will not yield us any improvement if the scheduler does not
>>>> have enough information. And IMHO, the next fundamental information that
>>>> the scheduler needs should come from cpufreq and cpuidle.
>>>
>>> What kind of information? Your suggestion that the scheduler should
>>> avoid loading a CPU because it went idle is wrong IMHO. It went idle
>>> because the scheduler decided this in first instance.
>>>
>>>> Then we should move onto supplying scheduler information from the power
>>>> domain topology, thermal factors, user policies.
>>>
>>> I agree with this but at this point you get the scheduler to make more
>>> informed decisions about task placement. It can then give more precise
>>> hints to cpufreq/cpuidle like the predicted load and those frameworks
>>> could become dumber in time, just complying with the requested
>>> performance level (trying to break the loop above).
>>
>> Well, there's nothing like "predicted load". At best, we may be able to make
>> more or less educated guesses about it, so in my opinion it is better to use
>> the information about what happened in the past for making decisions regarding
>> the current settings and re-adjust them over time as we get more information.
>
> Agree with this as well. scheduler can at best supply information
> regarding the historic load and hope that it is what defines the future
> as well. Apart from this I dont know what other information scheduler
> can supply cpuidle governor with.
>>
>> So how much decision making regarding the idle state to put the given CPU into
>> should be there in the scheduler? I believe the only information coming out
>> of the scheduler regarding that should be "OK, this CPU is now idle and I'll
>> need it in X nanoseconds from now" plus possibly a hint about the wakeup
>> latency tolerance (but those hints may come from other places too). That said
>> the decision *which* CPU should become idle at the moment very well may require
>> some information about what options are available from the layer below (for
>> example, "putting core X into idle for Y of time will save us Z energy" or
>> something like that).
>
> Agree. Except that the information should be "Ok , this CPU is now idle
> and it has not done much work in the recent past,it is a 10% loaded CPU".
>
> This can be said today using PJT's metric. It is now for the cpuidle
> governor to decide the idle state to go to. Thats what happens today too.
>
>>
>> And what about performance scaling? Quite frankly, in my opinion that
>> requires some more investigation, because there still are some open questions
>> in that area. To start with we can just continue using the current heuristics,
>> but perhaps with the scheduler calling the scaling "governor" when it sees fit
>> instead of that "governor" running kind of in parallel with it.
>
> Exactly. How this can be done is elaborated above. This is one of the
> key things we need today,IMHO.
>
>>
>>>>> or (b) come up
>>>>> with a unified load-balancing/cpufreq/cpuidle implementation as per
>>>>> Ingo's request. The latter is harder but, with a good design, has
>>>>> potentially a lot more benefits.
>>>>>
>>>>> A possible implementation for (a) is to let the scheduler focus on
>>>>> performance load-balancing but control the balance ratio from a
>>>>> cpufreq governor (via things like arch_scale_freq_power() or something
>>>>> new). CPUfreq would not be concerned just with individual CPU
>>>>> load/frequency but also making a decision on how tasks are balanced
>>>>> between CPUs based on the overall load (e.g. four CPUs are enough for
>>>>> the current load, I can shut the other four off by telling the
>>>>> scheduler not to use them).
>>>>>
>>>>> As for Ingo's preferred solution (b), a proposal forward could be to
>>>>> factor the load balancing out of kernel/sched/fair.c and provide an
>>>>> abstract interface (like load_class?) for easier extending or
>>>>> different policies (e.g. small task packing).
>>>>
>>>> Let me elaborate on the patches that have been posted so far on the
>>>> power awareness of the scheduler. When we say *power aware scheduler*
>>>> what exactly do we want it to do?
>>>>
>>>> In my opinion, we want it to *avoid touching idle cpus*, so as to keep
>>>> them in that state longer and *keep more power domains idle*, so as to
>>>> yield power savings with them turned off. The patches released so far
>>>> are striving to do the latter. Correct me if I am wrong at this.
>>>
>>> Don't take me wrong, task packing to keep more power domains idle is
>>> probably in the right direction but it may not address all issues. You
>>> realised this is not enough since you are now asking for the scheduler
>>> to take feedback from cpuidle. As I pointed out above, you try to create
>>> a loop which may or may not work, especially given the wide variety of
>>> hardware parameters.
>>>
>>>> Also
>>>> feel free to point out any other expectation from the power aware
>>>> scheduler if I am missing any.
>>>
>>> If the patches so far are enough and solved all the problems, you are
>>> not missing any. Otherwise, please see my view above.
>>>
>>> Please define clearly what the scheduler, cpufreq, cpuidle should be
>>> doing and what communication should happen between them.
>>>
>>>> If I have got Ingo's point right, the issues with them are that they are
>>>> not taking a holistic approach to meet the said goal.
>>>
>>> Probably because scheduler changes, cpufreq and cpuidle are all trying
>>> to address the same thing but independent of each other and possibly
>>> conflicting.
>>>
>>>> Keeping more power
>>>> domains idle (by packing tasks) would sound much better if the scheduler
>>>> has taken all aspects of doing such a thing into account, like
>>>>
>>>> 1. How idle are the cpus, on the domain that it is packing
>>>> 2. Can they go to turbo mode, because if they do,then we cant pack
>>>> tasks. We would need certain cpus in that domain idle.
>>>> 3. Are the domains in which we pack tasks power gated?
>>>> 4. Will there be significant performance drop by packing? Meaning do the
>>>> tasks share cpu resources? If they do there will be severe contention.
>>>
>>> So by this you add a lot more information about the power configuration
>>> into the scheduler, getting it to make more informed decisions about
>>> task scheduling. You may eventually reach a point where cpuidle governor
>>> doesn't have much to do (which may be a good thing) and reach Ingo's
>>> goal.
>>>
>>> That's why I suggested maybe starting to take the load balancing out of
>>> fair.c and make it easily extensible (my opinion, the scheduler guys may
>>> disagree). Then make it more aware of topology, power configuration so
>>> that it makes the right task placement decision. You then get it to
>>> tell cpufreq about the expected performance requirements (frequency
>>> decided by cpufreq) and cpuidle about how long it could be idle for (you
>>> detect a periodic task every 1ms, or you don't have any at all because
>>> they were migrated, the right C state being decided by the governor).
>>
>> There is another angle to look at that as I said somewhere above.
>>
>> What if we could integrate cpuidle with cpufreq so that there is one code
>> layer representing what the hardware can do to the scheduler? What benefits
>> can we get from that, if any?
>
> We could debate on this point. I am a bit confused about this. As I see
> it, there is no problem with keeping them separately. One, because of
> code readability; it is easy to understand what are the different
> parameters that the performance of CPU depends on, without needing to
> dig through the code. Two, because cpu frequency kicks in during runtime
> primarily and cpuidle during idle time of the cpu.
>
> But this would also mean creating well defined interfaces between them.
> Integrating cpufreq and cpuidle seems like a better argument to make due
> to their common functionality at a higher level of talking to hardware
> and tuning the performance parameters of cpu. But I disagree that
> scheduler should be put into this common framework as well as it has
> functionalities which are totally disjoint from what subsystems such as
> cpuidle and cpufreq are intended to do.
>>
>> Rafael
>>
>>
>
> Regards
> Preeti U Murthy
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
On Sunday, June 09, 2013 09:12:18 AM Preeti U Murthy wrote:
> Hi Rafael,
Hi Preeti,
> On 06/08/2013 07:32 PM, Rafael J. Wysocki wrote:
> > On Saturday, June 08, 2013 12:28:04 PM Catalin Marinas wrote:
> >> On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote:
[...]
> >> The scheduler can decide to load a single CPU or cluster and let the
> >> others idle. If the total CPU load can fit into a smaller number of CPUs
> >> it could as well tell cpuidle to go into deeper state from the
> >> beginning as it moved all the tasks elsewhere.
> >
> > So why can't it do that today? What's the problem?
>
> The reason that scheduler does not do it today is due to the
> prefer_sibling logic. The tasks within a core get distributed across
> cores if they are more than 1, since the cpu power of a core is not high
> enough to handle more than one task.
>
> However at a socket level/ MC level (cluster at a low level), there can
> be as many tasks as there are cores because the socket has enough CPU
> capacity to handle them. But the prefer_sibling logic moves tasks across
> socket/MC level domains even when load<=domain_capacity.
>
> I think the reason why the prefer_sibling logic was introduced, is that
> scheduler looks at spreading tasks across all the resources it has. It
> believes keeping tasks within a cluster/socket level domain would mean
> tasks are being throttled by having access to only the cluster/socket
> level resources. Which is why it spreads.
>
> The prefer_sibling logic is nothing but a flag set at domain level to
> communicate to the scheduler that load should be spread across the
> groups of this domain. In the above example across sockets/clusters.
>
> But I think it is time we take another look at the prefer_sibling logic
> and decide on its worthiness.
Well, it does look like something that would be good to reconsider.
Some results indicate that for a given CPU package (cluster/socket) there
is a threshold number of tasks such that it is beneficial to pack tasks
into that package as long as the total number of tasks running on it does
not exceed that number. It may be 1 (which is the value used currently with
prefer_sibling set if I understood what you said correctly), but it very
well may be 2 or more (depending on the hardware characteristics).
[...]
> > If we know in advance that the CPU can be put into idle state Cn, there is no
> > reason to put it into anything shallower than that.
> >
> > On the other hand, if the CPU is in Cn already and there is a possibility to
> > put it into a deeper low-power state (which we didn't know about before), it
> > may make sense to promote it into that state (if that's safe) or even wake it
> > up and idle it again.
>
> Yes, sorry I said it wrong in the previous mail. Today the cpuidle
> governor is capable of putting a CPU in idle state Cn directly, by
> looking at various factors like the current load, next timer, history of
> interrupts, exit latency of states. At the end of this evaluation it
> puts it into idle state Cn.
>
> Also it cares to check if its decision is right. This is with respect to
> your statement "if there is a possibility to put it into deeper low
> power state". It queues a timer at a time just after its predicted wake
> up time before putting the cpu to idle state. If this time of wakeup
> prediction is wrong, this timer triggers to wake up the cpu and the cpu
> is hence put into a deeper sleep state.
So I don't think we need to modify that behavior. :-)
> >>> This means time will tell the governors what kinds of workloads are running
> >>> on the system. If the cpu is idle for long, it probably means that the system
> >>> is less loaded and it makes sense to put the cpus to deeper sleep
> >>> states. Of course there could be sporadic bursts or quieting down of
> >>> tasks, but these are corner cases.
> >>
> >> It's nothing wrong with degrading given the information that cpuidle
> >> currently has. It's a heuristics that worked ok so far and may continue
> >> to do so. But see my comments above on why the scheduler could make more
> >> informed decisions.
> >>
> >> We may not move all the power gating information to the scheduler but
> >> maybe find a way to abstract this by giving more hints via the CPU and
> >> cache topology. The cpuidle framework (it may not be much left of a
> >> governor) would then take hints about estimated idle time and invoke the
> >> low-level driver about the right C state.
> >
> > Overall, it looks like it'd be better to split the governor "layer" between the
> > scheduler and the idle driver with a well defined interface between them. That
> > interface needs to be general enough to be independent of the underlying
> > hardware.
> >
> > We need to determine what kinds of information should be passed both ways and
> > how to represent it.
>
> I agree with this design decision.
OK, so let's try to take one step more and think about what part should belong
to the scheduler and what part should be taken care of by the "idle" driver.
Do you have any specific view on that?
> >>>> Of course, you could export more scheduler information to cpuidle,
> >>>> various hooks (task wakeup etc.) but then we have another framework,
> >>>> cpufreq. It also decides the CPU parameters (frequency) based on the
> >>>> load controlled by the scheduler. Can cpufreq decide whether it's
> >>>> better to keep the CPU at higher frequency so that it gets to idle
> >>>> quicker and therefore deeper sleep states? I don't think it has enough
> >>>> information because there are at least three deciding factors
> >>>> (cpufreq, cpuidle and scheduler's load balancing) which are not
> >>>> unified.
> >>>
> >>> Why not? When the cpu load is high, cpu frequency governor knows it has
> >>> to boost the frequency of that CPU. The task gets over quickly, the CPU
> >>> goes idle. Then the cpuidle governor kicks in to put the CPU to deeper
> >>> sleep state gradually.
> >>
> >> The cpufreq governor boosts the frequency enough to cover the load,
> >> which means reducing the idle time. It does not know whether it is
> >> better to boost the frequency twice as high so that it gets to idle
> >> quicker. You can change the governor's policy but does it have any
> >> information from cpuidle?
> >
> > Well, it may get that information directly from the hardware. Actually,
> > intel_pstate does that, but intel_pstate is the governor and the scaling
> > driver combined.
>
> To add to this, cpufreq currently functions in the below fashion. I am
> talking of the on demand governor, since it is more relevant to our
> discussion.
>
> ----stepped up frequency------
> ----threshold--------
> -----stepped down freq level1---
> -----stepped down freq level2---
> ---stepped down freq level3----
>
> If the cpu idle time is below a threshold , it boosts the frequency to
Did you mean "above the threshold"?
> one level above straight away and does not vary it any further. If the
> cpu idle time is below a threshold there is a step down in frequency
> levels by 5% of the current frequency at every sampling period, provided
> the cpu behavior is constant.
>
> I think we can improve this implementation by better interaction with
> cpuidle and scheduler.
>
> When it is stepping up frequency, it should do it in steps of frequency
> being a *function of the current cpu load* also, or function of idle
> time will also do.
>
> When it is stepping down frequency, it should interact with cpuidle. It
> should get from cpuidle information regarding the idle state that the
> cpu is in.The reason is cpufrequency governor is aware of only the idle
> time of the cpu, not the idle state it is in. If it gets to know that
> the cpu is in a deep idle state, it could step down frequency levels to
> leveln straight away. Just like cpuidle does to put cpus into state Cn.
>
> Or an alternate option could be just like stepping up, make the stepping
> down also a function of idle time. Perhaps
> fn(|threshold-idle_time|).
>
> Also one more point to note is that if cpuidle puts cpus into such idle
> states that clock gate the cpus, then there is no need for cpufrequency
> governor for that cpu. cpufreq can check with cpuidle on this front
> before it queries a cpu.
cpufreq ondemand (or intel_pstate for that matter) doesn't touch idle CPUs,
because it uses deferrable timers. It basically only handles CPUs that aren't
idle at the moment.
However, it doesn't exactly know when the given CPU stopped being idle, because
its sampling is not generally synchronized with the scheduler's operations.
That, among other things, is why I'm thinking that it might be better if the
scheduler told cpufreq (or intel_pstate) when to try to adjust frequencies so
that it doesn't need to sample by itself.
[...]
> >>
> >> Let's say there is an increase in the load, does the scheduler wait
> >> until cpufreq figures this out or tries to take the other CPUs out of
> >> idle? Who's making this decision? That's currently a potentially
> >> unstable loop.
> >
> > Yes, it is and I don't think we currently have good answers here.
>
> My answer to the above question is scheduler does not wait until cpufreq
> figures it out. All that the scheduler cares about today is load
> balancing. Spread the load and hope it finishes soon. There is a
> possibility today that even before cpu frequency governor can boost the
> frequency of cpu, the scheduler can spread the load.
That is a valid observation, but I wanted to say that we didn't really
understood how those things should be arranged.
> As for the second question it will wakeup idle cpus if it must to load
> balance.
>
> It is a good question asked: "does the scheduler wait until cpufreq
> figures it out." Currently the answer is no, it does not communicate
> with cpu frequency at all (except through cpu power, but that is the
> good part of the story, so I will not get there now). But maybe we
> should change this. I think we can do so the following way.
>
> When can a scheduler talk to cpu frequency? It can do so under the below
> circumstances:
>
> 1. Load is too high across the systems, all cpus are loaded, no chance
> of load balancing. Therefore ask cpu frequency governor to step up
> frequency to get improve performance.
>
> 2. The scheduler finds out that if it has to load balance, it has to do
> so on cpus which are in deep idle state( Currently this logic is not
> present, but worth getting it in). It then decides to increase the
> frequency of the already loaded cpus to improve performance. It calls
> cpu freq governor.
>
> 3. The scheduler finds out that if it has to load balance, it has to do
> so on a different power domain which is idle currently(shallow/deep). It
> thinks the better of it and calls cpu frequency governor to boost the
> frequency of the cpus in the current domain.
>
> While 2 and 3 depend on scheduler having knowledge about idle states and
> power domains, which it currently does not have, 1 can be achieved with
> the current code. scheduler keeps track of failed ld balancing efforts
> with lb_failed. If it finds that while load balancing from busy group
> failed (lb_failed > 0), it can call cpu freq governor to step up the cpu
> frequency of this busy cpu group, with gov_check_cpu() in cpufrequency
> governor code.
Well, if the model is that the scheduler tells cpufreq when to modify
frequencies, then it'll need to do that on a regular basis, like every time
a task is scheduled or similar.
> > The results of many measurements seem to indicate that it generally is better
> > to do the work as quickly as possible and then go idle again, but there are
> > costs associated with going back and forth from idle to non-idle etc.
>
> I think we can even out the cost benefit of race to idle, by choosing to
> do it wisely. Like for example if points 2 and 3 above are true (idle
> cpus are in deep sleep states or need to ld balance on a different power
> domain), then step up the frequency of the current working cpus and reap
> its benefit.
>
> >
> > The main problem with cpufreq that I personally have is that the governors
> > carry out their own sampling with pretty much arbitrary resolution that may
> > lead to suboptimal decisions. It would be much better if the scheduler
> > indicated when to *consider* the changing of CPU performance parameters (that
> > may not be frequency alone and not even frequency at all in general), more or
> > less the same way it tells cpuidle about idle CPUs, but I'm not sure if it
> > should decide what performance points to run at.
>
> Very true. See the points 1,2 and 3 above where I list out when
> scheduler can call cpu frequency.
Well, as I said above, I think that'd need to be done more frequently.
> Also an idea about how cpu frequency governor can decide on the scaling
> frequency is stated above.
Actaully, intel_pstate uses a PID controller for making those decisions and
I think this may be just the right thing to do.
[...]
> >
> > Well, there's nothing like "predicted load". At best, we may be able to make
> > more or less educated guesses about it, so in my opinion it is better to use
> > the information about what happened in the past for making decisions regarding
> > the current settings and re-adjust them over time as we get more information.
>
> Agree with this as well. scheduler can at best supply information
> regarding the historic load and hope that it is what defines the future
> as well. Apart from this I dont know what other information scheduler
> can supply cpuidle governor with.
> >
> > So how much decision making regarding the idle state to put the given CPU into
> > should be there in the scheduler? I believe the only information coming out
> > of the scheduler regarding that should be "OK, this CPU is now idle and I'll
> > need it in X nanoseconds from now" plus possibly a hint about the wakeup
> > latency tolerance (but those hints may come from other places too). That said
> > the decision *which* CPU should become idle at the moment very well may require
> > some information about what options are available from the layer below (for
> > example, "putting core X into idle for Y of time will save us Z energy" or
> > something like that).
>
> Agree. Except that the information should be "Ok , this CPU is now idle
> and it has not done much work in the recent past,it is a 10% loaded CPU".
And what would that be useful for to the "idle" layer? What matters is the
"I'll need it in X nanoseconds from now" part.
Yes, the load part would be interesting to the "frequency" layer.
> This can be said today using PJT's metric. It is now for the cpuidle
> governor to decide the idle state to go to. Thats what happens today too.
>
> >
> > And what about performance scaling? Quite frankly, in my opinion that
> > requires some more investigation, because there still are some open questions
> > in that area. To start with we can just continue using the current heuristics,
> > but perhaps with the scheduler calling the scaling "governor" when it sees fit
> > instead of that "governor" running kind of in parallel with it.
>
> Exactly. How this can be done is elaborated above. This is one of the
> key things we need today,IMHO.
>
> >
[...]
> >
> > There is another angle to look at that as I said somewhere above.
> >
> > What if we could integrate cpuidle with cpufreq so that there is one code
> > layer representing what the hardware can do to the scheduler? What benefits
> > can we get from that, if any?
>
> We could debate on this point. I am a bit confused about this. As I see
> it, there is no problem with keeping them separately. One, because of
> code readability; it is easy to understand what are the different
> parameters that the performance of CPU depends on, without needing to
> dig through the code. Two, because cpu frequency kicks in during runtime
> primarily and cpuidle during idle time of the cpu.
That's a very useful observation. Indeed, there's the "idle" part that needs
to be invoked when the CPU goes idle (and it should decide what idle state to
put that CPU into), and there's the "scaling" part that needs to be invoked
when the CPU has work to do (and it should decide what performance point to
put that CPU into). The question is, though, if it's better to have two
separate frameworks for those things (which is what we have today) or to make
them two parts of the same framework (like two callbacks one of which will be
executed for CPUs that have just become idle and the other will be invoked
for CPUs that have just got work to do).
> But this would also mean creating well defined interfaces between them.
> Integrating cpufreq and cpuidle seems like a better argument to make due
> to their common functionality at a higher level of talking to hardware
> and tuning the performance parameters of cpu. But I disagree that
> scheduler should be put into this common framework as well as it has
> functionalities which are totally disjoint from what subsystems such as
> cpuidle and cpufreq are intended to do.
That's correct. The role of the scheduler, in my opinion, may be to call the
"idle" and "scaling" functions at the right time and to give them information
needed to make optimal choices.
Thanks,
Rafael
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
On Mon, 10 Jun 2013, Daniel Lezcano wrote:
> Some SoC can have a cluster of cpus sharing some resources, eg cache, so
> they must enter the same state at the same moment. Beside the
> synchronization mechanisms, that adds a dependency with the next event.
> For example, the u8500 board has a couple of cpus. In order to make them
> to enter in retention, both must enter the same state, but not necessary
> at the same moment. The first cpu will wait in WFI and the second one
> will initiate the retention mode when entering to this state.
> Unfortunately, some time could have passed while the second cpu entered
> this state and the next event for the first cpu could be too close, thus
> violating the criteria of the governor when it choose this state for the
> second cpu.
>
> Also the latencies could change with the frequencies, so there is a
> dependency with cpufreq, the lesser the frequency is, the higher the
> latency is. If the scheduler takes the decision to go to a specific
> state assuming the exit latency is a given duration, if the frequency
> decrease, this exit latency could increase also and lead the system to
> be less responsive.
>
> I don't know, how were made the latencies computation (eg. worst case,
> taken with the lower frequency or not) but we have just one set of
> values. That should happen with the current code.
>
> Another point is the timer allowing to detect bad decision and go to a
> deep idle state. With the cluster dependency described above, we may
> wake up a particular cpu, which turns on the cluster and make the entire
> cluster to wake up in order to enter a deeper state, which could fail
> because of the other cpu may not fulfill the constraint at this moment.
Nobody is saying that this sort of thing should be in the fastpath of the
scheduler.
But if the scheduler has a table that tells it the possible states, and the cost
to get from the current state to each of these states (and to get back and/or
wake up to full power), then the scheduler can make the decision on what to do,
invoke a routine to make the change (and in the meantime, not be fighting the
change by trying to schedule processes on a core that's about to be powered
off), and then when the change happens, the scheduler will have a new version of
the table of possible states and costs
This isn't in the fastpath, it's in the rebalancing logic.
David Lang
On 6/11/2013 5:27 PM, David Lang wrote:
>
> Nobody is saying that this sort of thing should be in the fastpath of the scheduler.
>
> But if the scheduler has a table that tells it the possible states, and the cost to get from the current state to each of these states (and to get back and/or wake up to
> full power), then the scheduler can make the decision on what to do, invoke a routine to make the change (and in the meantime, not be fighting the change by trying to
> schedule processes on a core that's about to be powered off), and then when the change happens, the scheduler will have a new version of the table of possible states and costs
>
> This isn't in the fastpath, it's in the rebalancing logic.
the reality is much more complex unfortunately.
C and P states hang together tightly, and even C state on
one core impacts other cores' performance, just like P state selection
on one core impacts other cores.
(at least for x86, we should really stop talking as if the OS picks the "frequency",
that's just not the case anymore)
On 06/12/2013 02:27 AM, David Lang wrote:
> On Mon, 10 Jun 2013, Daniel Lezcano wrote:
>
>> Some SoC can have a cluster of cpus sharing some resources, eg cache, so
>> they must enter the same state at the same moment. Beside the
>> synchronization mechanisms, that adds a dependency with the next event.
>> For example, the u8500 board has a couple of cpus. In order to make them
>> to enter in retention, both must enter the same state, but not necessary
>> at the same moment. The first cpu will wait in WFI and the second one
>> will initiate the retention mode when entering to this state.
>> Unfortunately, some time could have passed while the second cpu entered
>> this state and the next event for the first cpu could be too close, thus
>> violating the criteria of the governor when it choose this state for the
>> second cpu.
>>
>> Also the latencies could change with the frequencies, so there is a
>> dependency with cpufreq, the lesser the frequency is, the higher the
>> latency is. If the scheduler takes the decision to go to a specific
>> state assuming the exit latency is a given duration, if the frequency
>> decrease, this exit latency could increase also and lead the system to
>> be less responsive.
>>
>> I don't know, how were made the latencies computation (eg. worst case,
>> taken with the lower frequency or not) but we have just one set of
>> values. That should happen with the current code.
>>
>> Another point is the timer allowing to detect bad decision and go to a
>> deep idle state. With the cluster dependency described above, we may
>> wake up a particular cpu, which turns on the cluster and make the entire
>> cluster to wake up in order to enter a deeper state, which could fail
>> because of the other cpu may not fulfill the constraint at this moment.
>
> Nobody is saying that this sort of thing should be in the fastpath of
> the scheduler.
>
> But if the scheduler has a table that tells it the possible states, and
> the cost to get from the current state to each of these states (and to
> get back and/or wake up to full power), then the scheduler can make the
> decision on what to do, invoke a routine to make the change (and in the
> meantime, not be fighting the change by trying to schedule processes on
> a core that's about to be powered off), and then when the change
> happens, the scheduler will have a new version of the table of possible
> states and costs
>
> This isn't in the fastpath, it's in the rebalancing logic.
As Arjan mentionned it is not as simple as this.
We want the scheduler to take some decisions with the knowledge of idle
latencies. In other words move the governor logic into the scheduler.
The scheduler can take decision and the backend driver provides the
interface to go to the idle state.
But unfortunately each hardware is behaving in different ways and
describing such behaviors will help to find the correct design, I am not
raising a lot of issues but just trying to enumerate the constraints we
have.
What is the correct decision when a lot of pm blocks are tied together
and the
In the example given by Arjan, the frequencies could be per cluster,
hence decreasing the frequency for a core will decrease the frequency of
the other core. So if the scheduler takes the decision to put one core
into a specific idle state, regarding the target residency and the exit
latency when the frequency is at max (the other core is doing
something), and then the frequency decrease, the exit latency may
increase in this case and the idle cpu will take more time to exit the
idle state than expected thus adding latency to the system.
What would be the correct decision in this case ? Wake up the idle cpu
when the frequency change to re-evaluate an idle state ? Provide idle
latencies for the min freq only ? Or is it acceptable to have such
latency added when the frequency decrease ?
Also, an interesting question is how do we get these latencies ?
They are all written in the c-state tables but we don't know the
accuracy of these values ? Were they measured with freq max / min ?
Were they measured with a driver powering down the peripherals or without ?
For the embedded systems, we may have different implementations and
maybe different latencies. Would be makes sense to pass these values
through a device tree and let the SoC vendor to specify the right values
? (IMHO, only the SoC vendor can do a correct measurement with an
oscilloscope).
I know there are lot of questions :)
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
On Wed, Jun 12, 2013 at 7:18 AM, Arjan van de Ven <[email protected]> wrote:
> On 6/11/2013 5:27 PM, David Lang wrote:
>>
>>
>> Nobody is saying that this sort of thing should be in the fastpath of the
>> scheduler.
>>
>> But if the scheduler has a table that tells it the possible states, and
>> the cost to get from the current state to each of these states (and to get
>> back and/or wake up to
>> full power), then the scheduler can make the decision on what to do,
>> invoke a routine to make the change (and in the meantime, not be fighting
>> the change by trying to
>> schedule processes on a core that's about to be powered off), and then
>> when the change happens, the scheduler will have a new version of the table
>> of possible states and costs
>>
>> This isn't in the fastpath, it's in the rebalancing logic.
>
>
> the reality is much more complex unfortunately.
> C and P states hang together tightly, and even C state on
> one core impacts other cores' performance, just like P state selection
> on one core impacts other cores.
>
> (at least for x86, we should really stop talking as if the OS picks the
> "frequency",
> that's just not the case anymore)
This is true of ARM platforms too. As Daniel pointed out in an earlier
email, the operating point (frequency, voltage) has a bearing on the
c-state latency too.
An additional complexity is thermal constraints. E.g. On a quad-core
Cortex-A15 processor capable of say 1.5GHz, you won't be able to run
all 4 cores at that speed for very long w/o exceeding the thermal
envelope. These overdrive frequencies (turbo in x86-speak) impact the
rest of the system by either constraining the frequency of other cores
or requiring aggresive thermal management.
Do we really want to track these details in the scheduler or just let
the scheduler provide notifications to the existing subsystems
(cpufreq, cpuidle, thermal, etc.) with some sort of feedback going
back to the scheduler to influence future decisions?
Feeback to the scheduler could be something like the following (pardon
the names):
1. ISOLATE_CORE: Don't schedule anything on this core - cpuidle might
use this to synchronise cores for a cluster shutdown, thermal
framework could use this as idle injection to reduce temperature
2. CAP_CAPACITY: Don't expect cpufreq to raise the frequency on this
core - cpufreq might use this to cap overall energy since overdrive
operating points are very expensive, thermal might use this to slow
down rate of increase of die temperature
Regards,
Amit
Hi Arjan,
On Wed, Jun 12, 2013 at 02:48:58AM +0100, Arjan van de Ven wrote:
> On 6/11/2013 5:27 PM, David Lang wrote:
> > Nobody is saying that this sort of thing should be in the fastpath
> > of the scheduler.
> >
> > But if the scheduler has a table that tells it the possible states,
> > and the cost to get from the current state to each of these states
> > (and to get back and/or wake up to full power), then the scheduler
> > can make the decision on what to do, invoke a routine to make the
> > change (and in the meantime, not be fighting the change by trying to
> > schedule processes on a core that's about to be powered off), and
> > then when the change happens, the scheduler will have a new version
> > of the table of possible states and costs
> >
> > This isn't in the fastpath, it's in the rebalancing logic.
>
> the reality is much more complex unfortunately.
> C and P states hang together tightly, and even C state on one core
> impacts other cores' performance, just like P state selection on one
> core impacts other cores.
>
> (at least for x86, we should really stop talking as if the OS picks
> the "frequency", that's just not the case anymore)
I agree, the reality is very complex. But we should go back and analyse
what problem we are trying to solve, what each framework is trying to
address.
When viewed separately from the scheduler, cpufreq and cpuidle governors
do the right thing. But they both base their action on the CPU load
(balance) decided by the scheduler and it's the latter that we are
trying to adjust (and we are still debating what the right approach is).
Since such information seems too complex to be moved into the scheduler,
why don't we get cpufreq in charge of restricting the load balancing to
certain CPUs? It already tracks the load/idle time to (gradually) change
the P state. Depending on the governor/policy, it could decide that (for
example) 4 CPUs running at higher power P state are enough, telling the
scheduler to ignore the other CPUs. It won't pick a frequency, but (as
it currently does) adjust it to keep a minimal idle state on those CPUs.
If that's not longer possible (high load), it can remove the restriction
and let the scheduler use the other idle CPUs (cpufreq could even do a
direct a load_balance() call). This is a governor decision and the user
is in control of what governors are used.
Cpuidle I think for now can stay the same, gradually entering deeper
sleep states. It could be later unified with cpufreq if there are any
benefits. In deciding the load balancing restrictions, maybe cpufreq
should be aware of C-state latencies.
Cpufreq would need to get more knowledge of the power topology and
thermal management. It would still be the framework restricting the P
state or changing the load balancing restrictions to let CPUs cool down.
More hooks could be added if needed for better responsiveness (like
entering idle or task wake-up).
With the above, the scheduler will just focus on performance (given the
restrictions imposed by cpufreq) and it only needs to be aware of the
CPU topology from a performance perspective (caches, hyperthreading)
together with the cpu_power parameter for the weighted load.
--
Catalin
>>> This isn't in the fastpath, it's in the rebalancing logic.
>>
>> the reality is much more complex unfortunately.
>> C and P states hang together tightly, and even C state on one core
>> impacts other cores' performance, just like P state selection on one
>> core impacts other cores.
>>
>> (at least for x86, we should really stop talking as if the OS picks
>> the "frequency", that's just not the case anymore)
>
> I agree, the reality is very complex. But we should go back and analyse
> what problem we are trying to solve, what each framework is trying to
> address.
>
> When viewed separately from the scheduler, cpufreq and cpuidle governors
> do the right thing. But they both base their action on the CPU load
> (balance) decided by the scheduler and it's the latter that we are
> trying to adjust (and we are still debating what the right approach is).
>
> Since such information seems too complex to be moved into the scheduler,
> why don't we get cpufreq in charge of restricting the load balancing to
> certain CPUs? It already tracks the load/idle time to (gradually) change
> the P state. Depending on the governor/policy, it could decide that (for
(btw in case you missed it, for Intel HW we no longer use cpufreq anymore)
> Cpuidle I think for now can stay the same, gradually entering deeper
> sleep states. It could be later unified with cpufreq if there are any
> benefits. In deciding the load balancing restrictions, maybe cpufreq
> should be aware of C-state latencies.
on the Intel side, we're likely to merge the Intel idle driver and P state driver
in the near future fwiw.
We'll keep using cpuidle framework (since it doesn't do all that much other than
provide a nice hook for the idle loop), but we likely will make a hw specific
selection logic there.
I do agree the scheduler needs to get integrated a bit better, in that it
has some better knowledge, and to be honest, we likely need to switch from
giving tasks credit for "time consumed" to giving them credit for something like
"cycles consumed" or "instructions executed" or a mix thereof.
So that a task that runs on a slower CPU (for either policy choice reasons or
due to hardware capabilities), it gets charged less than when it runs fast.
On Wed, Jun 12, 2013 at 04:24:52PM +0100, Arjan van de Ven wrote:
> >>> This isn't in the fastpath, it's in the rebalancing logic.
> >>
> >> the reality is much more complex unfortunately.
> >> C and P states hang together tightly, and even C state on one core
> >> impacts other cores' performance, just like P state selection on one
> >> core impacts other cores.
> >>
> >> (at least for x86, we should really stop talking as if the OS picks
> >> the "frequency", that's just not the case anymore)
> >
> > I agree, the reality is very complex. But we should go back and analyse
> > what problem we are trying to solve, what each framework is trying to
> > address.
> >
> > When viewed separately from the scheduler, cpufreq and cpuidle governors
> > do the right thing. But they both base their action on the CPU load
> > (balance) decided by the scheduler and it's the latter that we are
> > trying to adjust (and we are still debating what the right approach is).
> >
> > Since such information seems too complex to be moved into the scheduler,
> > why don't we get cpufreq in charge of restricting the load balancing to
> > certain CPUs? It already tracks the load/idle time to (gradually) change
> > the P state. Depending on the governor/policy, it could decide that (for
>
> (btw in case you missed it, for Intel HW we no longer use cpufreq anymore)
Do you mean the intel_pstate.c code? It indeed doesn't use much of
cpufreq, just setpolicy and it's on its own afterwards. Separating this
from the framework probably has real benefits for the Intel processors
but it would make a unified scheduler/cpufreq/cpuidle solution harder
(just a remark, I don't say it's good or bad, there are many
opinions against the unified solution; ARM could do the same for
configurations like big.LITTLE).
But such driver could still interact with the scheduler to control it's
load balancing. At a quick look (I'm not familiar with this driver), it
tracks the per-CPU load and increases or decreases the P-state (similar
to a cpufreq governor). It could as well track the total load and
(depending on hardware configuration), get some CPUs in lower
performance P-state (or even C-state) and tell the scheduler to avoid
them.
One way to control load-balancing ratio is via something like
arch_scale_freq_power(). We could tweak the scheduler further so that
something like cpu_power==0 means do not schedule anything there.
So my proposal is to move the load-balancing hints (load ratio, avoiding
CPUs etc.) outside the scheduler into drivers like intel_pstate.c or
cpufreq governors. We then focus on getting the best performance out of
the scheduler (like quicker migration) but it would not be concerned
with the power consumption.
> I do agree the scheduler needs to get integrated a bit better, in that it
> has some better knowledge, and to be honest, we likely need to switch from
> giving tasks credit for "time consumed" to giving them credit for something like
> "cycles consumed" or "instructions executed" or a mix thereof.
> So that a task that runs on a slower CPU (for either policy choice reasons or
> due to hardware capabilities), it gets charged less than when it runs fast.
I agree, this would be useful in optimising the scheduler so that it
makes the right task placement/migration decisions (but as I said above,
make the power aspect transparent to the scheduler).
--
Catalin
On Wed, 12 Jun 2013, Amit Kucheria wrote:
> On Wed, Jun 12, 2013 at 7:18 AM, Arjan van de Ven <[email protected]> wrote:
>> On 6/11/2013 5:27 PM, David Lang wrote:
>>>
>>>
>>> Nobody is saying that this sort of thing should be in the fastpath of the
>>> scheduler.
>>>
>>> But if the scheduler has a table that tells it the possible states, and
>>> the cost to get from the current state to each of these states (and to get
>>> back and/or wake up to
>>> full power), then the scheduler can make the decision on what to do,
>>> invoke a routine to make the change (and in the meantime, not be fighting
>>> the change by trying to
>>> schedule processes on a core that's about to be powered off), and then
>>> when the change happens, the scheduler will have a new version of the table
>>> of possible states and costs
>>>
>>> This isn't in the fastpath, it's in the rebalancing logic.
>>
>>
>> the reality is much more complex unfortunately.
>> C and P states hang together tightly, and even C state on
>> one core impacts other cores' performance, just like P state selection
>> on one core impacts other cores.
>>
>> (at least for x86, we should really stop talking as if the OS picks the
>> "frequency",
>> that's just not the case anymore)
>
> This is true of ARM platforms too. As Daniel pointed out in an earlier
> email, the operating point (frequency, voltage) has a bearing on the
> c-state latency too.
>
> An additional complexity is thermal constraints. E.g. On a quad-core
> Cortex-A15 processor capable of say 1.5GHz, you won't be able to run
> all 4 cores at that speed for very long w/o exceeding the thermal
> envelope. These overdrive frequencies (turbo in x86-speak) impact the
> rest of the system by either constraining the frequency of other cores
> or requiring aggresive thermal management.
>
> Do we really want to track these details in the scheduler or just let
> the scheduler provide notifications to the existing subsystems
> (cpufreq, cpuidle, thermal, etc.) with some sort of feedback going
> back to the scheduler to influence future decisions?
>
> Feeback to the scheduler could be something like the following (pardon
> the names):
>
> 1. ISOLATE_CORE: Don't schedule anything on this core - cpuidle might
> use this to synchronise cores for a cluster shutdown, thermal
> framework could use this as idle injection to reduce temperature
> 2. CAP_CAPACITY: Don't expect cpufreq to raise the frequency on this
> core - cpufreq might use this to cap overall energy since overdrive
> operating points are very expensive, thermal might use this to slow
> down rate of increase of die temperature
How much data are you going to have to move back and forth between the different
systems?
do you really only want the all-or-nothing "use this core as much as possible"
vs "don't use this core at all"? or do you need the ability to indicate how much
to use a particular core (something that is needed anyway for asymetrical cores
I think)
If there is too much information that needs to be moved back and forth between
these 'subsystems' for the 'right' thing to happen, then it would seem like it
makes more sense to combine them.
Even combined, there are parts that are still pretty modular (like the details
of shifting from one state to another, and the different high level strategies
to follow for different modes of operation), but having access to all the
information rather than only bits and pieces of the information at lower
granularity would seem like an improvement.
David Lang
On Wed, 12 Jun 2013, Daniel Lezcano wrote:
>> On Mon, 10 Jun 2013, Daniel Lezcano wrote:
>>
>>> Some SoC can have a cluster of cpus sharing some resources, eg cache, so
>>> they must enter the same state at the same moment. Beside the
>>> synchronization mechanisms, that adds a dependency with the next event.
>>> For example, the u8500 board has a couple of cpus. In order to make them
>>> to enter in retention, both must enter the same state, but not necessary
>>> at the same moment. The first cpu will wait in WFI and the second one
>>> will initiate the retention mode when entering to this state.
>>> Unfortunately, some time could have passed while the second cpu entered
>>> this state and the next event for the first cpu could be too close, thus
>>> violating the criteria of the governor when it choose this state for the
>>> second cpu.
>>>
>>> Also the latencies could change with the frequencies, so there is a
>>> dependency with cpufreq, the lesser the frequency is, the higher the
>>> latency is. If the scheduler takes the decision to go to a specific
>>> state assuming the exit latency is a given duration, if the frequency
>>> decrease, this exit latency could increase also and lead the system to
>>> be less responsive.
>>>
>>> I don't know, how were made the latencies computation (eg. worst case,
>>> taken with the lower frequency or not) but we have just one set of
>>> values. That should happen with the current code.
>>>
>>> Another point is the timer allowing to detect bad decision and go to a
>>> deep idle state. With the cluster dependency described above, we may
>>> wake up a particular cpu, which turns on the cluster and make the entire
>>> cluster to wake up in order to enter a deeper state, which could fail
>>> because of the other cpu may not fulfill the constraint at this moment.
>>
>> Nobody is saying that this sort of thing should be in the fastpath of
>> the scheduler.
>>
>> But if the scheduler has a table that tells it the possible states, and
>> the cost to get from the current state to each of these states (and to
>> get back and/or wake up to full power), then the scheduler can make the
>> decision on what to do, invoke a routine to make the change (and in the
>> meantime, not be fighting the change by trying to schedule processes on
>> a core that's about to be powered off), and then when the change
>> happens, the scheduler will have a new version of the table of possible
>> states and costs
>>
>> This isn't in the fastpath, it's in the rebalancing logic.
>
> As Arjan mentionned it is not as simple as this.
>
> We want the scheduler to take some decisions with the knowledge of idle
> latencies. In other words move the governor logic into the scheduler.
>
> The scheduler can take decision and the backend driver provides the
> interface to go to the idle state.
>
> But unfortunately each hardware is behaving in different ways and
> describing such behaviors will help to find the correct design, I am not
> raising a lot of issues but just trying to enumerate the constraints we
> have.
>
> What is the correct decision when a lot of pm blocks are tied together
> and the
>
> In the example given by Arjan, the frequencies could be per cluster,
> hence decreasing the frequency for a core will decrease the frequency of
> the other core. So if the scheduler takes the decision to put one core
> into a specific idle state, regarding the target residency and the exit
> latency when the frequency is at max (the other core is doing
> something), and then the frequency decrease, the exit latency may
> increase in this case and the idle cpu will take more time to exit the
> idle state than expected thus adding latency to the system.
>
> What would be the correct decision in this case ? Wake up the idle cpu
> when the frequency change to re-evaluate an idle state ? Provide idle
> latencies for the min freq only ? Or is it acceptable to have such
> latency added when the frequency decrease ?
>
> Also, an interesting question is how do we get these latencies ?
>
> They are all written in the c-state tables but we don't know the
> accuracy of these values ? Were they measured with freq max / min ?
>
> Were they measured with a driver powering down the peripherals or without ?
>
> For the embedded systems, we may have different implementations and
> maybe different latencies. Would be makes sense to pass these values
> through a device tree and let the SoC vendor to specify the right values
> ? (IMHO, only the SoC vendor can do a correct measurement with an
> oscilloscope).
>
> I know there are lot of questions :)
well, I have two immediate reactions.
First, use the values provided by the vendor, if they are wrong performance is
not optimum and people will pick a different vendor (so they have an incentive
to be right :-)
Second, "measure them" :-)
have the device tree enumerate the modes of operation, but then at bootup, run
through a series of tests to bounce between the different modes and measure how
long it takes to move back and forth. If the system can't measure the difference
against it's clocks, then the user isn't going to see the difference either, so
there's no need to be as accurate as a lab bench with a scope. What matters is
how much work can end up getting done for the user, not the number of
nanoseconds between voltage changes (the latter will affect the former, but it's
the former that you really care about)
remember, perfect is the enemy of good enough. you don't have to have a perfect
mapping of every possible change, you just need to be close enough to make
reasonable decisions. You can't really predict the future anyway, so you are
making a guess at what the load on the system is going to be in the future.
Sometimes you will guess wrong no matter how accurate your latency measurements
are. You have to accept that, and once you accept that, the severity of being
wrong in some corner cases become less significant.
David Lang
Hi,
On 06/11/2013 06:20 AM, Rafael J. Wysocki wrote:
>
> OK, so let's try to take one step more and think about what part should belong
> to the scheduler and what part should be taken care of by the "idle" driver.
>
> Do you have any specific view on that?
I gave it some thought and went through Ingo's mail once again. I have
some view points which I have stated at the end of this mail.
>>>>>> Of course, you could export more scheduler information to cpuidle,
>>>>>> various hooks (task wakeup etc.) but then we have another framework,
>>>>>> cpufreq. It also decides the CPU parameters (frequency) based on the
>>>>>> load controlled by the scheduler. Can cpufreq decide whether it's
>>>>>> better to keep the CPU at higher frequency so that it gets to idle
>>>>>> quicker and therefore deeper sleep states? I don't think it has enough
>>>>>> information because there are at least three deciding factors
>>>>>> (cpufreq, cpuidle and scheduler's load balancing) which are not
>>>>>> unified.
>>>>>
>>>>> Why not? When the cpu load is high, cpu frequency governor knows it has
>>>>> to boost the frequency of that CPU. The task gets over quickly, the CPU
>>>>> goes idle. Then the cpuidle governor kicks in to put the CPU to deeper
>>>>> sleep state gradually.
>>>>
>>>> The cpufreq governor boosts the frequency enough to cover the load,
>>>> which means reducing the idle time. It does not know whether it is
>>>> better to boost the frequency twice as high so that it gets to idle
>>>> quicker. You can change the governor's policy but does it have any
>>>> information from cpuidle?
>>>
>>> Well, it may get that information directly from the hardware. Actually,
>>> intel_pstate does that, but intel_pstate is the governor and the scaling
>>> driver combined.
>>
>> To add to this, cpufreq currently functions in the below fashion. I am
>> talking of the on demand governor, since it is more relevant to our
>> discussion.
>>
>> ----stepped up frequency------
>> ----threshold--------
>> -----stepped down freq level1---
>> -----stepped down freq level2---
>> ---stepped down freq level3----
>>
>> If the cpu idle time is below a threshold , it boosts the frequency to
>
> Did you mean "above the threshold"?
No I meant "above". I am referring to the cpu *idle* time.
>> Also an idea about how cpu frequency governor can decide on the scaling
>> frequency is stated above.
>
> Actaully, intel_pstate uses a PID controller for making those decisions and
> I think this may be just the right thing to do.
But don't you think we need to include the current cpu load during this
decision making as well? I mean a fn(idle_time) logic in cpu frequency
governor, which is currently absent. Today, it just checks if idle_time
< threshold, and sets one specific frequency. Of course the PID could
then make the decision about the frequencies which can be candidates for
scaling up, but cpu freq governor could decide which among these to pick
based on fn(idle_time) .
>
> [...]
>
>>>
>>> Well, there's nothing like "predicted load". At best, we may be able to make
>>> more or less educated guesses about it, so in my opinion it is better to use
>>> the information about what happened in the past for making decisions regarding
>>> the current settings and re-adjust them over time as we get more information.
>>
>> Agree with this as well. scheduler can at best supply information
>> regarding the historic load and hope that it is what defines the future
>> as well. Apart from this I dont know what other information scheduler
>> can supply cpuidle governor with.
>>>
>>> So how much decision making regarding the idle state to put the given CPU into
>>> should be there in the scheduler? I believe the only information coming out
>>> of the scheduler regarding that should be "OK, this CPU is now idle and I'll
>>> need it in X nanoseconds from now" plus possibly a hint about the wakeup
>>> latency tolerance (but those hints may come from other places too). That said
>>> the decision *which* CPU should become idle at the moment very well may require
>>> some information about what options are available from the layer below (for
>>> example, "putting core X into idle for Y of time will save us Z energy" or
>>> something like that).
>>
>> Agree. Except that the information should be "Ok , this CPU is now idle
>> and it has not done much work in the recent past,it is a 10% loaded CPU".
>
> And what would that be useful for to the "idle" layer? What matters is the
> "I'll need it in X nanoseconds from now" part.
>
> Yes, the load part would be interesting to the "frequency" layer.
>>> What if we could integrate cpuidle with cpufreq so that there is one code
>>> layer representing what the hardware can do to the scheduler? What benefits
>>> can we get from that, if any?
>>
>> We could debate on this point. I am a bit confused about this. As I see
>> it, there is no problem with keeping them separately. One, because of
>> code readability; it is easy to understand what are the different
>> parameters that the performance of CPU depends on, without needing to
>> dig through the code. Two, because cpu frequency kicks in during runtime
>> primarily and cpuidle during idle time of the cpu.
>
> That's a very useful observation. Indeed, there's the "idle" part that needs
> to be invoked when the CPU goes idle (and it should decide what idle state to
> put that CPU into), and there's the "scaling" part that needs to be invoked
> when the CPU has work to do (and it should decide what performance point to
> put that CPU into). The question is, though, if it's better to have two
> separate frameworks for those things (which is what we have today) or to make
> them two parts of the same framework (like two callbacks one of which will be
> executed for CPUs that have just become idle and the other will be invoked
> for CPUs that have just got work to do).
>
>> But this would also mean creating well defined interfaces between them.
>> Integrating cpufreq and cpuidle seems like a better argument to make due
>> to their common functionality at a higher level of talking to hardware
>> and tuning the performance parameters of cpu. But I disagree that
>> scheduler should be put into this common framework as well as it has
>> functionalities which are totally disjoint from what subsystems such as
>> cpuidle and cpufreq are intended to do.
>
> That's correct. The role of the scheduler, in my opinion, may be to call the
> "idle" and "scaling" functions at the right time and to give them information
> needed to make optimal choices.
Having looked at the points being brought about in this discussion and
the mail that Ingo sent out regarding his view points, I have a few
points to make.
David Lezcano made a valid point when he stated that we need to
*move cpufrequency and cpuidle governor logic into scheduler while
retaining their driver functionality in those subsystems.*
It is true that I was strongly against moving the governor logic into
the scheduler, thinking it would be simpler to enhance the communication
interface between the scheduler and the governors.
But having given this some thought,I think this would mean greater scope
for loopholes.
Catalin pointed it out well with an example, when he said in one of his
mails that, assuming scheduler ends up telling cpu frequency governor
when to boost/lower the frequency and note that scheduler is not aware
of the user policies that have gone in to decide if cpu frequency
governor actually does what the scheduler is asking it to do.
And it is only cpu frequency governor who is aware of these user
policies and not scheduler. So how long should the scheduler wait for
cpu frequency governor to boost the frequency? What if the user has
selected a powersave mode, and the cpu frequency cannot rise any
further? That would mean cpu frequency governor telling scheduler that
it can't do what the scheduler is asking it to do.
This decision of scheduler then is a waste of time,since it gets
rejected by the cpufrequency governor and nothing comes of it.
Very clearly the scheduler not being aware of the user policy is a big
drawback; had it known the user policies before hand it would not even
have considered boosting the cpu frequency of the cpu in question.
This point that Ingo made is something we need to look hard at."Today
the power saving landscape is fragmented." The scheduler today does not
know what in the world is the end result of its decisions. cpuidle and
cpu frequency could take decisions that is totally counter intuitive to
the scheduler's. Improving the communication between them would surely
mean we export more and more information back and forth for better
communication, whose end result would probably be to merge the governor
and scheduler. If this vision that "they will eventually get so close,
that we will end up merging them", is agreed upon, then it might be best
to merge them right away without wasting effort into adding logic that
tries to communicate between them or even trying to separate the
functionalities between scheduler and governors.
I don't think removing certain scheduler functionalities and putting it
instead into governors is the right thing to do. Scheduler's functions
are tightly coupled with one another. Breaking one will in my opinion
break a lot of things.
There have been points brought out strongly about how the scheduler
should have global view of cores so that it knows the effect on a socket
when it decides on what to do with a core for instance. This could be
the next step in its enhancement. Taking up one of the examples that
Daniel brought out:" Putting one of the cpus to idle state could lower
the frequency of the socket,thus hampering the exit latency of this idle
state ". (Not the exact words, but this is the point.)
Notice how in the above,if a scheduler were to be able to understand the
above statement, it needs to first off be aware of the cpu frequency and
idle state details. *Therefore as a first step we need better knowledge
in scheduler before it makes global decisions*.
Also note a scheduler cannot under the above circumstances talk back and
forth to the governors to begin to learn about idle states and
frequencies at that point. This simply does not make sense.(True at this
point I am heavily contradicting my previous arguments :P. I felt that
the existing communication is good enough and all that was needed a few
more additions, but that does not seem to be the case. )
Arjan also pointed out how the a task running on a slower core, should
be charged less than when it runs on a faster core. Right here is a use
case for scheduler to be aware of the cpu frequency of a core, since
today it is the one which charges a task, but is not aware of what cpu
frequency it is running on.(It is aware of cpu frequency of core through
cpu power stats, but it uses it only for load balancing today and not
when it charges a task for its run time).
My suggestion at this point is :
1. Begin to move the cpuidle and cpufreq *governor* logic into the
scheduler little by little.
2. Scheduler is already aware of the topology details, maybe enhance
that as the next step.
At this point, we would have a scheduler well aware of the effect of its
load balancing decisions to some extent.
3. Add the logic for the scheduler to get a global view of the cpufreq
and idle.
4. Then get system user policies (powersave/performance) to alter
scheduler behavior accordingly.
At this point if we bring in today's patchsets (power aware scheduling
and packing tasks), they could fetch us their intended benefits pretty
much in most cases as against sporadic behaviour, because
the scheduler is aware of the whole picture and will do what these
patches command only if it is right till the point of idle states and
cpu frequencies and not just till load balancing.
I would appreciate all of yours feedback on the above. I think at this
point we are in a position to judge what would be the next move in this
direction and make that move soon.
Regards
Preeti U Murthy
Hi,
On Fri, May 31, 2013 at 11:52:04AM +0100, Ingo Molnar wrote:
>
> * Morten Rasmussen <[email protected]> wrote:
>
> > Hi,
> >
> > A number of patch sets related to power-efficient scheduling have been
> > posted over the last couple of months. Most of them do not have much
> > data to back them up, so I decided to do some testing.
>
> Thanks, numbers are always welcome!
>
> > Measurement technique:
> > Time spent non-idle (not in idle state) for each cpu based on cpuidle
> > ftrace events. TC2 does not have per-core power-gating, so packing
> > inside the A7 cluster does not lead to any significant power savings.
> > Note that any product grade hardware (TC2 is a test-chip) will very
> > likely have per-core power-gating, so in those cases packing will have
> > an appreciable effect on power savings.
> > Measuring non-idle time rather than power should give a more clear idea
> > about the effect of the patch sets given that the idle back-end is
> > highly implementation specific.
>
> Note that I still disagree with the whole design notion of having an "idle
> back-end" (and a 'cpufreq back end') separate from scheduler power saving
> policy, and none of the patch-sets offered so far solve this fundamental
> design problem.
>
> PeterZ and me tried to point out the design requirements previously, but
> it still does not appear to be clear enough to people, so let me spell it
> out again, in a hopefully clearer fashion.
>
> The scheduler has valuable power saving information available:
>
> - when a CPU is busy: about how long the current task expects to run
>
> - when a CPU is idle: how long the current CPU expects _not_ to run
>
> - topology: it knows how the CPUs and caches interrelate and already
> optimizes based on that
>
> - various high level and low level load averages and other metrics about
> the recent past that show how busy a particular CPU is, how busy the
> whole system is, and what the runtime properties of individual tasks is
> (how often it sleeps, etc.)
>
> so the scheduler is in an _ideal_ position to do a judgement call about
> the near future and estimate how deep an idle state a CPU core should
> enter into and what frequency it should run at.
>
> The scheduler is also at a high enough level to host a "I want maximum
> performance, power does not matter to me" user policy override switch and
> similar user policy details.
>
> No ifs and whens about that.
>
> Today the power saving landscape is fragmented and sad: we just randomly
> interface scheduler task packing changes with some idle policy (and
> cpufreq policy), which might or might not combine correctly.
>
> Even when the numbers improve, it's an entirely random, essentially
> unmaintainable property: because there's no clear split (possible) between
> 'scheduler policy' and 'idle policy'. This is why we removed the old,
> broken power saving scheduler code a year ago: to make room for something
> _better_.
>
> So if we want to add back scheduler power saving then what should happen
> is genuinely better code:
>
> To create a new low level idle driver mechanism the scheduler could use
> and integrate proper power saving / idle policy into the scheduler.
>
> In that power saving framework the already existing scheduler topology
> information should be extended with deep idle parameters:
>
> - enumeration of idle states
>
> - how long it takes to enter+exit a particular idle state
>
> - [ perhaps information about how destructive to CPU caches that
> particular idle state is. ]
>
> - new driver entry point that allows the scheduler to enter any of the
> enumerated idle states. Platform code will not change this state, all
> policy decisions and the idle state is decided at the power saving
> policy level.
>
> All of this combines into a 'cost to enter and exit an idle state'
> estimation plus a way to enter idle states. It should be presented to the
> scheduler in a platform independent fashion, but without policy embedded:
> a low level platform driver interface in essence.
>
> Thomas Gleixner's recent work to generalize platform idle routines will
> further help the implementation of this. (that code is upstream already)
>
> _All_ policy, all metrics, all averaging should happen at the scheduler
> power saving level, in a single place, and then the scheduler should
> directly drive the new low level idle state driver mechanism.
>
> 'scheduler power saving' and 'idle policy' are one and the same principle
> and they should be handled in a single place to offer the best power
> saving results.
>
> Note that any RFC patch-set that offers an implementation for this could
> be structured in a gradual fashion: only implementing it for a limited CPU
> range initially. The new framework can then be extended to more and more
> CPUs and architectures, incorporating more complicated power saving
> features gradually. (The old, existing idle policy code would remain
> untouched and available - it would simply not be used when the new policy
> is activated.)
>
> I.e. I'm not asking for a 'rewrite the world' kind of impossible task -
> I'm providing an actionable path to get improved power saving upstream,
> but it has to use a _sane design_.
>
> This is a "line in the sand", a 'must have' design property for any
> scheduler power saving patches to be acceptable - and I'm NAK-ing
> incomplete approaches that don't solve the root design cause of our power
> saving troubles...
>
Looking at the discussion it seems that people have slightly different
views, but most agree that the goal is an integrated scheduling,
frequency, and idle policy like you pointed out from the beginning.
What is less clear is how such design would look like. Catalin has
suggested two different approaches. Integrating cpufreq into the load
balancing, or let the scheduler focus on load balancing and extend
cpufreq to also restrict number of cpus available to the scheduler using
cpu_power. The former approach would increase the scheduler complexity
significantly as I already highlighted in my first reply. The latter
approach introduces a way to, at lease initially, separate load
balancing from capacity management, which I think is an interesting
approach. Based on this idea I propose the following design:
+-----------------+
| | +----------+
current load | Power scheduler |<----+ cpufreq |
+--------->| sched/power.c +---->| driver |
| | | +----------+
| +-------+---------+
| ^ |
+-----+---------+ | |
| | | | available capacity
| Scheduler |<--+----+ (e.g. cpu_power)
| sched/fair.c | |
| +--+|
+---------------+ ||
^ ||
| v|
+---------+--------+ +----------+
| task load metric | | cpuidle |
| arch/* | | driver |
+------------------+ +----------+
The intention is that the power scheduler will implement the (unified)
power policy. It gets the current load of the system from the scheduler.
Based on this information it will adjust the compute capacity available
to the scheduler and drive frequency changes such that enough compute
capacity is available to handle the current load. If the total load can
be handled by a subset of cpus, it will reduce the capacity of the
excess cpus to 0 (cpu_power=1). Likewise, if the load increases it will
increase capacity of one or more idle cpus to allow the scheduler to
spread the load. The power scheduler has knowledge about the power
topology and will guide the scheduler to idle the most optimum cpus by
reducing its capacity. Global idle decision will be handled by the power
scheduler, so cpuidle can over time be reduced to become just a driver,
once we have added C-state selection to the power scheduler.
The scheduler is left to focus on scheduling mechanics and finding the
best possible load balance on the cpu capacities set by the power
scheduler. It will share a detailed view of the current load with the
power scheduler to enable it to make the right capacity adjustments. The
scheduler will need some optimization to cope better with asymmetric
compute capacities. We may want to reduce capacity of some cpu to
increase their idle time while letting others take the majority of the
load.
Frequency scaling has a problematic impact on PJT's load metic, which
was pointed out a while ago by Chris Redpath
<https://lkml.org/lkml/2013/4/16/289>. So I agree with Arjan's
suggestion to change the load calculation basis to something which is
frequency invariant. Use whatever counters that are available on the
specific platform.
I'm aware that the scheduler and power scheduler decisions may be
inextricably linked so we may decide to merge them. However, I think it
is worth trying to keep the power scheduling decisions out of the
scheduler until we have proven it infeasible.
We are going to start working on this design and see where it takes us.
We will post any results and suggested patches for folk to comment on.
As a starting point we are planning to create a power scheduler
(kernel/sched/power.c) similar to a cpufreq governor that does capacity
management, and then evolve the solution from there.
Morten
> Thanks,
>
> Ingo
>
On Fri, Jun 14, 2013 at 05:05:22PM +0100, Morten Rasmussen wrote:
> The intention is that the power scheduler will implement the (unified)
> power policy. It gets the current load of the system from the scheduler.
> Based on this information it will adjust the compute capacity available
> to the scheduler and drive frequency changes such that enough compute
> capacity is available to handle the current load. If the total load can
> be handled by a subset of cpus, it will reduce the capacity of the
> excess cpus to 0 (cpu_power=1). Likewise, if the load increases it will
> increase capacity of one or more idle cpus to allow the scheduler to
> spread the load. The power scheduler has knowledge about the power
> topology and will guide the scheduler to idle the most optimum cpus by
> reducing its capacity. Global idle decision will be handled by the power
> scheduler, so cpuidle can over time be reduced to become just a driver,
> once we have added C-state selection to the power scheduler.
>
> The scheduler is left to focus on scheduling mechanics and finding the
> best possible load balance on the cpu capacities set by the power
> scheduler. It will share a detailed view of the current load with the
> power scheduler to enable it to make the right capacity adjustments. The
> scheduler will need some optimization to cope better with asymmetric
> compute capacities. We may want to reduce capacity of some cpu to
> increase their idle time while letting others take the majority of the
> load.
...
> I'm aware that the scheduler and power scheduler decisions may be
> inextricably linked so we may decide to merge them. However, I think it
> is worth trying to keep the power scheduling decisions out of the
> scheduler until we have proven it infeasible.
Thanks for posting this, I agree with the proposal. I would like to
emphasise that this is a rather "divide and conquer" approach to
reaching a unified solution. Some of the steps involved (not necessarily
in this order):
1. Introduction of a power scheduler (replacing cpufreq governor) aware
of the overall load and CPU capacities. It requests CPU frequency
changes from the low-level cpufreq driver and gives hints to the task
scheduler about load asymmetry (via cpu_power).
2. More accurate task load tracking (an attempt here -
https://lkml.org/lkml/2013/4/16/289 - but possibly better accuracy
using CPU cycles or other arch-specific counters).
3. Load balancer improvements for asymmetric CPU performance levels
(e.g. frequency scaling).
4. Power scheduler driving the CPU idle decisions (replacing the cpuidle
governor).
5. Power scheduler increased awareness of the run-queues content
(number of tasks, individual task loads) and load balancer behaviour,
feeding extra hints back to the load balancer (e.g. only move tasks
below/above certain load, trigger a load balance).
6. Performance vs power saving tuning (policies).
7. More specific optimisations based on the CPU topology (big.little,
turbo boost, etc.)
?. Lots of other things based on testing and community reviews.
Step 5 above will further increase the coupling between load balancer
and power scheduler and we could end up with a unified implementation.
But before then it is simpler to reason in terms of (a) better load
balancing in an asymmetric configuration and (b) CPU capacity needed for
the overall load.
--
Catalin
On Fri, 14 Jun 2013, Morten Rasmussen wrote:
> Looking at the discussion it seems that people have slightly different
> views, but most agree that the goal is an integrated scheduling,
> frequency, and idle policy like you pointed out from the beginning.
>
> What is less clear is how such design would look like. Catalin has
> suggested two different approaches. Integrating cpufreq into the load
> balancing, or let the scheduler focus on load balancing and extend
> cpufreq to also restrict number of cpus available to the scheduler using
> cpu_power. The former approach would increase the scheduler complexity
> significantly as I already highlighted in my first reply. The latter
> approach introduces a way to, at lease initially, separate load
> balancing from capacity management, which I think is an interesting
> approach. Based on this idea I propose the following design:
>
> +-----------------+
> | | +----------+
> current load | Power scheduler |<----+ cpufreq |
> +--------->| sched/power.c +---->| driver |
> | | | +----------+
> | +-------+---------+
> | ^ |
> +-----+---------+ | |
> | | | | available capacity
> | Scheduler |<--+----+ (e.g. cpu_power)
> | sched/fair.c | |
> | +--+|
> +---------------+ ||
> ^ ||
> | v|
> +---------+--------+ +----------+
> | task load metric | | cpuidle |
> | arch/* | | driver |
> +------------------+ +----------+
>
> The intention is that the power scheduler will implement the (unified)
> power policy. It gets the current load of the system from the scheduler.
> Based on this information it will adjust the compute capacity available
> to the scheduler and drive frequency changes such that enough compute
> capacity is available to handle the current load. If the total load can
> be handled by a subset of cpus, it will reduce the capacity of the
> excess cpus to 0 (cpu_power=1). Likewise, if the load increases it will
> increase capacity of one or more idle cpus to allow the scheduler to
> spread the load. The power scheduler has knowledge about the power
> topology and will guide the scheduler to idle the most optimum cpus by
> reducing its capacity. Global idle decision will be handled by the power
> scheduler, so cpuidle can over time be reduced to become just a driver,
> once we have added C-state selection to the power scheduler.
>
> The scheduler is left to focus on scheduling mechanics and finding the
> best possible load balance on the cpu capacities set by the power
> scheduler. It will share a detailed view of the current load with the
> power scheduler to enable it to make the right capacity adjustments. The
> scheduler will need some optimization to cope better with asymmetric
> compute capacities. We may want to reduce capacity of some cpu to
> increase their idle time while letting others take the majority of the
> load.
>
> Frequency scaling has a problematic impact on PJT's load metic, which
> was pointed out a while ago by Chris Redpath
> <https://lkml.org/lkml/2013/4/16/289>. So I agree with Arjan's
> suggestion to change the load calculation basis to something which is
> frequency invariant. Use whatever counters that are available on the
> specific platform.
>
> I'm aware that the scheduler and power scheduler decisions may be
> inextricably linked so we may decide to merge them. However, I think it
> is worth trying to keep the power scheduling decisions out of the
> scheduler until we have proven it infeasible.
>
> We are going to start working on this design and see where it takes us.
> We will post any results and suggested patches for folk to comment on.
> As a starting point we are planning to create a power scheduler
> (kernel/sched/power.c) similar to a cpufreq governor that does capacity
> management, and then evolve the solution from there.
I don't think that you are passing nearly enough information around.
A fairly simple example
take a relatively modern 4-core system with turbo mode where speed controls
affect two cores at a time (I don't know the details of the available CPUs to
know if this is an exact fit to any existing system, but I think it's a
reasonable fit)
If you are running with a loadave of 2, should you power down 2 cores and run
the other two in turbo mode, power down 2 cores and not increase the speed, or
leave all 4 cores running as is.
Depending on the mix of processes, I could see any one of the three being the
right answer.
If you have a process that's maxing out it's cpu time on one core, going to
turbo mode is the right thing as the other processes should fit on the other
core and that process will use more CPU (theoretically getting done sooner)
If no process is close to maxing out the core, then if you are in power saving
mode, you probably want to shut down two cores and run everything on the other
two
If you only have two processes eating almost all your CPU time, going to two
cores is probably the right thing to do.
If you have more processes, each eating a little bit of time, then continuing
to run on all four cores uses more cache, and could let all of the tasks finish
faster.
So, how is the Power Scheduler going to get this level of information?
It doesn't seem reasonable to either pass this much data around, or to try and
give two independant tools access to the same raw data (since that data is so
tied to the internal details of the scheduler). If we are talking two parts of
the same thing, then it's perfectly legitimate to have this sort of intimate
knowledge of the internal data structures.
Also, if the power scheduler puts the cores at different speeds, how is the
balancing scheduler supposed to know so that it can schedule appropriately? This
is the bigLittle problem again.
It's this level of knowledge that both the power management and the scheduler
need to know about what's going on in the guts of the other that make me say
that they really are going to need to be merged.
The routines to change the core modes will be external, and will vary wildly
between different systems, but the decision making logic should be unified.
David Lang
On Tue, Jun 18, 2013 at 02:37:21AM +0100, David Lang wrote:
>
> On Fri, 14 Jun 2013, Morten Rasmussen wrote:
>
> > Looking at the discussion it seems that people have slightly different
> > views, but most agree that the goal is an integrated scheduling,
> > frequency, and idle policy like you pointed out from the beginning.
> >
> > What is less clear is how such design would look like. Catalin has
> > suggested two different approaches. Integrating cpufreq into the load
> > balancing, or let the scheduler focus on load balancing and extend
> > cpufreq to also restrict number of cpus available to the scheduler using
> > cpu_power. The former approach would increase the scheduler complexity
> > significantly as I already highlighted in my first reply. The latter
> > approach introduces a way to, at lease initially, separate load
> > balancing from capacity management, which I think is an interesting
> > approach. Based on this idea I propose the following design:
> >
> > +-----------------+
> > | | +----------+
> > current load | Power scheduler |<----+ cpufreq |
> > +--------->| sched/power.c +---->| driver |
> > | | | +----------+
> > | +-------+---------+
> > | ^ |
> > +-----+---------+ | |
> > | | | | available capacity
> > | Scheduler |<--+----+ (e.g. cpu_power)
> > | sched/fair.c | |
> > | +--+|
> > +---------------+ ||
> > ^ ||
> > | v|
> > +---------+--------+ +----------+
> > | task load metric | | cpuidle |
> > | arch/* | | driver |
> > +------------------+ +----------+
> >
> > The intention is that the power scheduler will implement the (unified)
> > power policy. It gets the current load of the system from the scheduler.
> > Based on this information it will adjust the compute capacity available
> > to the scheduler and drive frequency changes such that enough compute
> > capacity is available to handle the current load. If the total load can
> > be handled by a subset of cpus, it will reduce the capacity of the
> > excess cpus to 0 (cpu_power=1). Likewise, if the load increases it will
> > increase capacity of one or more idle cpus to allow the scheduler to
> > spread the load. The power scheduler has knowledge about the power
> > topology and will guide the scheduler to idle the most optimum cpus by
> > reducing its capacity. Global idle decision will be handled by the power
> > scheduler, so cpuidle can over time be reduced to become just a driver,
> > once we have added C-state selection to the power scheduler.
> >
> > The scheduler is left to focus on scheduling mechanics and finding the
> > best possible load balance on the cpu capacities set by the power
> > scheduler. It will share a detailed view of the current load with the
> > power scheduler to enable it to make the right capacity adjustments. The
> > scheduler will need some optimization to cope better with asymmetric
> > compute capacities. We may want to reduce capacity of some cpu to
> > increase their idle time while letting others take the majority of the
> > load.
> >
> > Frequency scaling has a problematic impact on PJT's load metic, which
> > was pointed out a while ago by Chris Redpath
> > <https://lkml.org/lkml/2013/4/16/289>. So I agree with Arjan's
> > suggestion to change the load calculation basis to something which is
> > frequency invariant. Use whatever counters that are available on the
> > specific platform.
> >
> > I'm aware that the scheduler and power scheduler decisions may be
> > inextricably linked so we may decide to merge them. However, I think it
> > is worth trying to keep the power scheduling decisions out of the
> > scheduler until we have proven it infeasible.
> >
> > We are going to start working on this design and see where it takes us.
> > We will post any results and suggested patches for folk to comment on.
> > As a starting point we are planning to create a power scheduler
> > (kernel/sched/power.c) similar to a cpufreq governor that does capacity
> > management, and then evolve the solution from there.
>
> I don't think that you are passing nearly enough information around.
>
> A fairly simple example
>
> take a relatively modern 4-core system with turbo mode where speed controls
> affect two cores at a time (I don't know the details of the available CPUs to
> know if this is an exact fit to any existing system, but I think it's a
> reasonable fit)
>
> If you are running with a loadave of 2, should you power down 2 cores and run
> the other two in turbo mode, power down 2 cores and not increase the speed, or
> leave all 4 cores running as is.
>
> Depending on the mix of processes, I could see any one of the three being the
> right answer.
>
> If you have a process that's maxing out it's cpu time on one core, going to
> turbo mode is the right thing as the other processes should fit on the other
> core and that process will use more CPU (theoretically getting done sooner)
>
> If no process is close to maxing out the core, then if you are in power saving
> mode, you probably want to shut down two cores and run everything on the other
> two
>
> If you only have two processes eating almost all your CPU time, going to two
> cores is probably the right thing to do.
>
> If you have more processes, each eating a little bit of time, then continuing
> to run on all four cores uses more cache, and could let all of the tasks finish
> faster.
>
>
> So, how is the Power Scheduler going to get this level of information?
>
> It doesn't seem reasonable to either pass this much data around, or to try and
> give two independant tools access to the same raw data (since that data is so
> tied to the internal details of the scheduler). If we are talking two parts of
> the same thing, then it's perfectly legitimate to have this sort of intimate
> knowledge of the internal data structures.
I realize that my description is not very clear about this point. Total
load is clearly not enough information for the power scheduler to take
any reasonable decisions. By current load, I mean per-cpu load, number
of tasks, and possibly more task statistics. Enough information to
determine the best use of the system cpus.
As stated in my previous reply, this is not the ultimate design. It
expect to have many design iterations. If it turns out that it doesn't
make sense to have a separate power scheduler, then we should merge
them. I just propose to divide the design into manageable components. A
unified design covering the scheduler, two other policy frameworks, and
new policies is too complex in my opinion.
The power scheduler may be viewed as an external extension to the
periodic scheduler load balance. I don't see a major problem in
accessing raw data in the scheduler. The power scheduler will live in
sched/power.c. In a unified solution where you put everything into
sched/fair.c you would still need access to the same raw data to make
the right power scheduling decisions. By having the power scheduler
separately we just attempt to minimize the entanglement.
>
>
> Also, if the power scheduler puts the cores at different speeds, how is the
> balancing scheduler supposed to know so that it can schedule appropriately? This
> is the bigLittle problem again.
>
> It's this level of knowledge that both the power management and the scheduler
> need to know about what's going on in the guts of the other that make me say
> that they really are going to need to be merged.
>
The scheduler will need to be tuned to make the "right" load balancing
decisions based on the compute capacity made available by the power
scheduler. That includes dealing with symmetric systems with different
cpu frequencies and asymmetric systems, like bigLittle. Clearly, the
power scheduler must be able to trust that the load balancer will do the
right thing.
In an example scenario on bigLittle where you have a single task fully
utilizing a single Little cpu, I would expect the power scheduler to
detect this situation and enable a big cpu (increase its cpu_power). The
tuned load balancer will then move the task to the cpu with the highest
capacity.
So, the power scheduler should figure out the best setup for the current
load, and the scheduler (load balancer) should take care of putting the
right tasks on the right cpus according to the capacities (cpu_power)
set by the power scheduler. For this to work the load balancer must
adhere to a set of rules such that the power scheduler can reason about
the the load balancer behaviour like in the above example. Moving big
tasks to cpus with highest capacity is one of these rules. More will
probably be needed as we refine the design.
Morten
>
> The routines to change the core modes will be external, and will vary wildly
> between different systems, but the decision making logic should be unified.
>
> David Lang
>
On 6/14/2013 9:05 AM, Morten Rasmussen wrote:
> Looking at the discussion it seems that people have slightly different
> views, but most agree that the goal is an integrated scheduling,
> frequency, and idle policy like you pointed out from the beginning.
... except that such a solution does not really work for Intel hardware.
The OS does not get to really pick the CPU "frequency" (never mind that
frequency is not what gets controlled), the hardware picks the frequency.
The OS can do some level of requests (best to think of this as a percentage
more than frequency) but what you actually get is more often than not
what you asked for.
You can look in hindsight what kind of performance you got (from some basic
counters in MSRs), and the scheduler can use that to account backwards to what some process
got. But to predict what you will get in the future...... that's near impossible
on any realistic system nowadays (and even more so in the future).
Treating "frequency" (well "performance) and idle separately is also a false thing to do
(yes I know in 3.9/3.10 we still do that for Intel hw, but we're working
on fixing that). They are by no means separate things. One guy's idle state
is the other guys power budget (and thus performance)!.
On Tue, 18 Jun 2013, Morten Rasmussen wrote:
>> I don't think that you are passing nearly enough information around.
>>
>> A fairly simple example
>>
>> take a relatively modern 4-core system with turbo mode where speed controls
>> affect two cores at a time (I don't know the details of the available CPUs to
>> know if this is an exact fit to any existing system, but I think it's a
>> reasonable fit)
>>
>> If you are running with a loadave of 2, should you power down 2 cores and run
>> the other two in turbo mode, power down 2 cores and not increase the speed, or
>> leave all 4 cores running as is.
>>
>> Depending on the mix of processes, I could see any one of the three being the
>> right answer.
>>
>> If you have a process that's maxing out it's cpu time on one core, going to
>> turbo mode is the right thing as the other processes should fit on the other
>> core and that process will use more CPU (theoretically getting done sooner)
>>
>> If no process is close to maxing out the core, then if you are in power saving
>> mode, you probably want to shut down two cores and run everything on the other
>> two
>>
>> If you only have two processes eating almost all your CPU time, going to two
>> cores is probably the right thing to do.
>>
>> If you have more processes, each eating a little bit of time, then continuing
>> to run on all four cores uses more cache, and could let all of the tasks finish
>> faster.
>>
>>
>> So, how is the Power Scheduler going to get this level of information?
>>
>> It doesn't seem reasonable to either pass this much data around, or to try and
>> give two independant tools access to the same raw data (since that data is so
>> tied to the internal details of the scheduler). If we are talking two parts of
>> the same thing, then it's perfectly legitimate to have this sort of intimate
>> knowledge of the internal data structures.
>
> I realize that my description is not very clear about this point. Total
> load is clearly not enough information for the power scheduler to take
> any reasonable decisions. By current load, I mean per-cpu load, number
> of tasks, and possibly more task statistics. Enough information to
> determine the best use of the system cpus.
>
> As stated in my previous reply, this is not the ultimate design. It
> expect to have many design iterations. If it turns out that it doesn't
> make sense to have a separate power scheduler, then we should merge
> them. I just propose to divide the design into manageable components. A
> unified design covering the scheduler, two other policy frameworks, and
> new policies is too complex in my opinion.
>
> The power scheduler may be viewed as an external extension to the
> periodic scheduler load balance. I don't see a major problem in
> accessing raw data in the scheduler. The power scheduler will live in
> sched/power.c. In a unified solution where you put everything into
> sched/fair.c you would still need access to the same raw data to make
> the right power scheduling decisions. By having the power scheduler
> separately we just attempt to minimize the entanglement.
Why insist on this being treated as an external component that you have to pass
messages to?
If you allow it to be combined, then it can lookup the info it needs rather than
trying to define an API between the two that accounts for everything that you
need to know (now and in the future)
This will mean that as the internals of one change it will affect the internals
of the other, but it seems like this is far more likely to be successful.
If you have hundreds or thousands of processes, it's bad enough to lookup the
data directly, but trying to marshal the infromation to send it to a separate
component seems counterproductive.
David Lang
On Tue, 18 Jun 2013, Arjan van de Ven wrote:
> On 6/14/2013 9:05 AM, Morten Rasmussen wrote:
>
>> Looking at the discussion it seems that people have slightly different
>> views, but most agree that the goal is an integrated scheduling,
>> frequency, and idle policy like you pointed out from the beginning.
>
>
> ... except that such a solution does not really work for Intel hardware.
>
> The OS does not get to really pick the CPU "frequency" (never mind that
> frequency is not what gets controlled), the hardware picks the frequency.
> The OS can do some level of requests (best to think of this as a percentage
> more than frequency) but what you actually get is more often than not
> what you asked for.
so this sounds to me like the process for changing settings on this Intel
hardware is a two phase process
something looks up what should be possible and says "switch to mode X"
after mode switch happens it then looks and finds "it's now in mode Y"
As long as there is some table to list the possible X modes to switch to, and
some table to lookup the characteristics of the possible Y modes that you are in
(and the list of modes you can change to may be different depending on what mode
you are in), this doesn't seem to be a huge problem.
And if you can't tell what mode you are in, or what the expected performance
characteristics are, then you can't possibly do any intellegant allocations.
If Intel is doing this for current CPUs, I expect that they will fix this before
too much longer.
> You can look in hindsight what kind of performance you got (from some basic
> counters in MSRs), and the scheduler can use that to account backwards to what
> some process got. But to predict what you will get in the future...... that's
> near impossible on any realistic system nowadays (and even more so in the
> future).
If you have no way of knowing how much processing power you should expect to
have on each core in the near future, then you have no way of allocating
processes appropriately between the cores.
It's bad enough trying to guess the needs of the processes, but if you also are
reduced to guessing the capabilities of the cores, how can anything be made to
work?
David Lang
On Tue, Jun 18, 2013 at 04:20:28PM +0100, Arjan van de Ven wrote:
> On 6/14/2013 9:05 AM, Morten Rasmussen wrote:
> > Looking at the discussion it seems that people have slightly different
> > views, but most agree that the goal is an integrated scheduling,
> > frequency, and idle policy like you pointed out from the beginning.
>
> ... except that such a solution does not really work for Intel hardware.
I think it can work (see below).
> The OS does not get to really pick the CPU "frequency" (never mind that
> frequency is not what gets controlled), the hardware picks the frequency.
> The OS can do some level of requests (best to think of this as a percentage
> more than frequency) but what you actually get is more often than not
> what you asked for.
Morten's proposal does not try to "pick" a frequency. The P-state change
is still done gradually based on the load (so we still have an adaptive
loop). The load (total or per-task) can be tracked in an arch-specific
way (using aperf/mperf on x86).
The difference from what intel_pstate.c does now is that it has a view
of the total load (across all CPUs) and the run-queue content. It can
"guide" the load balancer into favouring one or two CPUs and ignoring
the rest (using cpu_power).
If several CPUs have small aperf/mperf ratio, it can decide to use fewer
CPUs at a higher aperf/mperf by telling the load balancer not to use
them (cpu_power = 1). All of this is continuously re-adjusted to cope
with changes in the load and hardware variations like turbo boost.
Similarly, if a CPU has aperf/mperf >= 1, it keeps increasing the
P-state (depending on the policy). Once it got to the highest level,
depending on the number of threads in the run-queue (doesn't make sense
for only one), it can open up other CPUs and let the load balancer use
them.
> You can look in hindsight what kind of performance you got (from some basic
> counters in MSRs), and the scheduler can use that to account backwards to what some process
> got. But to predict what you will get in the future...... that's near impossible
> on any realistic system nowadays (and even more so in the future).
We don't need absolute figures matching load to P-states but we'll
continue with an adaptive system. What we have now is also an adaptive
system but with independent decisions taken by the load balancer and the
P-state driver. The load balancer can even get confused by the cpufreq
decisions and move tasks around unnecessarily. With Morten's proposal we
get the power scheduler to adjust the P-state while giving hints to the
load balancer at the same time (it adjusts both, it doesn't try to
re-adjust itself after the load balancer).
> Treating "frequency" (well "performance) and idle separately is also a false thing to do
> (yes I know in 3.9/3.10 we still do that for Intel hw, but we're working
> on fixing that). They are by no means separate things. One guy's idle state
> is the other guys power budget (and thus performance)!.
I agree.
--
Catalin
On 6/18/2013 10:47 AM, David Lang wrote:
>
> so this sounds to me like the process for changing settings on this Intel hardware is a two phase process
>
> something looks up what should be possible and says "switch to mode X"
more a case of "I would like to request X"
it's not a mandate, it's a polite request/suggestion
> after mode switch happens it then looks and finds "it's now in mode Y"
you don't really know what you are in, you can only really know on average what you were in over
some time in the past.
As such, Y is not really discrete/enumeratable (well since it's all fixed point math, it is, sure, in steps if 1 Hz)
the "current" thing is changing all the time on a very fine grained timescale, depending on
what the other cores in the system are doing, what graphics is doing, what the temperature is etc etc.
> And if you can't tell what mode you are in, or what the expected performance characteristics are, then you can't possibly do any intellegant allocations.
you can tell what you were in looking in the rear-view mirror. you have no idea what it'll be going forward.
>
> If Intel is doing this for current CPUs, I expect that they will fix this before too much longer.
I'm pretty sure that won't happen, and I'm also pretty sure the other CPU vendors are either there today (AMD) or
will be there in the next few years (ARM).
It's the nature of how CPUs do power and thermal management and the physics behind that.
>> You can look in hindsight what kind of performance you got (from some basic counters in MSRs), and the scheduler can use that to account backwards to what some process
>> got. But to predict what you will get in the future...... that's near impossible on any realistic system nowadays (and even more so in the future).
>
> If you have no way of knowing how much processing power you should expect to have on each core in the near future, then you have no way of allocating processes
> appropriately between the cores.
>
> It's bad enough trying to guess the needs of the processes, but if you also are reduced to guessing the capabilities of the cores, how can anything be made to work?
you can give some suggestions to the hardware. But how much you actually get can be off by 2x or more in either direction.
And most of that will depend on what other cores/graphics in the system are doing
(in terms of idle or their own requests and the amount of the total power budget they are consuming)
On Tue, Jun 18, 2013 at 06:39:27PM +0100, David Lang wrote:
> On Tue, 18 Jun 2013, Morten Rasmussen wrote:
>
> >> I don't think that you are passing nearly enough information around.
> >>
> >> A fairly simple example
> >>
> >> take a relatively modern 4-core system with turbo mode where speed controls
> >> affect two cores at a time (I don't know the details of the available CPUs to
> >> know if this is an exact fit to any existing system, but I think it's a
> >> reasonable fit)
> >>
> >> If you are running with a loadave of 2, should you power down 2 cores and run
> >> the other two in turbo mode, power down 2 cores and not increase the speed, or
> >> leave all 4 cores running as is.
> >>
> >> Depending on the mix of processes, I could see any one of the three being the
> >> right answer.
> >>
> >> If you have a process that's maxing out it's cpu time on one core, going to
> >> turbo mode is the right thing as the other processes should fit on the other
> >> core and that process will use more CPU (theoretically getting done sooner)
> >>
> >> If no process is close to maxing out the core, then if you are in power saving
> >> mode, you probably want to shut down two cores and run everything on the other
> >> two
> >>
> >> If you only have two processes eating almost all your CPU time, going to two
> >> cores is probably the right thing to do.
> >>
> >> If you have more processes, each eating a little bit of time, then continuing
> >> to run on all four cores uses more cache, and could let all of the tasks finish
> >> faster.
> >>
> >>
> >> So, how is the Power Scheduler going to get this level of information?
> >>
> >> It doesn't seem reasonable to either pass this much data around, or to try and
> >> give two independant tools access to the same raw data (since that data is so
> >> tied to the internal details of the scheduler). If we are talking two parts of
> >> the same thing, then it's perfectly legitimate to have this sort of intimate
> >> knowledge of the internal data structures.
> >
> > I realize that my description is not very clear about this point. Total
> > load is clearly not enough information for the power scheduler to take
> > any reasonable decisions. By current load, I mean per-cpu load, number
> > of tasks, and possibly more task statistics. Enough information to
> > determine the best use of the system cpus.
> >
> > As stated in my previous reply, this is not the ultimate design. It
> > expect to have many design iterations. If it turns out that it doesn't
> > make sense to have a separate power scheduler, then we should merge
> > them. I just propose to divide the design into manageable components. A
> > unified design covering the scheduler, two other policy frameworks, and
> > new policies is too complex in my opinion.
> >
> > The power scheduler may be viewed as an external extension to the
> > periodic scheduler load balance. I don't see a major problem in
> > accessing raw data in the scheduler. The power scheduler will live in
> > sched/power.c. In a unified solution where you put everything into
> > sched/fair.c you would still need access to the same raw data to make
> > the right power scheduling decisions. By having the power scheduler
> > separately we just attempt to minimize the entanglement.
>
> Why insist on this being treated as an external component that you have to pass
> messages to?
>
> If you allow it to be combined, then it can lookup the info it needs rather than
> trying to define an API between the two that accounts for everything that you
> need to know (now and in the future)
I don't see why you cannot read the internal scheduler data structures
from the power scheduler (with appropriate attention to locking). The
point of the proposed design is not to define interfaces, it is to divide
the problem into manageable components.
Let me repeat again, if we while developing the solution find out that
the separation doesn't make sense I have no problem merging them. I
don't insist on the separation, my point is that we need to partition
this very complex problem and let it evolve into a reasonable solution.
>
> This will mean that as the internals of one change it will affect the internals
> of the other, but it seems like this is far more likely to be successful.
>
That is no different from having a merged design. If you change
something in the scheduler you would have to consider all the power
implications anyway. The power scheduler design would give you at least
a vague separation and the possibility of not having a power scheduler
at all.
> If you have hundreds or thousands of processes, it's bad enough to lookup the
> data directly, but trying to marshal the infromation to send it to a separate
> component seems counterproductive.
I don't see why that should be necessary.
Morten
On 6/18/2013 10:47 AM, David Lang wrote:
>
> It's bad enough trying to guess the needs of the processes, but if you also are reduced to guessing the capabilities of the cores, how can anything be made to work?
btw one way to look at this is to assume that (with some minimal hinting)
the CPU driver will do the right thing and get you just about the best performance you can get
(that is appropriate for the task at hand)...
... and don't do anything in the scheduler proactively.
Now for big.little and other temporary or permanent asymmetries, we may want to
have a "max performance level" type indicator, and that's fair enough
(and this can be dynamic, since it for thermal reasons this can change over time,
but on a somewhat slower timescale)
the hints I have in mind are not all that complex; we have the biggest issues today
around task migration (the task migrates to a cold cpu... so a simple notifier chain
on the new cpu as it is accepting a task and we can bump it up), real time tasks
(again, simple notifier chain to get you to a predictably high performance level)
and we're a long way better than we are today in terms of actual problems.
For all the talk of ondemand (as ARM still uses that today)... that guy puts you in
either the lowest or highest frequency over 95% of the time. Other non-cpufreq solutions
like on Intel are bit more advanced (and will grow more so over time), but even there,
in the grand scheme of things, the scheduler shouldn't have to care anymore with those
two notifiers in place.
On Wed, Jun 19, 2013 at 04:39:39PM +0100, Arjan van de Ven wrote:
> On 6/18/2013 10:47 AM, David Lang wrote:
>
> >
> > It's bad enough trying to guess the needs of the processes, but if you also are reduced to guessing the capabilities of the cores, how can anything be made to work?
>
> btw one way to look at this is to assume that (with some minimal hinting)
> the CPU driver will do the right thing and get you just about the best performance you can get
> (that is appropriate for the task at hand)...
> ... and don't do anything in the scheduler proactively.
If I understand correctly, you mean if your hardware/firmware is fully
in control of the p-state selection and changes it fast enough to match
the current load, the scheduler doesn't have to care? By fast enough I
mean, faster than the scheduler would notice if a cpu was temporarily
overloaded at a low p-state. In that case, you wouldn't need
cpufreq/p-state hints, and the scheduler would only move tasks between
cpus when cpus are fully loaded at their max p-state.
>
> Now for big.little and other temporary or permanent asymmetries, we may want to
> have a "max performance level" type indicator, and that's fair enough
> (and this can be dynamic, since it for thermal reasons this can change over time,
> but on a somewhat slower timescale)
>
>
> the hints I have in mind are not all that complex; we have the biggest issues today
> around task migration (the task migrates to a cold cpu... so a simple notifier chain
> on the new cpu as it is accepting a task and we can bump it up), real time tasks
> (again, simple notifier chain to get you to a predictably high performance level)
> and we're a long way better than we are today in terms of actual problems.
>
> For all the talk of ondemand (as ARM still uses that today)... that guy puts you in
> either the lowest or highest frequency over 95% of the time. Other non-cpufreq solutions
> like on Intel are bit more advanced (and will grow more so over time), but even there,
> in the grand scheme of things, the scheduler shouldn't have to care anymore with those
> two notifiers in place.
You would need more than a few hints to implement more advanced capacity
management like proposed for the power scheduler. I believe that Intel
would benefit as well from guiding the scheduler to idle the right cpu
to enable deeper idle states and/or enable turbo-boost for other cpus.
Morten
On 6/19/2013 10:00 AM, Morten Rasmussen wrote:
> On Wed, Jun 19, 2013 at 04:39:39PM +0100, Arjan van de Ven wrote:
>> On 6/18/2013 10:47 AM, David Lang wrote:
>>
>>>
>>> It's bad enough trying to guess the needs of the processes, but if you also are reduced to guessing the capabilities of the cores, how can anything be made to work?
>>
>> btw one way to look at this is to assume that (with some minimal hinting)
>> the CPU driver will do the right thing and get you just about the best performance you can get
>> (that is appropriate for the task at hand)...
>> ... and don't do anything in the scheduler proactively.
>
> If I understand correctly, you mean if your hardware/firmware is fully
hardware, firmware and the driver
> in control of the p-state selection and changes it fast enough to match
> the current load, the scheduler doesn't have to care? By fast enough I
> mean, faster than the scheduler would notice if a cpu was temporarily
> overloaded at a low p-state. In that case, you wouldn't need
> cpufreq/p-state hints, and the scheduler would only move tasks between
> cpus when cpus are fully loaded at their max p-state.
with the migration hint, I'm pretty sure we'll be there today typically.
we'll notice within 10 msec regardless, but the migration hint will take
the edge of those 10 msec normally.
I would argue that the "at their max p-state" in your sentence needs to go away.
since you don't know what you actually are except in hindsight.
And even then you don't know if you could have gone higher or not.
>> the hints I have in mind are not all that complex; we have the biggest issues today
>> around task migration (the task migrates to a cold cpu... so a simple notifier chain
>> on the new cpu as it is accepting a task and we can bump it up), real time tasks
>> (again, simple notifier chain to get you to a predictably high performance level)
>> and we're a long way better than we are today in terms of actual problems.
>>
>> For all the talk of ondemand (as ARM still uses that today)... that guy puts you in
>> either the lowest or highest frequency over 95% of the time. Other non-cpufreq solutions
>> like on Intel are bit more advanced (and will grow more so over time), but even there,
>> in the grand scheme of things, the scheduler shouldn't have to care anymore with those
>> two notifiers in place.
>
> You would need more than a few hints to implement more advanced capacity
> management like proposed for the power scheduler. I believe that Intel
> would benefit as well from guiding the scheduler to idle the right cpu
> to enable deeper idle states and/or enable turbo-boost for other cpus.
that's an interesting theory.
I've yet to see any way to actually have that do something useful.
yes there is some value in grouping a lot of very short tasks together.
not a lot of value, but at least some.
and there is some value in the grouping within a package (to a degree) thing.
(both are basically "statistically, sort left" as policy)
more finegrained than that (esp tied to P states).. not so much.
* Morten Rasmussen <[email protected]> wrote:
> On Fri, May 31, 2013 at 11:52:04AM +0100, Ingo Molnar wrote:
> >
> > * Morten Rasmussen <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > A number of patch sets related to power-efficient scheduling have been
> > > posted over the last couple of months. Most of them do not have much
> > > data to back them up, so I decided to do some testing.
> >
> > Thanks, numbers are always welcome!
> >
> > > Measurement technique:
> > > Time spent non-idle (not in idle state) for each cpu based on cpuidle
> > > ftrace events. TC2 does not have per-core power-gating, so packing
> > > inside the A7 cluster does not lead to any significant power savings.
> > > Note that any product grade hardware (TC2 is a test-chip) will very
> > > likely have per-core power-gating, so in those cases packing will have
> > > an appreciable effect on power savings.
> > > Measuring non-idle time rather than power should give a more clear idea
> > > about the effect of the patch sets given that the idle back-end is
> > > highly implementation specific.
> >
> > Note that I still disagree with the whole design notion of having an "idle
> > back-end" (and a 'cpufreq back end') separate from scheduler power saving
> > policy, and none of the patch-sets offered so far solve this fundamental
> > design problem.
> >
> > PeterZ and me tried to point out the design requirements previously, but
> > it still does not appear to be clear enough to people, so let me spell it
> > out again, in a hopefully clearer fashion.
> >
> > The scheduler has valuable power saving information available:
> >
> > - when a CPU is busy: about how long the current task expects to run
> >
> > - when a CPU is idle: how long the current CPU expects _not_ to run
> >
> > - topology: it knows how the CPUs and caches interrelate and already
> > optimizes based on that
> >
> > - various high level and low level load averages and other metrics about
> > the recent past that show how busy a particular CPU is, how busy the
> > whole system is, and what the runtime properties of individual tasks is
> > (how often it sleeps, etc.)
> >
> > so the scheduler is in an _ideal_ position to do a judgement call about
> > the near future and estimate how deep an idle state a CPU core should
> > enter into and what frequency it should run at.
> >
> > The scheduler is also at a high enough level to host a "I want maximum
> > performance, power does not matter to me" user policy override switch and
> > similar user policy details.
> >
> > No ifs and whens about that.
> >
> > Today the power saving landscape is fragmented and sad: we just randomly
> > interface scheduler task packing changes with some idle policy (and
> > cpufreq policy), which might or might not combine correctly.
> >
> > Even when the numbers improve, it's an entirely random, essentially
> > unmaintainable property: because there's no clear split (possible) between
> > 'scheduler policy' and 'idle policy'. This is why we removed the old,
> > broken power saving scheduler code a year ago: to make room for something
> > _better_.
> >
> > So if we want to add back scheduler power saving then what should happen
> > is genuinely better code:
> >
> > To create a new low level idle driver mechanism the scheduler could use
> > and integrate proper power saving / idle policy into the scheduler.
> >
> > In that power saving framework the already existing scheduler topology
> > information should be extended with deep idle parameters:
> >
> > - enumeration of idle states
> >
> > - how long it takes to enter+exit a particular idle state
> >
> > - [ perhaps information about how destructive to CPU caches that
> > particular idle state is. ]
> >
> > - new driver entry point that allows the scheduler to enter any of the
> > enumerated idle states. Platform code will not change this state, all
> > policy decisions and the idle state is decided at the power saving
> > policy level.
> >
> > All of this combines into a 'cost to enter and exit an idle state'
> > estimation plus a way to enter idle states. It should be presented to the
> > scheduler in a platform independent fashion, but without policy embedded:
> > a low level platform driver interface in essence.
> >
> > Thomas Gleixner's recent work to generalize platform idle routines will
> > further help the implementation of this. (that code is upstream already)
> >
> > _All_ policy, all metrics, all averaging should happen at the scheduler
> > power saving level, in a single place, and then the scheduler should
> > directly drive the new low level idle state driver mechanism.
> >
> > 'scheduler power saving' and 'idle policy' are one and the same principle
> > and they should be handled in a single place to offer the best power
> > saving results.
> >
> > Note that any RFC patch-set that offers an implementation for this could
> > be structured in a gradual fashion: only implementing it for a limited CPU
> > range initially. The new framework can then be extended to more and more
> > CPUs and architectures, incorporating more complicated power saving
> > features gradually. (The old, existing idle policy code would remain
> > untouched and available - it would simply not be used when the new policy
> > is activated.)
> >
> > I.e. I'm not asking for a 'rewrite the world' kind of impossible task -
> > I'm providing an actionable path to get improved power saving upstream,
> > but it has to use a _sane design_.
> >
> > This is a "line in the sand", a 'must have' design property for any
> > scheduler power saving patches to be acceptable - and I'm NAK-ing
> > incomplete approaches that don't solve the root design cause of our power
> > saving troubles...
>
> Thanks for sharing your view.
>
> I agree with idea of having a high level user switch to change
> power/performance policy trade-offs for the system. Not only for
> scheduling. I also share your view that the scheduler is in the ideal
> place to drive the frequency scaling and idle policies.
>
> However, I think that an integrated solution with one unified policy
> implemented in the scheduler would take a significant rewrite of the
> scheduler and the power management frameworks even if we start with just
> a few SoCs.
>
> To reach an integrated solution that does better than the current
> approach there is a range of things that need to be considered:
>
> - Define a power-efficient scheduling policy. Depending on the power
> gating support on the particular system packing tasks may improve
> power-efficiency while spreading the tasks may be better for others.
>
> - Define how the user policy switch works. In previous discussions it
> was proposed to have a high level switch that allows specification of
> what the system should strive to achieve - power saving or performance.
> In those discussions, what power meant wasn't exactly defined.
>
> - Find a generic way to represent the power topology which includes
> power domains, voltage domains and frequency domains. Also, more
> importantly how we can derive the optimal power/performance policy for
> the specific platform. There may be dependencies between idle and
> frequency states like it is the case for frequency boost mode like Arjan
> mentions in his reply.
>
> - The fact that not all platforms expose all idle states to the OS and
> that closed firmware may do whatever it likes behind the scenes. There
> are various reasons to do this. Not all of them are bad.
>
> - Define a scheduler driven frequency scaling policy that at least
> matches the 'performance' of the current cpufreq policies and has
> potential for further improvements.
>
> - Match the power savings of the current cpuidle governors which are
> based on arcane heuristics developed over years to predict things like
> the occurrence of the next interrupt.
>
> - Thermal aspects add more complexity to the power/performance policy.
> Depending on the platform, overheating may be handled by frequency
> capping or restricting the number of active cpus.
>
> - Asymmetric/heterogeneous multi-processors need to be dealt with.
>
> This is not a complete list. My point is that moving all policy to the
> scheduler will significantly increase the complexity of the scheduler.
> It is my impression that the general opinion is that the scheduler is
> already too complicated. Correct me if I'm wrong.
The thing we care about is the net complexity of the kernel. Moving
related kernel code next to each other will in the _worst case_ result in
exactly the same complexity as we had before.
But even just a small number of unifications will decrease complexity and
give us a chance to implement a more workable, more maintainable, more
correct power saving policy.
The scheduler maintainers have no problem with going this way - we've
asked for such a design and approach for years.
> While the proposed task packing patches are not complete solutions, they
> address the first item on the above list and can be seen as a step
> towards the goal.
>
> Should I read your recommendation as you prefer a complete and
> potentially huge patch set over incremental patch sets?
I like incremental and see no reason why this couldn't be made
incremental, by adding the new facility for a smallish, manageable number
of supported configurations - then extending it gradually as it proves
itself.
> It would be good to have even a high level agreement on the path forward
> where the expectation first and foremost is to take advantage of the
> schedulers ideal position to drive the power management while
> simplifying the power management code.
I'd suggest to try a set of patches that implements this for the hw
configuration you are most interested in - then measure and see where we
stand.
It should be a non-disruptive approach: i.e. a new CONFIG_SCHED_POWER
.config switch, which, if turned off, makes the new code go away, and it
also won't do anything on platforms that don't (yet) support the driver
model where the scheduler determines idle and performance states.
On CONFIG_SCHED_POWER=y kernels the new policy activates if there's low
level support present.
There's no other mode of operation: either the new scheduling policy is
fully there, or it's totally inactive.
This makes it entirely non-disruptive and non-regressive, while still
providing a road towards goodness.
Thanks,
Ingo
On Wed, Jun 19, 2013 at 06:08:29PM +0100, Arjan van de Ven wrote:
> On 6/19/2013 10:00 AM, Morten Rasmussen wrote:
> > On Wed, Jun 19, 2013 at 04:39:39PM +0100, Arjan van de Ven wrote:
> >> On 6/18/2013 10:47 AM, David Lang wrote:
> >>
> >>>
> >>> It's bad enough trying to guess the needs of the processes, but if you also are reduced to guessing the capabilities of the cores, how can anything be made to work?
> >>
> >> btw one way to look at this is to assume that (with some minimal hinting)
> >> the CPU driver will do the right thing and get you just about the best performance you can get
> >> (that is appropriate for the task at hand)...
> >> ... and don't do anything in the scheduler proactively.
> >
> > If I understand correctly, you mean if your hardware/firmware is fully
>
> hardware, firmware and the driver
>
> > in control of the p-state selection and changes it fast enough to match
> > the current load, the scheduler doesn't have to care? By fast enough I
> > mean, faster than the scheduler would notice if a cpu was temporarily
> > overloaded at a low p-state. In that case, you wouldn't need
> > cpufreq/p-state hints, and the scheduler would only move tasks between
> > cpus when cpus are fully loaded at their max p-state.
>
> with the migration hint, I'm pretty sure we'll be there today typically.
A hint when a task is moved to a new cpu is too late if the migration
shouldn't have happened at all. If the scheduler knows that the cpu is
able to switch to a higher p-state it can decide to wait for the p-state
change instead of migrating the task and waking up another cpu.
> we'll notice within 10 msec regardless, but the migration hint will take
> the edge of those 10 msec normally.
I'm not sure if 10 msec is fast enough for the scheduler to not notice.
Real use-case studies will tell.
>
> I would argue that the "at their max p-state" in your sentence needs to go away.
> since you don't know what you actually are except in hindsight.
> And even then you don't know if you could have gone higher or not.
Yes. What I meant was that if your p-state selection is responsive
enough the scheduler would only see the cpu as overloaded when it is in
its highest available p-state. That may determined dynamically by power,
thermal, and other factors.
>
>
> >> the hints I have in mind are not all that complex; we have the biggest issues today
> >> around task migration (the task migrates to a cold cpu... so a simple notifier chain
> >> on the new cpu as it is accepting a task and we can bump it up), real time tasks
> >> (again, simple notifier chain to get you to a predictably high performance level)
> >> and we're a long way better than we are today in terms of actual problems.
> >>
> >> For all the talk of ondemand (as ARM still uses that today)... that guy puts you in
> >> either the lowest or highest frequency over 95% of the time. Other non-cpufreq solutions
> >> like on Intel are bit more advanced (and will grow more so over time), but even there,
> >> in the grand scheme of things, the scheduler shouldn't have to care anymore with those
> >> two notifiers in place.
> >
> > You would need more than a few hints to implement more advanced capacity
> > management like proposed for the power scheduler. I believe that Intel
> > would benefit as well from guiding the scheduler to idle the right cpu
> > to enable deeper idle states and/or enable turbo-boost for other cpus.
>
> that's an interesting theory.
> I've yet to see any way to actually have that do something useful.
>
> yes there is some value in grouping a lot of very short tasks together.
> not a lot of value, but at least some.
>
> and there is some value in the grouping within a package (to a degree) thing.
>
> (both are basically "statistically, sort left" as policy)
>
The proposed task packing patches have shown significant benefits for
scenarios with many short tasks. This is a typical scenario on android.
Morten
On Tue, Jun 18, 2013 at 04:20:28PM +0100, Arjan van de Ven wrote:
> On 6/14/2013 9:05 AM, Morten Rasmussen wrote:
>
> > Looking at the discussion it seems that people have slightly different
> > views, but most agree that the goal is an integrated scheduling,
> > frequency, and idle policy like you pointed out from the beginning.
>
>
> ... except that such a solution does not really work for Intel hardware.
>
> The OS does not get to really pick the CPU "frequency" (never mind that
> frequency is not what gets controlled), the hardware picks the frequency.
> The OS can do some level of requests (best to think of this as a percentage
> more than frequency) but what you actually get is more often than not
> what you asked for.
>
> You can look in hindsight what kind of performance you got (from some basic
> counters in MSRs), and the scheduler can use that to account backwards to what some process
> got. But to predict what you will get in the future...... that's near impossible
> on any realistic system nowadays (and even more so in the future).
The proposed power scheduler doesn't have to drive p-state selection if
it doesn't make sense for the particular platform. The aim of the power
scheduler is integration of power policies in general.
>
> Treating "frequency" (well "performance) and idle separately is also a false thing to do
> (yes I know in 3.9/3.10 we still do that for Intel hw, but we're working
> on fixing that). They are by no means separate things. One guy's idle state
> is the other guys power budget (and thus performance)!.
>
I agree.
Based on our discussions so far, where it has become more clear where
Intel is heading, and Ingo's reply I think we have three ways to ahead
with the power-aware scheduling work. Each with their advantages and
disadvantages:
1. We work on a generic power scheduler with appropriate abstractions
that will work for all of us. Current and future Intel p-state policies
will be implemented through the power scheduler.
Pros: We can arrive at fairly standard solution with standard tunables.
There will be one interface to the scheduler.
Cons: Finding a suitable platform abstraction for the power scheduler.
2. Like 1, but we introduce a CONFIG_SCHED_POWER as suggested by Ingo,
that makes it all go away.
Pros: Intel can keep intel_pstate.c others can use the power scheduler
or their own driver.
Cons: Different platform specific drivers may need different interfaces
to the scheduler. Harder to define cross-platform tunables.
3. We go for independent platform specific power policy driver that may
or may not use existing frameworks, like intel_pstate.c.
Pros: No need to find common platform abstraction. Power policy is
implemented in arch/* and won't affect others.
Cons: Same as 2. Everybody would have to implement their own frequency,
idle and thermal solutions. Potential duplication of functionality.
In my opinion we should aim for 1., but start out with a
CONFIG_SCHED_POWER and see where we get to. Feedback from everybody is
essential to arrive at a generic solution.
Morten
On 6/21/2013 1:50 AM, Morten Rasmussen wrote:
>>> in control of the p-state selection and changes it fast enough to match
>>> the current load, the scheduler doesn't have to care? By fast enough I
>>> mean, faster than the scheduler would notice if a cpu was temporarily
>>> overloaded at a low p-state. In that case, you wouldn't need
>>> cpufreq/p-state hints, and the scheduler would only move tasks between
>>> cpus when cpus are fully loaded at their max p-state.
>>
>> with the migration hint, I'm pretty sure we'll be there today typically.
>
> A hint when a task is moved to a new cpu is too late if the migration
> shouldn't have happened at all. If the scheduler knows that the cpu is
> able to switch to a higher p-state it can decide to wait for the p-state
> change instead of migrating the task and waking up another cpu.
ok maybe I am missing something
but at least on the hardware I am familiar with (Intel and somewhat AMD),
the frequency (and voltage) when idle is ... 0 Hz... no matter what the OS chose for when the CPU is running.
And when coming out of idle, as part of the cost of that, is ramping up to something
appropriate.
And such ramps are FAST. Changing P state is as a result generally quite fast as well...
think "single digit microseconds" kind of fast.
Much faster than waking a CPU up in the first place (by design.. since a wakeup of a CPU
includes effectively a P state change)
I read your statement as "lets wait for the idle CPU to ramp its frequency up first",
which doesn't really make sense to me...
On 6/21/2013 1:50 AM, Morten Rasmussen wrote:
>> ypically.
> A hint when a task is moved to a new cpu is too late if the migration
> shouldn't have happened at all. If the scheduler knows that the cpu is
> able to switch to a higher p-state it can decide to wait for the p-state
> change instead of migrating the task and waking up another cpu.
>
oops sorry I misread your mail (lack of early coffee I suppose)
I can see your point of having a thing for "did we ask for all the performance
we could ask for" prior to doing a load balance (although, for power efficiency,
if you have two tasks that could run in parallel, it's usually better to
run them in parallel... so likely we should balance anyway)
On 21 June 2013 16:38, Arjan van de Ven <[email protected]> wrote:
> On 6/21/2013 1:50 AM, Morten Rasmussen wrote:
>> A hint when a task is moved to a new cpu is too late if the migration
>> shouldn't have happened at all. If the scheduler knows that the cpu is
>> able to switch to a higher p-state it can decide to wait for the p-state
>> change instead of migrating the task and waking up another cpu.
>
> oops sorry I misread your mail (lack of early coffee I suppose)
>
> I can see your point of having a thing for "did we ask for all the performance
> we could ask for" prior to doing a load balance (although, for power efficiency,
> if you have two tasks that could run in parallel, it's usually better to
> run them in parallel... so likely we should balance anyway)
Not necessarily, especially if parallel running implies powering up a
full cluster just for one CPU (it depends on the hardware but for
example a cluster may not be able to go in deeper sleep states unless
all the CPUs in that cluster are idle).
--
Catalin
On 6/21/2013 2:23 PM, Catalin Marinas wrote:
>>
>> oops sorry I misread your mail (lack of early coffee I suppose)
>>
>> I can see your point of having a thing for "did we ask for all the performance
>> we could ask for" prior to doing a load balance (although, for power efficiency,
>> if you have two tasks that could run in parallel, it's usually better to
>> run them in parallel... so likely we should balance anyway)
>
> Not necessarily, especially if parallel running implies powering up a
> full cluster just for one CPU (it depends on the hardware but for
> example a cluster may not be able to go in deeper sleep states unless
> all the CPUs in that cluster are idle).
I guess it depends on the system
the very first cpu needs to power on
* the core itself
* the "cluster" that you mention
* the memory controller
* the memory (out of self refresh)
while the second cpu needs
* the core itself
* maybe a second cluster
normally on Intel systems, the memory power delta is quite significant
which then means the efficiency of the second core is huge compared to
running things in sequence.
* Morten Rasmussen <[email protected]> wrote:
> On Tue, Jun 18, 2013 at 04:20:28PM +0100, Arjan van de Ven wrote:
> > On 6/14/2013 9:05 AM, Morten Rasmussen wrote:
> >
> > > Looking at the discussion it seems that people have slightly different
> > > views, but most agree that the goal is an integrated scheduling,
> > > frequency, and idle policy like you pointed out from the beginning.
> >
> >
> > ... except that such a solution does not really work for Intel hardware.
> >
> > The OS does not get to really pick the CPU "frequency" (never mind
> > that frequency is not what gets controlled), the hardware picks the
> > frequency. The OS can do some level of requests (best to think of this
> > as a percentage more than frequency) but what you actually get is more
> > often than not what you asked for.
> >
> > You can look in hindsight what kind of performance you got (from some
> > basic counters in MSRs), and the scheduler can use that to account
> > backwards to what some process got. But to predict what you will get
> > in the future...... that's near impossible on any realistic system
> > nowadays (and even more so in the future).
>
> The proposed power scheduler doesn't have to drive p-state selection if
> it doesn't make sense for the particular platform. The aim of the power
> scheduler is integration of power policies in general.
Exactly.
> > Treating "frequency" (well "performance) and idle separately is also a
> > false thing to do (yes I know in 3.9/3.10 we still do that for Intel
> > hw, but we're working on fixing that). They are by no means separate
> > things. One guy's idle state is the other guys power budget (and thus
> > performance)!.
>
> I agree.
>
> Based on our discussions so far, where it has become more clear where
> Intel is heading, and Ingo's reply I think we have three ways to ahead
> with the power-aware scheduling work. Each with their advantages and
> disadvantages:
>
> 1. We work on a generic power scheduler with appropriate abstractions
> that will work for all of us. Current and future Intel p-state policies
> will be implemented through the power scheduler.
>
> Pros: We can arrive at fairly standard solution with standard tunables.
> There will be one interface to the scheduler.
This is what we prefer really, made available under CONFIG_SCHED_POWER=y.
With CONFIG_SCHED_POWER=y, or if low level facilities are not (yet)
available then the kernel falls back to legacy (current) behavior.
> Cons: Finding a suitable platform abstraction for the power scheduler.
Just do it incrementally. Start from the dumbest possible state: all CPUs
are powered up fully, there's no idle state selection essentially. Then go
for the biggest effect first and add the ability to idle in a lower power
state (with new functions and a low level driver that implements this for
the platform with no policy embedded into it - just p-state switching
logic), and combine that with task packing.
Then do small, measured steps to integrate more and more facilities, the
ability to turn off more and more hardware, etc. The more basic steps you
can figure out to iterate this, the better.
Important: it's not a problem that the initial code won't outperform the
current kernel's performance. It should outperform the _initial_ 'dumb'
code in the first step. Then the next step should outperform the previous
step, etc.
The quality of this iterative approach will eventually surpass the
combined effect of currently available but non-integrated facilities.
Since this can be done without touching all the other existing facilities
it's fundamentally non-intrusive.
An initial implementation should probably cover just two platforms, a
modern ARM platform and Intel - those two are far enough from each other
so that if a generic approach helps both we are reasonably certain that
the generalization makes sense.
The new code could live under a new file in kernel/sched/power.c, to
separate it out in a tidy fashion, and to make it easy to understand.
> 2. Like 1, but we introduce a CONFIG_SCHED_POWER as suggested by Ingo,
> that makes it all go away.
That's not really what CONFIG_SCHED_POWER should do: its purpose is to
allow a 'legacy power saving mode' that makes any new logic go away.
> Pros: Intel can keep intel_pstate.c others can use the power scheduler
> or their own driver.
>
> Cons: Different platform specific drivers may need different interfaces
> to the scheduler. Harder to define cross-platform tunables.
>
> 3. We go for independent platform specific power policy driver that may
> or may not use existing frameworks, like intel_pstate.c.
And that's a NAK from the scheduler maintainers.
Thanks,
Ingo
On Fri, 2013-06-21 at 14:34 -0700, Arjan van de Ven wrote:
> On 6/21/2013 2:23 PM, Catalin Marinas wrote:
> >>
> >> oops sorry I misread your mail (lack of early coffee I suppose)
> >>
> >> I can see your point of having a thing for "did we ask for all the performance
> >> we could ask for" prior to doing a load balance (although, for power efficiency,
> >> if you have two tasks that could run in parallel, it's usually better to
> >> run them in parallel... so likely we should balance anyway)
> >
> > Not necessarily, especially if parallel running implies powering up a
> > full cluster just for one CPU (it depends on the hardware but for
> > example a cluster may not be able to go in deeper sleep states unless
> > all the CPUs in that cluster are idle).
>
> I guess it depends on the system
Sort-of. We have something similar with threads on ppc. IE, the core can
only really stop if all threads are. From a Linux persepctive it's a
matter of how we define the scope of that 'cluster' Catalin is talking
about. I'm sure you do too.
Then there is the package, which adds MC etc...
> the very first cpu needs to power on
> * the core itself
> * the "cluster" that you mention
> * the memory controller
> * the memory (out of self refresh)
>
> while the second cpu needs
> * the core itself
> * maybe a second cluster
>
> normally on Intel systems, the memory power delta is quite significant
> which then means the efficiency of the second core is huge compared to
> running things in sequence.
What's your typical latency for bringing an MC back (and memory out of
self refresh) ? IE. Basically bringing a package back up ?
Cheers,
Ben.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
On Mon, Jun 24, 2013 at 12:32:00AM +0100, Benjamin Herrenschmidt wrote:
> On Fri, 2013-06-21 at 14:34 -0700, Arjan van de Ven wrote:
> > On 6/21/2013 2:23 PM, Catalin Marinas wrote:
> > >>
> > >> oops sorry I misread your mail (lack of early coffee I suppose)
> > >>
> > >> I can see your point of having a thing for "did we ask for all the performance
> > >> we could ask for" prior to doing a load balance (although, for power efficiency,
> > >> if you have two tasks that could run in parallel, it's usually better to
> > >> run them in parallel... so likely we should balance anyway)
> > >
> > > Not necessarily, especially if parallel running implies powering up a
> > > full cluster just for one CPU (it depends on the hardware but for
> > > example a cluster may not be able to go in deeper sleep states unless
> > > all the CPUs in that cluster are idle).
> >
> > I guess it depends on the system
>
> Sort-of. We have something similar with threads on ppc. IE, the core can
> only really stop if all threads are. From a Linux persepctive it's a
> matter of how we define the scope of that 'cluster' Catalin is talking
> about. I'm sure you do too.
>
> Then there is the package, which adds MC etc...
I think we can say cluster == package so that we use some common
terminology. On a big.little configuration (TC2), we have 3xA7 in one
package and 2xA15 in the other. So to efficiently stop an entire package
(cluster, multi-core etc.) we need to stop all the CPUs it has.
--
Catalin
>> I guess it depends on the system
>
> Sort-of. We have something similar with threads on ppc. IE, the core can
> only really stop if all threads are. From a Linux persepctive it's a
> matter of how we define the scope of that 'cluster' Catalin is talking
> about. I'm sure you do too.
>
> Then there is the package, which adds MC etc...
>
>> the very first cpu needs to power on
>> * the core itself
>> * the "cluster" that you mention
>> * the memory controller
>> * the memory (out of self refresh)
>>
>> while the second cpu needs
>> * the core itself
>> * maybe a second cluster
>>
>> normally on Intel systems, the memory power delta is quite significant
>> which then means the efficiency of the second core is huge compared to
>> running things in sequence.
>
> What's your typical latency for bringing an MC back (and memory out of
> self refresh) ? IE. Basically bringing a package back up ?
to bring the system back up if all cores in the whole system are idle and power gated,
memory in SR etc... is typically < 250 usec (depends on the exact version
of the cpu etc). But the moment even one core is running, that core will keep the system
out of such deep state, and waking up a consecutive entity is much faster
to bring just a core out of power gating is more in the 40 to 50 usec range
On Mon, 2013-06-24 at 08:26 -0700, Arjan van de Ven wrote:
>
> to bring the system back up if all cores in the whole system are idle and power gated,
> memory in SR etc... is typically < 250 usec (depends on the exact version
> of the cpu etc). But the moment even one core is running, that core will keep the system
> out of such deep state, and waking up a consecutive entity is much faster
>
> to bring just a core out of power gating is more in the 40 to 50 usec range
Out of curiosity, what happens to PCIe when you bring a package down
like this ?
Cheers,
Ben.
On 6/24/2013 2:59 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2013-06-24 at 08:26 -0700, Arjan van de Ven wrote:
>>
>> to bring the system back up if all cores in the whole system are idle and power gated,
>> memory in SR etc... is typically < 250 usec (depends on the exact version
>> of the cpu etc). But the moment even one core is running, that core will keep the system
>> out of such deep state, and waking up a consecutive entity is much faster
>>
>> to bring just a core out of power gating is more in the 40 to 50 usec range
>
> Out of curiosity, what happens to PCIe when you bring a package down
> like this ?
PCIe devices can communicate latency requirements (LTR) if they need something
more aggressive than this; otherwise 250 usec afaik falls within what doesn't
break (devices need to cope with arbitrage/etc delays anyway)
and with PCIe link power management there are delays regardless; once a PCIe link gets powered
back on the memory controller/etc also will come back online