2014-06-03 11:50:30

by Peter Zijlstra

Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Fri, May 23, 2014 at 07:16:33PM +0100, Morten Rasmussen wrote:
> +static struct capacity_state cap_states_cluster_a7[] = {
> + /* Cluster only power */
> + { .cap = 358, .power = 2967, }, /* 350 MHz */
> + { .cap = 410, .power = 2792, }, /* 400 MHz */
> + { .cap = 512, .power = 2810, }, /* 500 MHz */
> + { .cap = 614, .power = 2815, }, /* 600 MHz */
> + { .cap = 717, .power = 2919, }, /* 700 MHz */
> + { .cap = 819, .power = 2847, }, /* 800 MHz */
> + { .cap = 922, .power = 3917, }, /* 900 MHz */
> + { .cap = 1024, .power = 4905, }, /* 1000 MHz */
> + };

So one thing I remember was that we spoke about restricting this to
frequency levels where the voltage changed.

Because voltage jumps were the biggest factor to energy usage.

Any word on that?



2014-06-04 16:02:36

by Morten Rasmussen

Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Tue, Jun 03, 2014 at 12:50:15PM +0100, Peter Zijlstra wrote:
> On Fri, May 23, 2014 at 07:16:33PM +0100, Morten Rasmussen wrote:
> > +static struct capacity_state cap_states_cluster_a7[] = {
> > + /* Cluster only power */
> > + { .cap = 358, .power = 2967, }, /* 350 MHz */
> > + { .cap = 410, .power = 2792, }, /* 400 MHz */
> > + { .cap = 512, .power = 2810, }, /* 500 MHz */
> > + { .cap = 614, .power = 2815, }, /* 600 MHz */
> > + { .cap = 717, .power = 2919, }, /* 700 MHz */
> > + { .cap = 819, .power = 2847, }, /* 800 MHz */
> > + { .cap = 922, .power = 3917, }, /* 900 MHz */
> > + { .cap = 1024, .power = 4905, }, /* 1000 MHz */
> > + };
>
> So one thing I remember was that we spoke about restricting this to
> frequency levels where the voltage changed.
>
> Because voltage jumps were the biggest factor to energy usage.
>
> Any word on that?

Since we don't drive P-state changes from the scheduler, I think we
could leave out P-states from the table without too much trouble. Good
point.

TC2 is an early development platform and somewhat different from what
you find in end user products. TC2 actually uses the same voltage for
all states except the highest 2-3 states. That is not typical. The
voltage is typically slightly different for each state, however, the
difference gets bigger for higher P-states. We could probably get away
with representing multiple states as one in the energy model if the
voltage change is minimal.

2014-06-04 17:27:18

by Peter Zijlstra

Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Wed, Jun 04, 2014 at 05:02:30PM +0100, Morten Rasmussen wrote:
> On Tue, Jun 03, 2014 at 12:50:15PM +0100, Peter Zijlstra wrote:
> > On Fri, May 23, 2014 at 07:16:33PM +0100, Morten Rasmussen wrote:
> > > +static struct capacity_state cap_states_cluster_a7[] = {
> > > + /* Cluster only power */
> > > + { .cap = 358, .power = 2967, }, /* 350 MHz */
> > > + { .cap = 410, .power = 2792, }, /* 400 MHz */
> > > + { .cap = 512, .power = 2810, }, /* 500 MHz */
> > > + { .cap = 614, .power = 2815, }, /* 600 MHz */
> > > + { .cap = 717, .power = 2919, }, /* 700 MHz */
> > > + { .cap = 819, .power = 2847, }, /* 800 MHz */
> > > + { .cap = 922, .power = 3917, }, /* 900 MHz */
> > > + { .cap = 1024, .power = 4905, }, /* 1000 MHz */
> > > + };
> >
> > So one thing I remember was that we spoke about restricting this to
> > frequency levels where the voltage changed.
> >
> > Because voltage jumps were the biggest factor to energy usage.
> >
> > Any word on that?
>
> Since we don't drive P-state changes from the scheduler, I think we
> could leave out P-states from the table without too much trouble. Good
> point.

Well, we eventually want to go there I think. Although we still needed
to come up with something for Intel, because I'm not at all sure how all
that works.

> TC2 is an early development platform and somewhat different from what
> you find in end user products. TC2 actually uses the same voltage for
> all states except the highest 2-3 states. That is not typical. The
> voltage is typically slightly different for each state, however, the
> difference gets bigger for higher P-states. We could probably get away
> with representing multiple states as one in the energy model if the
> voltage change is minimal.

So while I don't mind the full table, esp. if it's fairly easy to
generate using that tool you spoke about, I just wondered if it made
sense to somewhat reduce it.

Now that I look at the actual .power values, you can indeed see that all
except the last two are pretty much similar in power usage.

On that, is that fluctuation measurement noise, or is that stable?

2014-06-04 21:39:41

by Rafael J. Wysocki

Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Wednesday, June 04, 2014 07:27:12 PM Peter Zijlstra wrote:
> On Wed, Jun 04, 2014 at 05:02:30PM +0100, Morten Rasmussen wrote:
> > On Tue, Jun 03, 2014 at 12:50:15PM +0100, Peter Zijlstra wrote:
> > > On Fri, May 23, 2014 at 07:16:33PM +0100, Morten Rasmussen wrote:
> > > > +static struct capacity_state cap_states_cluster_a7[] = {
> > > > + /* Cluster only power */
> > > > + { .cap = 358, .power = 2967, }, /* 350 MHz */
> > > > + { .cap = 410, .power = 2792, }, /* 400 MHz */
> > > > + { .cap = 512, .power = 2810, }, /* 500 MHz */
> > > > + { .cap = 614, .power = 2815, }, /* 600 MHz */
> > > > + { .cap = 717, .power = 2919, }, /* 700 MHz */
> > > > + { .cap = 819, .power = 2847, }, /* 800 MHz */
> > > > + { .cap = 922, .power = 3917, }, /* 900 MHz */
> > > > + { .cap = 1024, .power = 4905, }, /* 1000 MHz */
> > > > + };
> > >
> > > So one thing I remember was that we spoke about restricting this to
> > > frequency levels where the voltage changed.
> > >
> > > Because voltage jumps were the biggest factor to energy usage.
> > >
> > > Any word on that?
> >
> > Since we don't drive P-state changes from the scheduler, I think we
> > could leave out P-states from the table without too much trouble. Good
> > point.
>
> Well, we eventually want to go there I think. Although we still needed
> to come up with something for Intel, because I'm not at all sure how all
> that works.

Do you mean power numbers or how P-states work on Intel in general?

Rafael

2014-06-05 06:52:19

by Peter Zijlstra

Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Wed, Jun 04, 2014 at 11:56:55PM +0200, Rafael J. Wysocki wrote:
> On Wednesday, June 04, 2014 07:27:12 PM Peter Zijlstra wrote:

> > Well, we eventually want to go there I think. Although we still needed
> > to come up with something for Intel, because I'm not at all sure how all
> > that works.
>
> Do you mean power numbers or how P-states work on Intel in general?

P-states, I'm still not at all sure how all that works on Intel and what
we can sanely do with them.

Supposedly Intel has a means of setting P-states (there's a driver after
all), but then is completely free to totally ignore it and do something
entirely different anyhow.

And while APERF/MPERF allows observing what it did, it's, afaik, nigh on
impossible to predict wtf it's going to do, and therefore any such energy
computation is going to be a PRNG at best.

Now, given all that I'm not sure what we need that P-state driver for,
so supposedly I'm missing something.

Ideally Len (or someone equally in-the-know) would explain to me how
exactly all that works and what we can rely upon. All I've gotten so far
is, you can't rely on anything, and magik. Which is entirely useless.



2014-06-05 15:03:21

by Dirk Brandewie

Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On 06/04/2014 11:52 PM, Peter Zijlstra wrote:
> On Wed, Jun 04, 2014 at 11:56:55PM +0200, Rafael J. Wysocki wrote:
>> On Wednesday, June 04, 2014 07:27:12 PM Peter Zijlstra wrote:
>
>>> Well, we eventually want to go there I think. Although we still needed
>>> to come up with something for Intel, because I'm not at all sure how all
>>> that works.
>>
>> Do you mean power numbers or how P-states work on Intel in general?
>
> P-states, I'm still not at all sure how all that works on Intel and what
> we can sanely do with them.
>
> Supposedly Intel has a means of setting P-states (there's a driver after
> all), but then is completely free to totally ignore it and do something
> entirely different anyhow.

You can request a P state per core but the package does coordination at
a package level for the P state that will be used based on all requests.
This is due to the fact that most SKUs have a single VR and PLL. So
the highest P state wins. When a core goes idle it loses its vote
for the current package P state and that core's clock is turned off.

>
> And while APERF/MPERF allows observing what it did, it's, afaik, nigh on
> impossible to predict wtf it's going to do, and therefore any such energy
> computation is going to be a PRNG at best.
>
> Now, given all that I'm not sure what we need that P-state driver for,
> so supposedly I'm missing something.

intel_pstate tries to keep the core P state as low as possible to satisfy
the given load, so when various cores go idle the package P state can be
as low as possible. The big power win is a core going idle.

>
> Ideally Len (or someone equally in-the-know) would explain to me how
> exactly all that works and what we can rely upon. All I've gotten so far
> is, you can't rely on anything, and magik. Which is entirely useless.
>
The only thing you can rely on is that you will get "at least" the P state
requested in the presence of hardware coordination.

2014-06-06 04:33:51

by Yuyang Du

Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Thu, Jun 05, 2014 at 08:03:15AM -0700, Dirk Brandewie wrote:
>
> You can request a P state per core but the package does coordination at
> a package level for the P state that will be used based on all requests.
> This is due to the fact that most SKUs have a single VR and PLL. So
> the highest P state wins. When a core goes idle it loses its vote
> for the current package P state and that core's clock is turned off.
>

You need to differentiate Turbo and non-Turbo. The highest P state wins? Not
really. Actually, silicon supports independent non-Turbo P-states, but just not enabled.
For Turbo, it basically depends on power budget of both core and gfx (because
they share) for each core to get which Turbo point.

> >
> >And while APERF/MPERF allows observing what it did, it's, afaik, nigh on
> >impossible to predict wtf it's going to do, and therefore any such energy
> >computation is going to be a PRNG at best.
> >
> >Now, given all that I'm not sure what we need that P-state driver for,
> >so supposedly I'm missing something.
>
> intel_pstate tries to keep the core P state as low as possible to satisfy
> the given load, so when various cores go idle the package P state can be
> as low as possible. The big power win is a core going idle.
>

In terms of prediction, it definitely can't be 100% right. But the
performance of most workloads does scale with P-state (frequency), though maybe
not linearly. So it is predictable to some extent, FWIW. And this is the basic
assumption of all governors and intel_pstate.

Thanks,
Yuyang

2014-06-06 08:05:53

by Peter Zijlstra

Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Fri, Jun 06, 2014 at 04:29:30AM +0800, Yuyang Du wrote:
> On Thu, Jun 05, 2014 at 08:03:15AM -0700, Dirk Brandewie wrote:
> >
> > You can request a P state per core but the package does coordination at
> > a package level for the P state that will be used based on all requests.
> > This is due to the fact that most SKUs have a single VR and PLL. So
> > the highest P state wins. When a core goes idle it loses its vote
> > for the current package P state and that core's clock is turned off.
> >
>
> You need to differentiate Turbo and non-Turbo. The highest P state wins? Not
> really.

*sigh* and here we go again.. someone please, write something coherent
and have all intel people sign off on it and stop saying different
things.

> Actually, silicon supports independent non-Turbo P-states, but just not enabled.

Then it doesn't exist, so no point in mentioning it.

> For Turbo, it basically depends on power budget of both core and gfx (because
> they share) for each core to get which Turbo point.

And RAPL controls can give preference of which gfx/core gets most,
right?

> > intel_pstate tries to keep the core P state as low as possible to satisfy
> > the given load, so when various cores go idle the package P state can be
> > as low as possible. The big power win is a core going idle.
> >
>
> In terms of prediction, it definitely can't be 100% right. But the
> performance of most workloads does scale with P-state (frequency), though maybe
> not linearly. So it is predictable to some extent, FWIW. And this is the basic
> assumption of all governors and intel_pstate.

So frequency isn't _that_ interesting, voltage is. And while
predictability might be their assumption, is it actually true? I
mean, there's really nothing else except to assume that; if it's not, you
can't do anything at all, so you _have_ to assume this.

But again, is the assumption true? Or just happy thoughts in an attempt
to do something.



2014-06-06 08:39:22

by Yuyang Du

Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Fri, Jun 06, 2014 at 10:05:43AM +0200, Peter Zijlstra wrote:
> On Fri, Jun 06, 2014 at 04:29:30AM +0800, Yuyang Du wrote:
> > On Thu, Jun 05, 2014 at 08:03:15AM -0700, Dirk Brandewie wrote:
> > >
> > > You can request a P state per core but the package does coordination at
> > > a package level for the P state that will be used based on all requests.
> > > This is due to the fact that most SKUs have a single VR and PLL. So
> > > the highest P state wins. When a core goes idle it loses its vote
> > > for the current package P state and that core's clock is turned off.
> > >
> >
> > You need to differentiate Turbo and non-Turbo. The highest P state wins? Not
> > really.
>
> *sigh* and here we go again.. someone please, write something coherent
> and have all intel people sign off on it and stop saying different
> things.
>
> > Actually, silicon supports independent non-Turbo P-states, but just not enabled.
>
> Then it doesn't exist, so no point in mentioning it.
>

Well, things actually get more complicated. Not-enabled is for Core. For Atom
Baytrail, each core indeed can operate at a different frequency. I am not sure
about Xeon, :)

> > For Turbo, it basically depends on power budget of both core and gfx (because
> > they share) for each core to get which Turbo point.
>
> And RAPL controls can give preference of which gfx/core gets most,
> right?
>

Maybe Jacob knows that.

> > > intel_pstate tries to keep the core P state as low as possible to satisfy
> > > the given load, so when various cores go idle the package P state can be
> > > as low as possible. The big power win is a core going idle.
> > >
> >
> > In terms of prediction, it definitely can't be 100% right. But the
> > performance of most workloads does scale with P-state (frequency), though maybe
> > not linearly. So it is predictable to some extent, FWIW. And this is the basic
> > assumption of all governors and intel_pstate.
>
> So frequency isn't _that_ interesting, voltage is. And while
> predictability might be their assumption, is it actually true? I
> mean, there's really nothing else except to assume that; if it's not, you
> can't do anything at all, so you _have_ to assume this.
>
> But again, is the assumption true? Or just happy thoughts in an attempt
> to do something.

Voltage is combined with frequency: roughly, voltage is proportional to frequency,
so, roughly, power is proportional to voltage^3. You can't say which is more
important, or there is no reason to raise voltage without raising frequency.

If only one word to say: true of false, it is true. Because given any fixed
workload, I can't see why performance would be worse if frequency is higher.

The reality, as opposed to the assumption, is two-fold:
1) If the workload is CPU bound, performance scales with frequency absolutely. If the
workload is memory bound, it does not scale. But from the kernel, we don't know whether
it is CPU bound or not (or it is hard to know). uArch statistics can model that.
2) The workload is not fixed in real time; it is changing all the time.

But still, the assumption is a must, and a harmless one, because we adjust frequency
continuously: for example, if the workload is fixed and the performance does not scale
with freq, we stop increasing frequency. So a good frequency governor or driver
should, and can, continuously pursue a "good" frequency as the workload changes.
Therefore, in the long term, we will be better off.

2014-06-06 10:50:48

by Peter Zijlstra

Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Fri, Jun 06, 2014 at 08:35:21AM +0800, Yuyang Du wrote:

> > > Actually, silicon supports independent non-Turbo P-states, but just not enabled.
> >
> > Then it doesn't exist, so no point in mentioning it.
> >
>
> Well, things actually get more complicated. Not-enabled is for Core. For Atom
> Baytrail, each core indeed can operate at a different frequency. I am not sure
> about Xeon, :)

Yes, I understand Atom is an entirely different thing.

> > So frequency isn't _that_ interesting, voltage is. And while
> > predictability might be their assumption, is it actually true? I
> > mean, there's really nothing else except to assume that; if it's not, you
> > can't do anything at all, so you _have_ to assume this.
> >
> > But again, is the assumption true? Or just happy thoughts in an attempt
> > to do something.
>
> Voltage is combined with frequency, roughly, voltage is proportional
> to frequency, so, roughly, power is proportional to voltage^3. You

P ~ V^2, last time I checked.

> can't say which is more important, or there is no reason to raise
> voltage without raising frequency.

Well, some chips have far fewer voltage steps than freq steps; or,
differently put, they have multiple freq steps for a single voltage
level.

And since the power (Watts) is proportional to voltage squared, it's the
biggest term.

If you have a distinct voltage level for each freq, it all doesn't
matter.

> If only one word to say: true or false, it is true. Because given any
> fixed workload, I can't see why performance would be worse if
> frequency is higher.

Well, our work here is to redefine performance as performance/watt. So
running at higher frequency (and thus likely higher voltage) is a
definite performance decrease in that sense.

> The reality, as opposed to the assumption, is two-fold:
> 1) If the workload is CPU bound, performance scales with frequency absolutely. If the
> workload is memory bound, it does not scale. But from the kernel, we don't know whether
> it is CPU bound or not (or it is hard to know). uArch statistics can model that.

Well, we could know for a number of archs, it's just that these
statistics are expensive to track.

Also, lowering P-state is 'fine', as long as you can 'guarantee' you
don't lose IPC performance, since running at lower voltage for the same
IPC is actually better IPC/watt than estimated.

But what was said earlier is that P-state is a lower limit, not a higher
limit. In that case the core can run at higher voltage and the estimate
is just plain wrong.

> But still, the assumption is a must, and a harmless one, because we adjust
> frequency continuously: for example, if the workload is fixed and the
> performance does not scale with freq, we stop increasing frequency.
> So a good frequency governor or driver should, and can, continuously
> pursue a "good" frequency as the workload changes. Therefore, in the
> long term, we will be better off.

Sure, but realize that we must fully understand this governor and
integrate it in the scheduler if we're to attain the goal of IPC/watt
optimized scheduling behaviour.

So you (or rather Intel in general) will have to be very explicit about how
your stuff works and can no longer hide in some driver and do magic.
The same is true for all other vendors for that matter.

If you (vendors, not Yuyang specifically) do not want to play (and be
explicit and expose how your hardware functions) then you simply will
not get power efficient scheduling full stop.

There's no rocks to hide under, no magic veils to hide behind. You tell
_in_public_ or you get nothing.



2014-06-06 12:13:13

by Ingo Molnar

Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler


* Peter Zijlstra <[email protected]> wrote:

> > Voltage is combined with frequency, roughly, voltage is
> > proportional to frequency, so roughly, power is proportional to
> > voltage^3. You
>
> P ~ V^2, last time I checked.

Yes, that's a good approximation for CMOS gates:

The switching power dissipated by a chip using static CMOS gates is
C·V^2·f, where C is the capacitance being switched per clock cycle,
V is the supply voltage, and f is the switching frequency,[1] so
this part of the power consumption decreases quadratically with
voltage. The formula is not exact however, as many modern chips are
not implemented using 100% CMOS, but also use special memory
circuits, dynamic logic such as domino logic, etc. Moreover, there
is also a static leakage current, which has become more and more
accentuated as feature sizes have become smaller (below 90
nanometres) and threshold levels lower.

Accordingly, dynamic voltage scaling is widely used as part of
strategies to manage switching power consumption in battery powered
devices such as cell phones and laptop computers. Low voltage modes
are used in conjunction with lowered clock frequencies to minimize
power consumption associated with components such as CPUs and DSPs;
only when significant computational power is needed will the voltage
and frequency be raised.

Some peripherals also support low voltage operational modes. For
example, low power MMC and SD cards can run at 1.8 V as well as at
3.3 V, and driver stacks may conserve power by switching to the
lower voltage after detecting a card which supports it.

When leakage current is a significant factor in terms of power
consumption, chips are often designed so that portions of them can
be powered completely off. This is not usually viewed as being
dynamic voltage scaling, because it is not transparent to software.
When sections of chips can be turned off, as for example on TI OMAP3
processors, drivers and other support software need to support that.

http://en.wikipedia.org/wiki/Dynamic_voltage_scaling

Leakage current typically gets higher with higher frequencies, but
it's also highly process dependent AFAIK.

If switching power dissipation is the main factor in power use, then
we can essentially assume that P ~ V^2, at the same frequency - and
scales linearly with frequency - but real work performed also scales
semi-linearly with frequency for many workloads, so that's an
invariant for everything except highly memory bound workloads.

Thanks,

Ingo

2014-06-06 12:27:48

by Ingo Molnar

Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler


* Ingo Molnar <[email protected]> wrote:

> * Peter Zijlstra <[email protected]> wrote:
>
> > > Voltage is combined with frequency, roughly, voltage is
> > > proportional to frequency, so roughly, power is proportional to
> > > voltage^3. You
> >
> > P ~ V^2, last time I checked.
>
> Yes, that's a good approximation for CMOS gates:
>
> The switching power dissipated by a chip using static CMOS gates is
> C·V^2·f, where C is the capacitance being switched per clock cycle,
> V is the supply voltage, and f is the switching frequency,[1] so
> this part of the power consumption decreases quadratically with
> voltage. The formula is not exact however, as many modern chips are
> not implemented using 100% CMOS, but also use special memory
> circuits, dynamic logic such as domino logic, etc. Moreover, there
> is also a static leakage current, which has become more and more
> accentuated as feature sizes have become smaller (below 90
> nanometres) and threshold levels lower.
>
> Accordingly, dynamic voltage scaling is widely used as part of
> strategies to manage switching power consumption in battery powered
> devices such as cell phones and laptop computers. Low voltage modes
> are used in conjunction with lowered clock frequencies to minimize
> power consumption associated with components such as CPUs and DSPs;
> only when significant computational power is needed will the voltage
> and frequency be raised.
>
> Some peripherals also support low voltage operational modes. For
> example, low power MMC and SD cards can run at 1.8 V as well as at
> 3.3 V, and driver stacks may conserve power by switching to the
> lower voltage after detecting a card which supports it.
>
> When leakage current is a significant factor in terms of power
> consumption, chips are often designed so that portions of them can
> be powered completely off. This is not usually viewed as being
> dynamic voltage scaling, because it is not transparent to software.
> When sections of chips can be turned off, as for example on TI OMAP3
> processors, drivers and other support software need to support that.
>
> http://en.wikipedia.org/wiki/Dynamic_voltage_scaling
>
> Leakage current typically gets higher with higher frequencies, but
> it's also highly process dependent AFAIK.
>
> If switching power dissipation is the main factor in power use, then
> we can essentially assume that P ~ V^2, at the same frequency - and
> scales linearly with frequency - but real work performed also scales
> semi-linearly with frequency for many workloads, so that's an
> invariant for everything except highly memory bound workloads.

So in practice this means that Turbo probably has a somewhat
super-linear power use factor.

At lower frequencies the leakage current difference is probably
negligible.

In any case, even with turbo frequencies, switching power use is
probably an order of magnitude higher than leakage current power use,
on any marketable chip, so we should concentrate on being able to
cover this first order effect (P/work ~ V^2), before considering any
second order effects (leakage current).

Thanks,

Ingo

2014-06-06 13:03:52

by Morten Rasmussen

Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Wed, Jun 04, 2014 at 06:27:12PM +0100, Peter Zijlstra wrote:
> On Wed, Jun 04, 2014 at 05:02:30PM +0100, Morten Rasmussen wrote:
> > On Tue, Jun 03, 2014 at 12:50:15PM +0100, Peter Zijlstra wrote:
> > > On Fri, May 23, 2014 at 07:16:33PM +0100, Morten Rasmussen wrote:
> > > > +static struct capacity_state cap_states_cluster_a7[] = {
> > > > + /* Cluster only power */
> > > > + { .cap = 358, .power = 2967, }, /* 350 MHz */
> > > > + { .cap = 410, .power = 2792, }, /* 400 MHz */
> > > > + { .cap = 512, .power = 2810, }, /* 500 MHz */
> > > > + { .cap = 614, .power = 2815, }, /* 600 MHz */
> > > > + { .cap = 717, .power = 2919, }, /* 700 MHz */
> > > > + { .cap = 819, .power = 2847, }, /* 800 MHz */
> > > > + { .cap = 922, .power = 3917, }, /* 900 MHz */
> > > > + { .cap = 1024, .power = 4905, }, /* 1000 MHz */
> > > > + };

[...]

> > TC2 is an early development platform and somewhat different from what
> > you find in end user products. TC2 actually uses the same voltage for
> > all states except the highest 2-3 states. That is not typical. The
> > voltage is typically slightly different for each state, however, the
> > difference gets bigger for higher P-states. We could probably get away
> > with representing multiple states as one in the energy model if the
> > voltage change is minimal.
>
> So while I don't mind the full table, esp. if it's fairly easy to
> generate using that tool you spoke about, I just wondered if it made
> sense to somewhat reduce it.
>
> Now that I look at the actual .power values, you can indeed see that all
> except the last two are pretty much similar in power usage.
>
> On that, is that fluctuation measurement noise, or is that stable?

It would make sense to reduce it for this particular platform. In fact
it is questionable whether we should use frequencies below 800 MHz at
all. On TC2 the voltage is the same for 800 MHz and below and it seems that
leakage (static) power is dominating the power consumption. Since the
power is almost constant in the range 350 to 800 MHz energy-efficiency
(performance/watt ~ cap/power) is actually getting *better* as we run
faster until we get to 800 MHz. Beyond 800 MHz energy-efficiency goes
down due to increased voltages.

The proposed platform energy model is an extremely simplified view of
the platform. The numbers are pretty much the raw data normalized and
averaged as appropriate. I haven't tweaked them in any way to make them
look more perfect. So, the small variations (within 4%) may be
measurement noise and the fact that I model something complex with a simple
model.

2014-06-06 14:12:01

by Morten Rasmussen

Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Fri, Jun 06, 2014 at 01:27:40PM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > > Voltage is combined with frequency, roughly, voltage is
> > > > proportional to frequency, so roughly, power is proportional to
> > > > voltage^3. You
> > >
> > > P ~ V^2, last time I checked.
> >
> > Yes, that's a good approximation for CMOS gates:
> >
> > The switching power dissipated by a chip using static CMOS gates is
> > C·V^2·f, where C is the capacitance being switched per clock cycle,
> > V is the supply voltage, and f is the switching frequency,[1] so
> > this part of the power consumption decreases quadratically with
> > voltage. The formula is not exact however, as many modern chips are
> > not implemented using 100% CMOS, but also use special memory
> > circuits, dynamic logic such as domino logic, etc. Moreover, there
> > is also a static leakage current, which has become more and more
> > accentuated as feature sizes have become smaller (below 90
> > nanometres) and threshold levels lower.
> >
> > Accordingly, dynamic voltage scaling is widely used as part of
> > strategies to manage switching power consumption in battery powered
> > devices such as cell phones and laptop computers. Low voltage modes
> > are used in conjunction with lowered clock frequencies to minimize
> > power consumption associated with components such as CPUs and DSPs;
> > only when significant computational power is needed will the voltage
> > and frequency be raised.
> >
> > Some peripherals also support low voltage operational modes. For
> > example, low power MMC and SD cards can run at 1.8 V as well as at
> > 3.3 V, and driver stacks may conserve power by switching to the
> > lower voltage after detecting a card which supports it.
> >
> > When leakage current is a significant factor in terms of power
> > consumption, chips are often designed so that portions of them can
> > be powered completely off. This is not usually viewed as being
> > dynamic voltage scaling, because it is not transparent to software.
> > When sections of chips can be turned off, as for example on TI OMAP3
> > processors, drivers and other support software need to support that.
> >
> > http://en.wikipedia.org/wiki/Dynamic_voltage_scaling
> >
> > Leakage current typically gets higher with higher frequencies, but
> > it's also highly process dependent AFAIK.

Strictly speaking, leakage current gets higher with voltage, not
frequency (well, not to an extent where we should care). However, a
frequency increase typically implies a voltage increase, so in that
sense I agree.
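
The switching-power relation quoted above (P = C·V^2·f) can be sketched
numerically. The unit choices below are illustrative assumptions, not
TC2 data:

```c
#include <assert.h>

/* Dynamic CMOS switching power: P = C * V^2 * f.
 * With C in nF, V in volts and f in MHz this conveniently
 * yields milliwatts (nF * V^2 * MHz = mW). */
static double dynamic_power_mw(double cap_nf, double volt, double freq_mhz)
{
	return cap_nf * volt * volt * freq_mhz;
}
```

Halving the voltage at a fixed frequency quarters the switching power,
which is the quadratic term this thread keeps returning to.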

> >
> > If switching power dissipation is the main factor in power use, then
> > we can essentially assume that P ~ V^2, at the same frequency - and
> > scales linearly with frequency - but real work performed also scales
> > semi-linearly with frequency for many workloads, so that's an
> > invariant for everything except highly memory bound workloads.

AFAIK, there isn't much sense in running at a slower frequency than the
highest one supported at a given voltage unless there are specific
reasons not to (peripherals that keep the system up anyway and such).
In the general case, I think it is safe to assume that energy-efficiency
goes down for every increase in frequency. Modern ARM platforms
typically have different voltages for more or less all frequencies (TC2
is quite atypical). The voltage increases more rapidly than the
frequency, which makes the higher frequencies extremely expensive in
terms of energy-efficiency.

All of this is of course without considering power gating, which allows
us to eliminate the leakage power (or at least part of it) when idle.
So, while energy-efficiency is bad at high frequencies, it might pay
off overall to use them anyway if we can save more leakage energy while
idle than we burn extra to race to idle. This is where the platform
energy model becomes useful.
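
The race-to-idle trade-off described here can be sketched as a simple
energy comparison over a fixed period. The power numbers below are
made-up placeholders, not TC2 measurements:

```c
#include <assert.h>

/* Energy over a fixed period when `work_cycles` cycles are executed
 * at `freq` and the CPU then idles for the remainder:
 *   E = P_busy * t_busy + P_idle * (period - t_busy)
 * Whether a fast-but-inefficient OPP wins overall depends on how much
 * idle (leakage) power it lets us avoid while idle. */
static double period_energy(double work_cycles, double freq,
			    double p_busy, double p_idle, double period)
{
	double t_busy = work_cycles / freq;

	return p_busy * t_busy + p_idle * (period - t_busy);
}
```

Evaluating this for two candidate OPPs and picking the lower total is
exactly the kind of decision the platform energy model enables.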

> So in practice this probably means that Turbo has a somewhat
> super-linear power use factor.

I'm not familiar with the voltage scaling on Intel platforms, but as
said above, I think power always scales up faster than performance. It
can probably be ignored for lower frequencies, but for the higher ones,
the extra energy per instruction executed is significant.

> At lower frequencies the leakage current difference is probably
> negligible.

It is still there, but it is smaller due to the reduced voltage and so
is the dynamic power.

> In any case, even with turbo frequencies, switching power use is
> probably an order of magnitude higher than leakage current power use,
> on any marketable chip,

That strongly depends on the process and the gate library used, but I
agree that dynamic power should be our primary focus.

> so we should concentrate on being able to
> cover this first order effect (P/work ~ V^2), before considering any
> second order effects (leakage current).

I think we should be fine as long as we include the leakage power in the
'busy' power consumption and know the idle-state power consumption in
the idle-states. I already do this in the TC2 model. That way we don't
have to distinguish between leakage and dynamic power.

Morten

2014-06-06 16:28:19

by Jacob Pan

[permalink] [raw]
Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Fri, 6 Jun 2014 08:35:21 +0800
Yuyang Du <[email protected]> wrote:

> On Fri, Jun 06, 2014 at 10:05:43AM +0200, Peter Zijlstra wrote:
> > On Fri, Jun 06, 2014 at 04:29:30AM +0800, Yuyang Du wrote:
> > > On Thu, Jun 05, 2014 at 08:03:15AM -0700, Dirk Brandewie wrote:
> > > >
> > > > You can request a P state per core but the package does
> > > > coordination at a package level for the P state that will be
> > > > used based on all requests. This is due to the fact that most
> > > > SKUs have a single VR and PLL. So the highest P state wins.
> > > > When a core goes idle it loses its vote for the current
> > > > package P state and that core's clock is turned off.
> > > >
> > >
> > > You need to differentiate Turbo and non-Turbo. The highest P
> > > state wins? Not really.
> >
> > *sigh* and here we go again.. someone please, write something
> > coherent and have all intel people sign off on it and stop saying
> > different things.
> >
> > > Actually, silicon supports independent non-Turbo P-states, but
> > > they are just not enabled.
> >
> > Then it doesn't exist, so no point in mentioning it.
> >
>
> Well, things actually get more complicated. Not-enabled is for Core.
> For Atom Baytrail, each core can indeed operate at a different
> frequency. I am not sure about Xeon, :)
>
> > > For Turbo, it basically depends on power budget of both core and
> > > gfx (because they share) for each core to get which Turbo point.
> >
> > And RAPL controls can give preference of which gfx/core gets most,
> > right?
> >
>
There are two controls that can influence gfx and core power budget sharing:
1. setting a power limit on each RAPL domain
2. turbo power budget sharing
#2 is not implemented yet; the default is that the CPU takes all.

>
> > > > intel_pstate tries to keep the core P state as low as possible
> > > > to satisfy the given load, so when various cores go idle the
> > > > package P state can be as low as possible. The big power win
> > > > is a core going idle.
> > > >
> > >
> > > In terms of prediction, it definitely can't be 100% right. But
> > > the performance of most workloads does scale with P-state
> > > (frequency), though maybe not linearly. So it is to some extent
> > > predictable, FWIW. And this is the basic assumption of all
> > > governors and intel_pstate.
> >
> > So frequency isn't _that_ interesting, voltage is. And while
> > predictability might be their assumption, is it actually true? I
> > mean, there's really nothing else except to assume that; if it's
> > not, you can't do anything at all, so you _have_ to assume this.
> >
> > But again, is the assumption true? Or just happy thoughts in an
> > attempt to do something.
>
> Voltage is combined with frequency; roughly, voltage is proportional
> to frequency, so roughly, power is proportional to voltage^3. You
> can't say which is more important, or there is no reason to raise
> voltage without raising frequency.
>
> If only one word to say, true or false: it is true. Because given any
> fixed workload, I can't see why performance would be worse if
> frequency is higher.
>
> The reality, as opposed to the assumption, is two-fold:
> 1) if the workload is CPU bound, performance scales with frequency
> absolutely; if the workload is memory bound, it does not scale. But
> from the kernel, we don't know whether it is CPU bound or not (or it
> is hard to know). uArch statistics can model that.
> 2) the workload is not fixed in real time; it changes all the time.
>
> But still, the assumption is a must (and blameless), because we adjust
> frequency continuously: for example, if the workload is fixed and the
> performance does not scale with freq, we stop increasing the
> frequency. So a good frequency governor or driver should and can
> continuously pursue a "good" frequency for the changing workload.
> Therefore, in the long term, we will be better off.
>

[Jacob Pan]

2014-06-07 02:34:03

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Fri, 6 Jun 2014, Ingo Molnar wrote:

> In any case, even with turbo frequencies, switching power use is
> probably an order of magnitude higher than leakage current power use,
> on any marketable chip, so we should concentrate on being able to
> cover this first order effect (P/work ~ V^2), before considering any
> second order effects (leakage current).

Just so that people are aware... We'll have to introduce thermal
constraint management into the scheduler mix as well at some point.
Right now what we have is an ad hoc subsystem that simply monitors
temperature and apply crude cooling strategies when some thresholds are
met. But a better strategy would imply thermal "provisioning".


Nicolas

2014-06-07 02:52:53

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Wed, 4 Jun 2014, Peter Zijlstra wrote:

> On Wed, Jun 04, 2014 at 05:02:30PM +0100, Morten Rasmussen wrote:
> > On Tue, Jun 03, 2014 at 12:50:15PM +0100, Peter Zijlstra wrote:
> > > On Fri, May 23, 2014 at 07:16:33PM +0100, Morten Rasmussen wrote:
> > > > +static struct capacity_state cap_states_cluster_a7[] = {
> > > > + /* Cluster only power */
> > > > + { .cap = 358, .power = 2967, }, /* 350 MHz */
> > > > + { .cap = 410, .power = 2792, }, /* 400 MHz */
> > > > + { .cap = 512, .power = 2810, }, /* 500 MHz */
> > > > + { .cap = 614, .power = 2815, }, /* 600 MHz */
> > > > + { .cap = 717, .power = 2919, }, /* 700 MHz */
> > > > + { .cap = 819, .power = 2847, }, /* 800 MHz */
> > > > + { .cap = 922, .power = 3917, }, /* 900 MHz */
> > > > + { .cap = 1024, .power = 4905, }, /* 1000 MHz */
> > > > + };
> > >
> > > So one thing I remember was that we spoke about restricting this to
> > > frequency levels where the voltage changed.
> > >
> > > Because voltage jumps were the biggest factor to energy usage.
> > >
> > > Any word on that?
> >
> > Since we don't drive P-state changes from the scheduler, I think we
> > could leave out P-states from the table without too much trouble. Good
> > point.
>
> Well, we eventually want to go there I think.

People within Linaro have initial code for this. Should be posted as an
RFC soon.

> Although we still needed
> to come up with something for Intel, because I'm not at all sure how all
> that works.

Our initial code reuses the existing platform-specific cpufreq
drivers. The idea is to bypass the cpufreq governors.

If Intel hardware doesn't provide/allow much control here then the
platform driver should already tell the cpufreq core (and by extension
the scheduler) about the extent of what can be done.


Nicolas

2014-06-08 07:30:24

by Yuyang Du

[permalink] [raw]
Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Fri, Jun 06, 2014 at 12:50:36PM +0200, Peter Zijlstra wrote:
> > Voltage is combined with frequency; roughly, voltage is proportional
> > to frequency, so roughly, power is proportional to voltage^3. You
>
> P ~ V^2, last time I checked.
>
> > can't say which is more important, or there is no reason to raise
> > voltage without raising frequency.
>
> Well, some chips have far fewer voltage steps than freq steps; or,
> differently put, they have multiple freq steps for a single voltage
> level.
>
> And since the power (Watts) is proportional to Voltage squared, it's the
> biggest term.
>
> If you have a distinct voltage level for each freq, it all doesn't
> matter.
>

Ok. I think we understand each other. But one more thing, I said P ~ V^3,
because P ~ V^2*f and f ~ V, so P ~ V^3. Maybe some frequencies share the same
voltage, but you can still safely assume V changes with f in general, and it
will be more and more so, since we do need finer control over power consumption.
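
Under the f ~ V assumption above, relative power then scales with the
cube of relative frequency. A minimal numeric sketch (purely
illustrative, not measured data):

```c
#include <assert.h>

/* P ~ C * V^2 * f with V ~ f gives P ~ f^3, relative to some
 * baseline operating point. */
static double rel_power(double rel_freq)
{
	return rel_freq * rel_freq * rel_freq;
}
```

Doubling the frequency thus costs roughly 8x the power for only 2x the
throughput, which is why the highest OPPs are so energy-inefficient.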

> Sure, but realize that we must fully understand this governor and
> integrate it in the scheduler if we're to attain the goal of IPC/watt
> optimized scheduling behaviour.
>

Attain the goal of IPC/watt optimized?

I don't see how it can be done like this. As I said, what is unknown for
prediction is perf scaling *and* changing workload. So the challenge for pstate
control is in both. But I see more challenge in the changing workload than
in the performance scaling or the resulting IPC impact (if the workload is
fixed).

Currently, all freq governors take CPU utilization (load%) as the indicator
(target), which can serve both: workload and perf scaling.

As for IPC/watt optimized, I don't see how it can be practical. Too micro to
be used for the general well-being?

> So you (or rather Intel in general) will have to be very explicit on how
> their stuff works and can no longer hide in some driver and do magic.
> The same is true for all other vendors for that matter.
>
> If you (vendors, not Yuyang in specific) do not want to play (and be
> explicit and expose how your hardware functions) then you simply will
> not get power efficient scheduling full stop.
>
> There's no rocks to hide under, no magic veils to hide behind. You tell
> _in_public_ or you get nothing.

Better communication is good, especially for our rapidly iterating
products, because the changing products do introduce noise and
inconsistency in the details.

2014-06-08 07:57:04

by Yuyang Du

[permalink] [raw]
Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Fri, Jun 06, 2014 at 02:13:05PM +0200, Ingo Molnar wrote:
>
> Leakage current typically gets higher with higher frequencies, but
> it's also highly process dependent AFAIK.
>

In general, you can assume leakage power ~ V^2.

> If switching power dissipation is the main factor in power use, then
> we can essentially assume that P ~ V^2, at the same frequency - and
> scales linearly with frequency - but real work performed also scales
> semi-linearly with frequency for many workloads, so that's an
> invariant for everything except highly memory bound workloads.
>

Agreed. Strictly, Energy ~ V^2.

2014-06-09 08:27:48

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Sat, Jun 07, 2014 at 03:33:58AM +0100, Nicolas Pitre wrote:
> On Fri, 6 Jun 2014, Ingo Molnar wrote:
>
> > In any case, even with turbo frequencies, switching power use is
> > probably an order of magnitude higher than leakage current power use,
> > on any marketable chip, so we should concentrate on being able to
> > cover this first order effect (P/work ~ V^2), before considering any
> > second order effects (leakage current).
>
> Just so that people are aware... We'll have to introduce thermal
> constraint management into the scheduler mix as well at some point.
> Right now what we have is an ad hoc subsystem that simply monitors
> temperature and apply crude cooling strategies when some thresholds are
> met. But a better strategy would imply thermal "provisioning".

There is already work going on to improve thermal management:

http://lwn.net/Articles/599598/

The proposal is based on power/energy models (too). The goal is to
allocate power intelligently based on performance requirements.

While it is related to energy-aware scheduling and I fully agree that it
is something we need to consider, I think it is worth developing the two
ideas in parallel and look at sharing things like the power model later
once things mature. Energy-aware scheduling is complex enough on its
own to keep us entertained for a while :-)

Morten

2014-06-09 08:59:59

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Sun, Jun 08, 2014 at 12:26:29AM +0100, Yuyang Du wrote:
> On Fri, Jun 06, 2014 at 12:50:36PM +0200, Peter Zijlstra wrote:
> > > Voltage is combined with frequency; roughly, voltage is proportional
> > > to frequency, so roughly, power is proportional to voltage^3. You
> >
> > P ~ V^2, last time I checked.
> >
> > > can't say which is more important, or there is no reason to raise
> > > voltage without raising frequency.
> >
> > Well, some chips have far fewer voltage steps than freq steps; or,
> > differently put, they have multiple freq steps for a single voltage
> > level.
> >
> > And since the power (Watts) is proportional to Voltage squared, it's the
> > biggest term.
> >
> > If you have a distinct voltage level for each freq, it all doesn't
> > matter.
> >
>
> Ok. I think we understand each other. But one more thing, I said P ~ V^3,
> because P ~ V^2*f and f ~ V, so P ~ V^3. Maybe some frequencies share the same
> voltage, but you can still safely assume V changes with f in general, and it
> will be more and more so, since we do need finer control over power consumption.

Agreed. Voltage typically changes with frequency.

>
> > Sure, but realize that we must fully understand this governor and
> > integrate it in the scheduler if we're to attain the goal of IPC/watt
> > optimized scheduling behaviour.
> >
>
> Attain the goal of IPC/watt optimized?
>
> I don't see how it can be done like this. As I said, what is unknown for
> prediction is perf scaling *and* changing workload. So the challenge for pstate
> control is in both. But I see more challenge in the changing workload than
> in the performance scaling or the resulting IPC impact (if workload is
> fixed).

IMHO, the per-entity load-tracking does a fair job representing the task
compute capacity requirements. Sure it isn't perfect, particularly not
for memory bound tasks, but it is way better than not having any task
history at all, which was the case before.

The story is more or less the same for performance scaling. It is not
taken into account at all in the scheduler at the moment. cpufreq is
actually messing up load-balancing decisions after task load-tracking
was introduced. Adding performance scaling awareness should only make
things better even if predictions are not accurate for all workloads. I
don't see why it shouldn't given the current state of energy-awareness
in the scheduler.

> Currently, all freq governors take CPU utilization (load%) as the indicator
> (target), which can serve both: workload and perf scaling.

With a bunch of hacks on top to make it more reactive because the
current cpu utilization metric is not responsive enough to deal with
workload changes. That is at least the case for ondemand and interactive
(in Android).

> As for IPC/watt optimized, I don't see how it can be practical. Too micro to
> be used for the general well-being?

That is why I propose to have a platform specific energy model. You tell
the scheduler enough about your platform that it understands the most
basic power/performance trade-offs of your platform and thereby enable
the scheduler to make better decisions.

2014-06-09 10:19:12

by Yuyang Du

[permalink] [raw]
Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Mon, Jun 09, 2014 at 09:59:52AM +0100, Morten Rasmussen wrote:
> IMHO, the per-entity load-tracking does a fair job representing the task
> compute capacity requirements. Sure it isn't perfect, particularly not
> for memory bound tasks, but it is way better than not having any task
> history at all, which was the case before.
>
> The story is more or less the same for performance scaling. It is not
> taken into account at all in the scheduler at the moment. cpufreq is
> actually messing up load-balancing decisions after task load-tracking
> was introduced. Adding performance scaling awareness should only make
> things better even if predictions are not accurate for all workloads. I
> don't see why it shouldn't given the current state of energy-awareness
> in the scheduler.
>

Optimized IPC is good for sure (with regard to P-state adjustment). My
point is how to practically correlate it with scheduler and P-state
power-efficiency. Put another way, with a fixed workload, you really can
do such a thing by running the workload offline several times to arrive
at a very power-efficient solution that takes IPC into account.
Actually, lots of people have done that in papers/reports (for SPECXXX
or TPC-X, for example). But I can't see how this can be done online for
real-time workloads.

> > Currently, all freq governors take CPU utilization (load%) as the indicator
> > (target), which can serve both: workload and perf scaling.
>
> With a bunch of hacks on top to make it more reactive because the
> current cpu utilization metric is not responsive enough to deal with
> workload changes. That is at least the case for ondemand and interactive
> (in Android).
>

In what way is it not responsive enough? And how is it related here?

Thanks,
Yuyang

2014-06-09 13:22:53

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Mon, 9 Jun 2014, Morten Rasmussen wrote:

> On Sat, Jun 07, 2014 at 03:33:58AM +0100, Nicolas Pitre wrote:
> > On Fri, 6 Jun 2014, Ingo Molnar wrote:
> >
> > > In any case, even with turbo frequencies, switching power use is
> > > probably an order of magnitude higher than leakage current power use,
> > > on any marketable chip, so we should concentrate on being able to
> > > cover this first order effect (P/work ~ V^2), before considering any
> > > second order effects (leakage current).
> >
> > Just so that people are aware... We'll have to introduce thermal
> > constraint management into the scheduler mix as well at some point.
> > Right now what we have is an ad hoc subsystem that simply monitors
> > temperature and apply crude cooling strategies when some thresholds are
> > met. But a better strategy would imply thermal "provisioning".
>
> There is already work going on to improve thermal management:
>
> http://lwn.net/Articles/599598/
>
> The proposal is based on power/energy models (too). The goal is to
> allocate power intelligently based on performance requirements.

Ah, great! I missed that.

> While it is related to energy-aware scheduling and I fully agree that it
> is something we need to consider, I think it is worth developing the two
> ideas in parallel and look at sharing things like the power model later
> once things mature. Energy-aware scheduling is complex enough on its
> own to keep us entertained for a while :-)

Absolutely. This is why I said "at some point".


Nicolas

2014-06-10 10:16:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Sun, Jun 08, 2014 at 07:26:29AM +0800, Yuyang Du wrote:
> Ok. I think we understand each other. But one more thing, I said P ~ V^3,
> because P ~ V^2*f and f ~ V, so P ~ V^3. Maybe some frequencies share the same
> voltage, but you can still safely assume V changes with f in general, and it
> will be more and more so, since we do need finer control over power consumption.

I didn't know the frequency part was proportional to another voltage
term; ok, then the cubic term makes sense.

> > Sure, but realize that we must fully understand this governor and
> > integrate it in the scheduler if we're to attain the goal of IPC/watt
> > optimized scheduling behaviour.
> >
>
> Attain the goal of IPC/watt optimized?
>
> I don't see how it can be done like this. As I said, what is unknown for
> prediction is perf scaling *and* changing workload. So the challenge for pstate
> control is in both. But I see more challenge in the changing workload than
> in the performance scaling or the resulting IPC impact (if workload is
> fixed).

But for the scheduler the workload change isn't that big a problem; we
know the history of each task, we know when tasks wake up and when we
move them around. Therefore we can fairly accurately predict this.

And given a simple P state model (like ARM) where the CPU simply does
what you tell it to, that all works out. We can change P-state at task
wakeup/sleep/migration and compute the most efficient P-state, and task
distribution, for the new task-set.
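
A sketch of what such a decision could look like against the
capacity/power table quoted at the top of the thread. This is an
illustration, not the actual patch code; `pick_state` is a hypothetical
helper:

```c
#include <assert.h>
#include <stddef.h>

struct capacity_state {
	int cap;	/* normalized compute capacity */
	int power;	/* power at this OPP */
};

/* Values copied from the TC2 A7 cluster table quoted earlier. */
static const struct capacity_state cap_states_cluster_a7[] = {
	{ 358, 2967 }, { 410, 2792 }, { 512, 2810 }, { 614, 2815 },
	{ 717, 2919 }, { 819, 2847 }, { 922, 3917 }, { 1024, 4905 },
};

#define NR_STATES \
	(sizeof(cap_states_cluster_a7) / sizeof(cap_states_cluster_a7[0]))

/* Pick the lowest-capacity state able to serve the aggregate demand
 * of the task set; saturate at the highest OPP. */
static const struct capacity_state *pick_state(int demand)
{
	size_t i;

	for (i = 0; i < NR_STATES; i++)
		if (cap_states_cluster_a7[i].cap >= demand)
			return &cap_states_cluster_a7[i];
	return &cap_states_cluster_a7[NR_STATES - 1];
}
```

Since the table's capacities are ordered, the first entry that covers
the demand is also the cheapest feasible one by capacity (though, as
the table shows, not always by power).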

> Currently, all freq governors take CPU utilization (load%) as the indicator
> (target), which can serve both: workload and perf scaling.

So the current cpufreq stuff is terminally broken in too many ways: it's
sampling-based, so it misses a lot of changes, and it's strictly CPU
local, so it completely misses SMP information (like the migrations
etc.).

If we move a 50% task from CPU1 to CPU0, a sampling thing takes time to
adjust on both CPUs, whereas if it's scheduler driven, we can instantly
adjust and be done, because we _know_ what we moved.

Now some of that is due to hysterical raisins, and some of that due to
broken hardware (hardware that needs to schedule in order to change its
state because its behind some broken bus or other). But we should
basically kill off cpufreq for anything recent and sane.

> As for IPC/watt optimized, I don't see how it can be practical. Too micro to
> be used for the general well-being?

What other target would you optimize for? The purpose here is to build
an energy aware scheduler, one that schedules tasks so that the total
amount of energy, for the given amount of work, is minimal.

So we can't measure in Watt, since if we forced the CPU into the lowest
P-state (or even C-state for that matter) work would simply not
complete. So we need a complete energy term.

Now. IPC is instructions/cycle, Watt is Joule/second, so IPC/Watt is

    instructions   seconds
    ------------ * ------- ~ instructions / joule
       cycle        joule

Seeing how both cycles and seconds are time units.

So for any given amount of instructions, the work needs to be done, we
want the minimal amount of energy consumed, and IPC/Watt is the natural
metric to measure this over an entire workload.
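
The dimensional argument above can be checked with a small helper
(illustrative; the numbers used in the comments are arbitrary):

```c
#include <assert.h>

/* instructions / joule = (instructions/cycle) * (cycles/second) / watts,
 * since a watt is a joule per second. Computed here from raw counts:
 * seconds elapsed follow from cycles and frequency, joules from
 * average power and elapsed time. */
static double inst_per_joule(double instructions, double cycles,
			     double freq_hz, double watts)
{
	double seconds = cycles / freq_hz;

	return instructions / (watts * seconds);
}
```

Note that forcing a lower P-state changes both `watts` and `seconds`,
which is exactly why a complete energy term is needed rather than a
plain power measurement.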



2014-06-10 17:01:47

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Tue, 10 Jun 2014, Peter Zijlstra wrote:

> So the current cpufreq stuff is terminally broken in too many ways: it's
> sampling-based, so it misses a lot of changes, and it's strictly CPU
> local, so it completely misses SMP information (like the migrations
> etc.).
>
> If we move a 50% task from CPU1 to CPU0, a sampling thing takes time to
> adjust on both CPUs, whereas if it's scheduler driven, we can instantly
> adjust and be done, because we _know_ what we moved.

Incidentally I submitted a LWN article highlighting those very issues
and the planned remedies. No confirmation of a publication date though.

> Now some of that is due to hysterical raisins, and some of that due to
> broken hardware (hardware that needs to schedule in order to change its
> state because its behind some broken bus or other). But we should
> basically kill off cpufreq for anything recent and sane.

Even if some change has to happen through a kernel thread, you're still
far better off with the scheduler requesting this change proactively than
waiting for the cpufreq governor to catch up with the load and then
waiting for the freq change thread to be scheduled.


Nicolas

2014-06-11 02:39:01

by Yuyang Du

[permalink] [raw]
Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Tue, Jun 10, 2014 at 12:16:22PM +0200, Peter Zijlstra wrote:
> What other target would you optimize for? The purpose here is to build
> an energy aware scheduler, one that schedules tasks so that the total
> amount of energy, for the given amount of work, is minimal.
>
> So we can't measure in Watt, since if we forced the CPU into the lowest
> P-state (or even C-state for that matter) work would simply not
> complete. So we need a complete energy term.
>
> Now. IPC is instructions/cycle, Watt is Joule/second, so IPC/Watt is
>
>     instructions   seconds
>     ------------ * ------- ~ instructions / joule
>        cycle        joule
>
> Seeing how both cycles and seconds are time units.
>
> So for any given amount of instructions, the work needs to be done, we
> want the minimal amount of energy consumed, and IPC/Watt is the natural
> metric to measure this over an entire workload.

Ok, I understand. Whether we take IPC/watt as an input metric in the
scheduler or as a goal for the scheduler, we definitely need to try both.

Thanks, Peter.

Yuyang

2014-06-11 11:03:05

by Eduardo Valentin

[permalink] [raw]
Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

Hello,

On Mon, Jun 09, 2014 at 09:22:49AM -0400, Nicolas Pitre wrote:
> On Mon, 9 Jun 2014, Morten Rasmussen wrote:
>
> > On Sat, Jun 07, 2014 at 03:33:58AM +0100, Nicolas Pitre wrote:
> > > On Fri, 6 Jun 2014, Ingo Molnar wrote:
> > >
> > > > In any case, even with turbo frequencies, switching power use is
> > > > probably an order of magnitude higher than leakage current power use,
> > > > on any marketable chip, so we should concentrate on being able to
> > > > cover this first order effect (P/work ~ V^2), before considering any
> > > > second order effects (leakage current).
> > >
> > > Just so that people are aware... We'll have to introduce thermal
> > > constraint management into the scheduler mix as well at some point.
> > > Right now what we have is an ad hoc subsystem that simply monitors
> > > temperature and apply crude cooling strategies when some thresholds are
> > > met. But a better strategy would imply thermal "provisioning".
> >
> > There is already work going on to improve thermal management:
> >
> > http://lwn.net/Articles/599598/
> >
> > The proposal is based on power/energy models (too). The goal is to

Can you please point me to the other piece of code which is using
power/energy models too? We are considering having these models within
the thermal software components. But if we already have more than one
user, it might be worth considering a separate API.

> > allocate power intelligently based on performance requirements.
>
> Ah, great! I missed that.
>
> > While it is related to energy-aware scheduling and I fully agree that it
> > is something we need to consider, I think it is worth developing the two
> > ideas in parallel and look at sharing things like the power model later
> > once things mature. Energy-aware scheduling is complex enough on its
> > own to keep us entertained for a while :-)
>
> Absolutely. This is why I said "at some point".
>
>
> Nicolas
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2014-06-11 11:42:11

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Wed, Jun 11, 2014 at 12:02:51PM +0100, Eduardo Valentin wrote:
> Hello,
>
> On Mon, Jun 09, 2014 at 09:22:49AM -0400, Nicolas Pitre wrote:
> > On Mon, 9 Jun 2014, Morten Rasmussen wrote:
> >
> > > On Sat, Jun 07, 2014 at 03:33:58AM +0100, Nicolas Pitre wrote:
> > > > On Fri, 6 Jun 2014, Ingo Molnar wrote:
> > > >
> > > > > In any case, even with turbo frequencies, switching power use is
> > > > > probably an order of magnitude higher than leakage current power use,
> > > > > on any marketable chip, so we should concentrate on being able to
> > > > > cover this first order effect (P/work ~ V^2), before considering any
> > > > > second order effects (leakage current).
> > > >
> > > > Just so that people are aware... We'll have to introduce thermal
> > > > constraint management into the scheduler mix as well at some point.
> > > > Right now what we have is an ad hoc subsystem that simply monitors
> > > > temperature and apply crude cooling strategies when some thresholds are
> > > > met. But a better strategy would imply thermal "provisioning".
> > >
> > > There is already work going on to improve thermal management:
> > >
> > > http://lwn.net/Articles/599598/
> > >
> > > The proposal is based on power/energy models (too). The goal is to
>
> Can you please point me to the other piece of code which is using
> power/energy models too? We are considering having these models within
> > the thermal software components. But if we already have more than one
> user, might be worth considering a separate API.

The link above is to the thermal management proposal which includes a
power model. This one might work better:

http://article.gmane.org/gmane.linux.power-management.general/45000

The power/energy model in this energy-aware scheduling proposal is
different. An example of the model data is in patch 6 (the start of this
thread) and the actual use of the model is in patch 11 and the following
patches. As said below, the two proposals are independent, but there
might be potential for merging the power/energy models once the
proposals are more mature.

Morten

>
> > > allocate power intelligently based on performance requirements.
> >
> > Ah, great! I missed that.
> >
> > > While it is related to energy-aware scheduling and I fully agree that it
> > > is something we need to consider, I think it is worth developing the two
> > > ideas in parallel and look at sharing things like the power model later
> > > once things mature. Energy-aware scheduling is complex enough on its
> > > own to keep us entertained for a while :-)
> >
> > Absolutely. This is why I said "at some point".
> >
> >
> > Nicolas

2014-06-11 12:02:41

by Eduardo Valentin

[permalink] [raw]
Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Wed, Jun 11, 2014 at 12:42:18PM +0100, Morten Rasmussen wrote:
> On Wed, Jun 11, 2014 at 12:02:51PM +0100, Eduardo Valentin wrote:
> > Hello,
> >
> > On Mon, Jun 09, 2014 at 09:22:49AM -0400, Nicolas Pitre wrote:
> > > On Mon, 9 Jun 2014, Morten Rasmussen wrote:
> > >
> > > > On Sat, Jun 07, 2014 at 03:33:58AM +0100, Nicolas Pitre wrote:
> > > > > On Fri, 6 Jun 2014, Ingo Molnar wrote:
> > > > >
> > > > > > In any case, even with turbo frequencies, switching power use is
> > > > > > probably an order of magnitude higher than leakage current power use,
> > > > > > on any marketable chip, so we should concentrate on being able to
> > > > > > cover this first order effect (P/work ~ V^2), before considering any
> > > > > > second order effects (leakage current).
> > > > >
> > > > > Just so that people are aware... We'll have to introduce thermal
> > > > > constraint management into the scheduler mix as well at some point.
> > > > > Right now what we have is an ad hoc subsystem that simply monitors
> > > > > temperature and applies crude cooling strategies when some thresholds are
> > > > > met. But a better strategy would imply thermal "provisioning".
> > > >
> > > > There is already work going on to improve thermal management:
> > > >
> > > > http://lwn.net/Articles/599598/
> > > >
> > > > The proposal is based on power/energy models (too). The goal is to
> >
> > Can you please point me to the other piece of code which is using
> > power/energy models too? We are considering having these models within
> > the thermal software components. But if we already have more than one
> > user, it might be worth considering a separate API.
>
> The link above is to the thermal management proposal which includes a
> power model. This one might work better:
>
> http://article.gmane.org/gmane.linux.power-management.general/45000
>
> The power/energy model in this energy-aware scheduling proposal is
> different. An example of the model data is in patch 6 (the start of this
> thread) and the actual use of the model is in patch 11 and the following
> patches. As said below, the two proposals are independent, but there
> might be potential for merging the power/energy models once the
> proposals are more mature.

Morten,

I am aware of the power allocator thermal governor, as I am reviewing
it. I am more interested in other users of power models, apart from
the thermal subsystem.

>
> Morten
>
> >
> > > > allocate power intelligently based on performance requirements.
> > >
> > > Ah, great! I missed that.
> > >
> > > > While it is related to energy-aware scheduling and I fully agree that it
> > > > is something we need to consider, I think it is worth developing the two
> > > > ideas in parallel and look at sharing things like the power model later
> > > > once things mature. Energy-aware scheduling is complex enough on its
> > > > own to keep us entertained for a while :-)
> > >
> > > Absolutely. This is why I said "at some point".
> > >
> > >
> > > Nicolas

2014-06-11 13:37:51

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

On Wed, Jun 11, 2014 at 12:43:26PM +0100, Eduardo Valentin wrote:
> On Wed, Jun 11, 2014 at 12:42:18PM +0100, Morten Rasmussen wrote:
> > On Wed, Jun 11, 2014 at 12:02:51PM +0100, Eduardo Valentin wrote:
> > > Hello,
> > >
> > > On Mon, Jun 09, 2014 at 09:22:49AM -0400, Nicolas Pitre wrote:
> > > > On Mon, 9 Jun 2014, Morten Rasmussen wrote:
> > > >
> > > > > On Sat, Jun 07, 2014 at 03:33:58AM +0100, Nicolas Pitre wrote:
> > > > > > On Fri, 6 Jun 2014, Ingo Molnar wrote:
> > > > > >
> > > > > > > In any case, even with turbo frequencies, switching power use is
> > > > > > > probably an order of magnitude higher than leakage current power use,
> > > > > > > on any marketable chip, so we should concentrate on being able to
> > > > > > > cover this first order effect (P/work ~ V^2), before considering any
> > > > > > > second order effects (leakage current).
> > > > > >
> > > > > > Just so that people are aware... We'll have to introduce thermal
> > > > > > constraint management into the scheduler mix as well at some point.
> > > > > > Right now what we have is an ad hoc subsystem that simply monitors
> > > > > > temperature and applies crude cooling strategies when some thresholds are
> > > > > > met. But a better strategy would imply thermal "provisioning".
> > > > >
> > > > > There is already work going on to improve thermal management:
> > > > >
> > > > > http://lwn.net/Articles/599598/
> > > > >
> > > > > The proposal is based on power/energy models (too). The goal is to
> > >
> > > Can you please point me to the other piece of code which is using
> > > power/energy models too? We are considering having these models within
> > > the thermal software components. But if we already have more than one
> > > user, it might be worth considering a separate API.
> >
> > The link above is to the thermal management proposal which includes a
> > power model. This one might work better:
> >
> > http://article.gmane.org/gmane.linux.power-management.general/45000
> >
> > The power/energy model in this energy-aware scheduling proposal is
> > different. An example of the model data is in patch 6 (the start of this
> > thread) and the actual use of the model is in patch 11 and the following
> > patches. As said below, the two proposals are independent, but there
> > might be potential for merging the power/energy models once the
> > proposals are more mature.
>
> Morten,
>
> I am aware of the power allocator thermal governor, as I am reviewing
> it. I am more interested in other users of power models, apart from
> the thermal subsystem.

The user in this proposal is the scheduler. The intention is to
eventually tie cpuidle and cpufreq closer to the scheduler. When/if that
happens, they might become users too.