LinuxLists.cc - Re: [RFC v3 0/5] Add capacity capping support to the CPU controller

2017-03-15 11:47:51

Subject: Re: [RFC v3 0/5] Add capacity capping support to the CPU controller

On Tuesday, February 28, 2017 02:38:37 PM Patrick Bellasi wrote:
> Was: SchedTune: central, scheduler-driven, power-performance control
>
> This series presents a possible alternative design for what has been presented
> in the past as SchedTune. This redesign has been defined to address the main
> concerns and comments collected in the LKML discussion [1] as well as at the
> last LPC [2].
> The aim of this posting is to present a working prototype which implements
> what has been discussed [2] with people like PeterZ, PaulT and TejunH.
>
> The main differences with respect to the previous proposal [1] are:
> 1. Task boosting/capping is now implemented as an extension on top of
>    the existing CGroup CPU controller.
> 2. The previous boosting strategy, based on the inflation of the CPU's
>    utilization, has now been replaced by a simpler yet effective set
>    of capacity constraints.
>
> The proposed approach allows constraining the minimum and maximum capacity
> of a CPU depending on the set of tasks currently RUNNABLE on that CPU.
> The set of active constraints is tracked by the core scheduler, thus they
> apply across all the scheduling classes. The values of the constraints are
> used to clamp the CPU utilization when the schedutil CPUFreq governor
> selects a frequency for that CPU.
>
> This means that the newly proposed approach allows extending the concept of
> task classification to frequency selection, thus allowing informed
> run-times (e.g. Android, ChromeOS, etc.) to efficiently implement different
> optimization policies such as:
> a) Boosting of important tasks, by enforcing a minimum capacity in the
>    CPUs where they are enqueued for execution.
> b) Capping of background tasks, by enforcing a maximum capacity.
> c) Containment of OPPs for RT tasks which cannot easily be switched to
>    the usage of the DL class, but still don't need to run at the maximum
>    frequency.

Do you have any practical examples of that, like for example what exactly
Android is going to use this for?

I gather that there is some experience with the current EAS implementation
there, so I wonder how this work is related to that.

Thanks,
Rafael

2017-03-15 13:00:43

by Patrick Bellasi

Subject: Re: [RFC v3 0/5] Add capacity capping support to the CPU controller

On 15-Mar 12:41, Rafael J. Wysocki wrote:
> Do you have any practical examples of that, like for example what exactly
> Android is going to use this for?

In general, every "informed run-time" usually knows quite a lot about
task requirements and how they impact the user experience.

In Android, for example, tasks are classified depending on their _current_
role. We can distinguish, for example, between:

- TOP_APP: tasks currently affecting the UI, i.e. part of
  the app currently in the foreground
- BACKGROUND: tasks not directly impacting the user
  experience

Given this information, it could make sense to adopt different
service/optimization policies for different tasks.
For example, we may want to give maximum responsiveness to TOP_APP
tasks while still saving as much energy as possible for BACKGROUND tasks.

That's where the proposal in this series (partially) comes in handy.

What we propose is a "standard" interface to collect sensible
information from "informed run-times" which can be used to:

a) classify tasks according to the main optimization goals:
   performance boosting vs energy saving

b) support a more dynamic tuning of kernel-side behaviors, mainly
   OPP selection and task placement

Regarding this last point, this series specifically represents a
proposal for the integration with schedutil. The main usages we are
looking for in Android are:

a) Boosting the OPP selected for certain critical tasks, with the goal
   of speeding up their completion regardless of (potential) energy impacts.
   A kind of "race-to-idle" policy for certain tasks.

b) Capping the OPP selection for certain non-critical tasks, which is
   a major concern especially for RT tasks in a mobile context, but
   it also applies to FAIR tasks representing background activities.
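
To make the boosting/capping idea concrete, here is a minimal Python sketch
(not the kernel code) of how a utilization value could be clamped between
capacity_min and capacity_max before schedutil maps it to a frequency. The
1.25 headroom factor mirrors the formula schedutil uses
(next_freq = 1.25 * max_freq * util / max). The function names, the OPP table
and the utilization values are assumptions made only for illustration.

```python
SCHED_CAPACITY_SCALE = 1024  # capacity of the biggest CPU at its top OPP


def clamp_util(util, cap_min=0, cap_max=SCHED_CAPACITY_SCALE):
    """Clamp a CPU utilization value into [cap_min, cap_max]."""
    return max(cap_min, min(util, cap_max))


def next_freq(util, max_freq_khz, opps_khz):
    """Return the lowest OPP >= 1.25 * max_freq * util / max (or the top OPP)."""
    target = (max_freq_khz + (max_freq_khz >> 2)) * util // SCHED_CAPACITY_SCALE
    for freq in sorted(opps_khz):
        if freq >= target:
            return freq
    return max(opps_khz)


# Hypothetical OPP table (kHz) for one frequency domain.
OPPS = [300_000, 600_000, 1_200_000, 1_800_000]
MAX_FREQ = max(OPPS)

# A lightly loaded CPU: load tracking alone asks for a low OPP.
unclamped = next_freq(200, MAX_FREQ, OPPS)
# Same load, but a boosted task (cap_min=512) is RUNNABLE on the CPU.
boosted = next_freq(clamp_util(200, cap_min=512), MAX_FREQ, OPPS)
# A busy CPU running only capped background work (cap_max=256).
capped = next_freq(clamp_util(900, cap_max=256), MAX_FREQ, OPPS)

print(unclamped, boosted, capped)
```

With these made-up numbers the boost raises the selected OPP from 600 MHz to
1.2 GHz, while the cap keeps a 900-utilization CPU at 600 MHz instead of the
top OPP.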

> I gather that there is some experience with the current EAS implementation
> there, so I wonder how this work is related to that.

You're right. We started developing a task boosting strategy a couple of
years ago. The first implementation we did is what is currently used
by the EAS version running on Pixel smartphones.

Since the beginning our attitude has always been "mainline first".
However, we found it extremely valuable to prove both the interface's
design and the feature's benefits on real devices. That's why we keep
backporting these bits to different Android kernels.

Google, whose primary representatives are in CC, is also quite focused
on using mainline solutions for their current and future solutions.
That's why, after the release of the Pixel devices at the end of last year,
we refreshed and posted the proposal on LKML [1] and collected a first
round of valuable feedback at LPC [2].

This posting reflects the feedback collected so far, and
our main goals are:
1) validate once more the soundness of a scheduler-driven run-time
   power-performance control which is based on information collected
   from informed run-times
2) get an agreement on whether the current interface can be considered
   sufficiently "mainline friendly" to have a chance to get merged
3) rework/refactor what is required if point 2 is not (yet) satisfied

It's worth noting that these bits are completely independent from
EAS. OPP biasing (i.e. capping/boosting) is a feature which stands by
itself and can be quite useful in many different scenarios where
EAS is not used at all. A simple example is making schedutil behave
concurrently like the powersave governor for certain tasks and the
performance governor for other tasks.
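
As a rough sketch of that last example, the Python below aggregates per-task
constraints into a single (min, max) pair for a CPU. The aggregation rule
used here (the most demanding RUNNABLE task wins for both bounds) is an
assumption for illustration, as are the Task type and the task names; the
series itself defines the actual policy.

```python
from dataclasses import dataclass

SCHED_CAPACITY_SCALE = 1024


@dataclass(frozen=True)
class Task:
    """Hypothetical task carrying the constraints of its group."""
    name: str
    cap_min: int = 0
    cap_max: int = SCHED_CAPACITY_SCALE


def cpu_clamps(runnable):
    """Aggregate the constraints of the RUNNABLE tasks on one CPU.

    Assumed rule: take the maximum of both bounds, so that a capped task
    never slows down an uncapped or boosted task sharing the CPU.
    """
    if not runnable:
        return 0, SCHED_CAPACITY_SCALE
    cap_min = max(t.cap_min for t in runnable)
    cap_max = max(t.cap_max for t in runnable)
    return cap_min, max(cap_min, cap_max)


def effective_util(util, runnable):
    """Utilization seen by the governor after applying the CPU clamps."""
    cap_min, cap_max = cpu_clamps(runnable)
    return max(cap_min, min(util, cap_max))


background = Task("bg-sync", cap_max=0)        # behaves like "powersave"
top_app = Task("ui-thread", cap_min=1024)      # behaves like "performance"

print(effective_util(700, [background]))           # lowest OPP requested
print(effective_util(100, [top_app]))              # highest OPP requested
print(effective_util(100, [background, top_app]))  # the boosted task wins
```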

As a final remark, this series is going to be a discussion topic in
the upcoming OSPM summit [3]. It would be nice if we can get there
with a sufficient knowledge of the main goals and the current status.
However, please let's keep discussing here about all the possible
concerns which can be raised about this proposal.

> Thanks,
> Rafael

Cheers Patrick

[1] https://lkml.org/lkml/2016/10/27/503
[2] https://lkml.org/lkml/2016/11/25/342
[3] http://retis.sssup.it/ospm-summit/

--
#include <best/regards.h>

Patrick Bellasi

2017-03-16 01:13:49

by Rafael J. Wysocki

Subject: Re: [RFC v3 0/5] Add capacity capping support to the CPU controller

On Wed, Mar 15, 2017 at 1:59 PM, Patrick Bellasi
<[email protected]> wrote:
> On 15-Mar 12:41, Rafael J. Wysocki wrote:
>> Do you have any practical examples of that, like for example what exactly
>> Android is going to use this for?
>
> In general, every "informed run-time" usually knows quite a lot about
> task requirements and how they impact the user experience.
>
> In Android, for example, tasks are classified depending on their _current_
> role. We can distinguish, for example, between:
>
> - TOP_APP: tasks currently affecting the UI, i.e. part of
>   the app currently in the foreground
> - BACKGROUND: tasks not directly impacting the user
>   experience
>
> Given this information, it could make sense to adopt different
> service/optimization policies for different tasks.
> For example, we may want to give maximum responsiveness to TOP_APP
> tasks while still saving as much energy as possible for BACKGROUND tasks.
>
> That's where the proposal in this series (partially) comes in handy.

A question: Does "responsiveness" translate directly to "capacity" somehow?

Moreover, how exactly is "responsiveness" defined?

> What we propose is a "standard" interface to collect sensible
> information from "informed run-times" which can be used to:
>
> a) classify tasks according to the main optimization goals:
>    performance boosting vs energy saving
>
> b) support a more dynamic tuning of kernel-side behaviors, mainly
>    OPP selection and task placement
>
> Regarding this last point, this series specifically represents a
> proposal for the integration with schedutil. The main usages we are
> looking for in Android are:
>
> a) Boosting the OPP selected for certain critical tasks, with the goal
>    of speeding up their completion regardless of (potential) energy impacts.
>    A kind of "race-to-idle" policy for certain tasks.

It looks like this could be addressed by adding a "this task should
race to idle" flag too.

> b) Capping the OPP selection for certain non-critical tasks, which is
>    a major concern especially for RT tasks in a mobile context, but
>    it also applies to FAIR tasks representing background activities.

Well, is the information on how much CPU capacity to assign to those
tasks really there in user space? What's the source of it if so?

>> I gather that there is some experience with the current EAS implementation
>> there, so I wonder how this work is related to that.
>
> You're right. We started developing a task boosting strategy a couple of
> years ago. The first implementation we did is what is currently used
> by the EAS version running on Pixel smartphones.
>
> Since the beginning our attitude has always been "mainline first".
> However, we found it extremely valuable to prove both the interface's
> design and the feature's benefits on real devices. That's why we keep
> backporting these bits to different Android kernels.
>
> Google, whose primary representatives are in CC, is also quite focused
> on using mainline solutions for their current and future solutions.
> That's why, after the release of the Pixel devices at the end of last year,
> we refreshed and posted the proposal on LKML [1] and collected a first
> round of valuable feedback at LPC [2].

Thanks for the info, but my question was more about how it was related
from the technical angle. IOW, there surely is some experience
related to how user space can deal with energy problems and I would
expect that experience to be an important factor in designing a kernel
interface for that user space, so I wonder if any particular needs of
the Android user space are addressed here.

I'm not intimately familiar with Android, so I guess I would like to
be educated somewhat on that. :-)

> This posting reflects the feedback collected so far, and
> our main goals are:
> 1) validate once more the soundness of a scheduler-driven run-time
>    power-performance control which is based on information collected
>    from informed run-times
> 2) get an agreement on whether the current interface can be considered
>    sufficiently "mainline friendly" to have a chance to get merged
> 3) rework/refactor what is required if point 2 is not (yet) satisfied

My definition of "mainline friendly" may be different from someone
else's, but I usually want to know two things:
1. What problem exactly is at hand.
2. What alternative ways of addressing it have been considered and
   why the particular one proposed has been chosen over the other ones.

At the moment I don't feel like I have enough information on either aspect.

For example, if you said "Android wants to do XYZ because of ABC and
that's how we want to make that possible, and it also could be done in
the other GHJ ways, but they are not attractive and here's why etc"
that would help quite a bit from my POV.

> It's worth noting that these bits are completely independent from
> EAS. OPP biasing (i.e. capping/boosting) is a feature which stands by
> itself and can be quite useful in many different scenarios where
> EAS is not used at all. A simple example is making schedutil behave
> concurrently like the powersave governor for certain tasks and the
> performance governor for other tasks.

That's fine in theory, but honestly an interface like this will be a
maintenance burden and adding it just because it may be useful to
somebody does not sound serious enough.

IOW, I'd like to be able to say "This is going to be used by user
space X to do A and that's how etc" if somebody asks me about that,
which honestly I can't at this point.

>
> As a final remark, this series is going to be a discussion topic in
> the upcoming OSPM summit [3]. It would be nice if we can get there
> with a sufficient knowledge of the main goals and the current status.

I'm not sure what you mean here, sorry.

> However, please let's keep discussing here about all the possible
> concerns which can be raised about this proposal.

OK

Thanks,
Rafael

2017-03-16 03:16:01

by Joel Fernandes

Subject: Re: [RFC v3 0/5] Add capacity capping support to the CPU controller

Hi Rafael,

On Wed, Mar 15, 2017 at 6:04 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Wed, Mar 15, 2017 at 1:59 PM, Patrick Bellasi
>>> Do you have any practical examples of that, like for example what exactly
>>> Android is going to use this for?
>>
>> In general, every "informed run-time" usually knows quite a lot about
>> task requirements and how they impact the user experience.
>>
>> In Android, for example, tasks are classified depending on their _current_
>> role. We can distinguish, for example, between:
>>
>> - TOP_APP: tasks currently affecting the UI, i.e. part of
>>   the app currently in the foreground
>> - BACKGROUND: tasks not directly impacting the user
>>   experience
>>
>> Given this information, it could make sense to adopt different
>> service/optimization policies for different tasks.
>> For example, we may want to give maximum responsiveness to TOP_APP
>> tasks while still saving as much energy as possible for BACKGROUND tasks.
>>
>> That's where the proposal in this series (partially) comes in handy.
>
> A question: Does "responsiveness" translate directly to "capacity" somehow?
>
> Moreover, how exactly is "responsiveness" defined?

Responsiveness is basically how quickly the UI is responding to user
interaction after doing its computation, application-logic and
rendering. Android apps have 2 important threads, the main thread (or
UI thread) which does all the work and computation for the app, and a
Render thread which does the rendering and submission of frames to
display pipeline for further composition and display.

We wish to bias towards performance rather than energy for this work,
since it is front-facing to the user and we don't care much about
energy for these tasks at this point. What's most critical is
completion as quickly as possible so the user experience doesn't
suffer from a noticeable performance issue.

One metric to define this is "Jank", where we drop frames and aren't
able to render on time. One of the reasons this can happen is that the
main thread (UI thread) took longer than expected for some
computation. Whatever the interface - we'd just like to bias the
scheduling and frequency guidance to be more concerned with
performance and less with energy. And use this information for both
frequency selection and task placement. 'What we need' is also app
dependent since every app has its own main thread and is free to
compute whatever it needs. So Android can't estimate this - but we do
know that this app is user facing so in broad terms the interface is
used to say please don't sacrifice performance for these top-apps -
without accurately defining what these performance needs really are
because we don't know it.
For the YouTube app, for example, the complexity of the video decoding and
the frame rate are very variable depending on the encoding scheme and
the video being played. The flushing of the frames through the display
pipeline is also variable (frame rate depends on the video being
decoded), so this work is variable and we can't say for sure in
definitive terms how much capacity we need.
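
As a rough illustration of the "Jank" metric described above, the sketch
below counts frames whose render time exceeds the frame budget. The 60 Hz
default, the function name and the frame times are made-up values, and real
Android tooling measures this differently; it is only meant to show what
"not rendering on time" means numerically.

```python
def count_jank(frame_times_ms, refresh_hz=60):
    """Count frames that took longer than one refresh period to produce."""
    budget_ms = 1000.0 / refresh_hz
    return sum(1 for t in frame_times_ms if t > budget_ms)


# Hypothetical per-frame render times in milliseconds.
frames = [8.1, 12.4, 16.0, 21.7, 9.3, 35.2, 15.9]

print(count_jank(frames))                 # frames missing a 60 Hz deadline
print(count_jank(frames, refresh_hz=90))  # same frames on a 90 Hz display
```

The same workload produces more jank on a faster display, which is one
reason the required capacity cannot be stated once and for all.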

What we can do with Patrick's work is take the worst case
based on measurements and specify that we need at least this much
capacity regardless of what load-tracking thinks we need, and then we
can scale frequency accordingly. This is the usecase for the minimum
capacity in his clamping patch. This is still not perfect in terms of
defining something accurately because we don't even know how much we
need, but at least in broad terms we have some way of telling the
governor to maintain at least X capacity.

For the clamping of maximum capacity, there are usecases like
background tasks, as Patrick said, but also usecases where we don't
want to run at max frequency even though load-tracking thinks that we
need to. For example, there are cases such as foreground camera
tasks, where we want to provide sustainable performance without
entering thermal throttling, so the capping will help there.

>> What we propose is a "standard" interface to collect sensible
>> information from "informed run-times" which can be used to:
>>
>> a) classify tasks according to the main optimization goals:
>>    performance boosting vs energy saving
>>
>> b) support a more dynamic tuning of kernel-side behaviors, mainly
>>    OPP selection and task placement
>>
>> Regarding this last point, this series specifically represents a
>> proposal for the integration with schedutil. The main usages we are
>> looking for in Android are:
>>
>> a) Boosting the OPP selected for certain critical tasks, with the goal
>>    of speeding up their completion regardless of (potential) energy impacts.
>>    A kind of "race-to-idle" policy for certain tasks.
>
> It looks like this could be addressed by adding a "this task should
> race to idle" flag too.

But he said 'kind-of' race-to-idle. Racing to idle all the time for
ex. at max frequency will be wasteful of energy so although we don't
care about energy much for top-apps, we do care a bit.

>
>> b) Capping the OPP selection for certain non-critical tasks, which is
>>    a major concern especially for RT tasks in a mobile context, but
>>    it also applies to FAIR tasks representing background activities.
>
> Well, is the information on how much CPU capacity to assign to those
> tasks really there in user space? What's the source of it if so?

I believe this is just a matter of tuning and modeling for what is
needed, for ex. to prevent thermal throttling as I mentioned and also
to ensure background activities aren't running at the highest frequency
and consuming excessive energy. Racing to idle at a higher
frequency is more expensive in energy than running slower to idle, since
we run at higher voltages at higher frequencies and the slope of the
perf/W curve is steeper: P = C * V^2 * F. So the V component being
higher just drains more power quadratically, which is of no use to
background tasks. In fact, in some tests, we're just as happy with
setting them at much lower frequencies than what load-tracking thinks
is needed.
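
To put rough numbers on the P = C * V^2 * F argument, here is a small Python
sketch comparing the dynamic energy needed to finish a fixed amount of work
at two operating points. The capacitance, voltages, frequencies and cycle
count are made-up values, and the model deliberately ignores static/leakage
power and idle-state savings, which can shift the balance on real hardware.

```python
def dynamic_power_w(c_eff, volts, freq_hz):
    """Dynamic power: P = C * V^2 * F."""
    return c_eff * volts ** 2 * freq_hz


def energy_j(c_eff, volts, freq_hz, cycles):
    """Energy to retire `cycles` cycles: power multiplied by run time."""
    run_time_s = cycles / freq_hz
    return dynamic_power_w(c_eff, volts, freq_hz) * run_time_s


C_EFF = 1e-9      # hypothetical effective switched capacitance (farads)
WORK = 1.8e9      # cycles of background work to complete

slow = energy_j(C_EFF, volts=0.8, freq_hz=600e6, cycles=WORK)
fast = energy_j(C_EFF, volts=1.1, freq_hz=1800e6, cycles=WORK)

print(f"slow OPP: {slow:.3f} J, fast OPP: {fast:.3f} J")
```

In this simplified model F cancels out of the energy, so the cost of the work
scales with V^2 alone: the fast OPP finishes three times sooner but spends
roughly 1.9x the energy for the same work.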

>>> I gather that there is some experience with the current EAS implementation
>>> there, so I wonder how this work is related to that.
>>
>> You're right. We started developing a task boosting strategy a couple of
>> years ago. The first implementation we did is what is currently used
>> by the EAS version running on Pixel smartphones.
>>
>> Since the beginning our attitude has always been "mainline first".
>> However, we found it extremely valuable to prove both the interface's
>> design and the feature's benefits on real devices. That's why we keep
>> backporting these bits to different Android kernels.
>>
>> Google, whose primary representatives are in CC, is also quite focused
>> on using mainline solutions for their current and future solutions.
>> That's why, after the release of the Pixel devices at the end of last year,
>> we refreshed and posted the proposal on LKML [1] and collected a first
>> round of valuable feedback at LPC [2].
>
> Thanks for the info, but my question was more about how it was related
> from the technical angle. IOW, there surely is some experience
> related to how user space can deal with energy problems and I would
> expect that experience to be an important factor in designing a kernel
> interface for that user space, so I wonder if any particular needs of
> the Android user space are addressed here.
>
> I'm not intimately familiar with Android, so I guess I would like to
> be educated somewhat on that. :-)

Hope this sheds some light on the Android side of things a bit.

Regards,
Joel

2017-03-16 12:23:33

by Patrick Bellasi

Subject: Re: [RFC v3 0/5] Add capacity capping support to the CPU controller

On 16-Mar 02:04, Rafael J. Wysocki wrote:
> On Wed, Mar 15, 2017 at 1:59 PM, Patrick Bellasi
> <[email protected]> wrote:
> > On 15-Mar 12:41, Rafael J. Wysocki wrote:
> >> Do you have any practical examples of that, like for example what exactly
> >> Android is going to use this for?
> >
> > In general, every "informed run-time" usually knows quite a lot about
> > task requirements and how they impact the user experience.
> >
> > In Android, for example, tasks are classified depending on their _current_
> > role. We can distinguish, for example, between:
> >
> > - TOP_APP: tasks currently affecting the UI, i.e. part of
> >   the app currently in the foreground
> > - BACKGROUND: tasks not directly impacting the user
> >   experience
> >
> > Given this information, it could make sense to adopt different
> > service/optimization policies for different tasks.
> > For example, we may want to give maximum responsiveness to TOP_APP
> > tasks while still saving as much energy as possible for BACKGROUND tasks.
> >
> > That's where the proposal in this series (partially) comes in handy.
>
> A question: Does "responsiveness" translate directly to "capacity" somehow?
>
> Moreover, how exactly is "responsiveness" defined?

A) "responsiveness" correlates somehow with "capacity". It's subject
to profiling which, for some critical system components, can be
done in an app-independent way.

Optimization of the rendering pipeline is an example. Other system
services, which are provided by Android to all applications, are
also examples of where the integrator can tune and optimize to
give benefits across all apps.

B) the definition of "responsiveness", from a certain perspective, is
more "qualitative" than "quantitative".

Android is aware of different "application contexts", TOP_APP vs
FOREGROUND is just an example (there are others).
Thus, the run-time has the knowledge about the "qualitative
responsiveness" required by each context.

Moreover, Android integrators know about the specific HW they are
targeting. This knowledge, in addition to the "application
contexts", in our experience allows Android to feed valuable
input to both the scheduler and schedutil.

Of course, as Joel pointed out in his previous response,
responsiveness also has a "quantitative" definition, where "jank
frames" is the main metric in the Android world. With the
proposed interface we provide a useful way for integrators to
tune their platform for the power-vs-performance trade-off they
prefer.

> > What we propose is a "standard" interface to collect sensible
> > information from "informed run-times" which can be used to:
> >
> > a) classify tasks according to the main optimization goals:
> >    performance boosting vs energy saving
> >
> > b) support a more dynamic tuning of kernel-side behaviors, mainly
> >    OPP selection and task placement
> >
> > Regarding this last point, this series specifically represents a
> > proposal for the integration with schedutil. The main usages we are
> > looking for in Android are:
> >
> > a) Boosting the OPP selected for certain critical tasks, with the goal
> >    of speeding up their completion regardless of (potential) energy impacts.
> >    A kind of "race-to-idle" policy for certain tasks.
>
> It looks like this could be addressed by adding a "this task should
> race to idle" flag too.

With the proposed interface we don't need an additional flag. If you
set capacity_min=capacity_max=1024 then you are informing schedutil,
and the scheduler as well, that this task would like to race-to-idle.

I say "would like" because here we are not proposing a mandatory
interface but we are still in the domain of "best effort" guarantees.
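
As a sketch of how a run-time might express that request, the Python below
writes both constraints for one task group. Only the names capacity_min and
capacity_max come from this thread; the attribute file names
cpu.capacity_min and cpu.capacity_max, the helper name and the directory
layout are assumptions, so the demo writes into a temporary directory rather
than a real cgroup mount point.

```python
import tempfile
from pathlib import Path

SCHED_CAPACITY_SCALE = 1024


def set_capacity_clamps(cgroup_dir, cap_min, cap_max):
    """Write the capacity constraints for one task group.

    The attribute file names below are hypothetical; check the patches for
    the names actually exposed by the CPU controller.
    """
    if not 0 <= cap_min <= cap_max <= SCHED_CAPACITY_SCALE:
        raise ValueError("expected 0 <= cap_min <= cap_max <= 1024")
    group = Path(cgroup_dir)
    (group / "cpu.capacity_min").write_text(f"{cap_min}\n")
    (group / "cpu.capacity_max").write_text(f"{cap_max}\n")


# Demo against a throwaway directory standing in for a "top-app" group.
with tempfile.TemporaryDirectory() as fake_group:
    # capacity_min = capacity_max = 1024: "this group would like to race to idle"
    set_capacity_clamps(fake_group, 1024, 1024)
    written_min = (Path(fake_group) / "cpu.capacity_min").read_text().strip()
    written_max = (Path(fake_group) / "cpu.capacity_max").read_text().strip()

print(written_min, written_max)
```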

> > b) Capping the OPP selection for certain non-critical tasks, which is
> >    a major concern especially for RT tasks in a mobile context, but
> >    it also applies to FAIR tasks representing background activities.
>
> Well, is the information on how much CPU capacity to assign to those
> tasks really there in user space? What's the source of it if so?

I think my previous comment, two paragraphs above, should have
helped to address this question.

I'm still wondering if you are after a formal, scientific and
mathematical definition of CPU capacity demands?
Because in that case it's worth stressing that this is not the aim of
the proposed interface.

If you have such detailed information you are probably better
positioned to go for a different solution, perhaps using DEADLINE.
If instead you are dealing with FAIR tasks but still find the
(completely application-context transparent) in-kernel utilization
tracking mechanism insufficient, then you can give value to any kind of
user-space input about task requirements in each and every instant.

Notice that these requirements are not set by tasks themselves but
instead they come from the run-time knowledge.
Thus, the main point is not "how to precisely measure CPU demands" but
how to feed additional and useful _context sensitive_ information from
user-space to kernel-space.

> >> I gather that there is some experience with the current EAS implementation
> >> there, so I wonder how this work is related to that.
> >
> > You're right. We started developing a task boosting strategy a couple of
> > years ago. The first implementation we did is what is currently used
> > by the EAS version running on Pixel smartphones.
> >
> > Since the beginning our attitude has always been "mainline first".
> > However, we found it extremely valuable to prove both the interface's
> > design and the feature's benefits on real devices. That's why we keep
> > backporting these bits to different Android kernels.
> >
> > Google, whose primary representatives are in CC, is also quite focused
> > on using mainline solutions for their current and future solutions.
> > That's why, after the release of the Pixel devices at the end of last year,
> > we refreshed and posted the proposal on LKML [1] and collected a first
> > round of valuable feedback at LPC [2].
>
> Thanks for the info, but my question was more about how it was related
> from the technical angle. IOW, there surely is some experience
> related to how user space can deal with energy problems and I would
> expect that experience to be an important factor in designing a kernel
> interface for that user space, so I wonder if any particular needs of
> the Android user space are addressed here.

We are not addressing specific needs of the Android user-space,
although we used Android as our main design and testing support
vehicle.
Still, the concepts covered by this proposal aim to be suitable for a
better integration of any "informed run-time" running on top of the
Linux kernel.

> I'm not intimately familiar with Android, so I guess I would like to
> be educated somewhat on that. :-)

Android is just one of such possible run-times, and a notable
representative of the mobile world.

ChromeOS is another notable potential user, which is mainly
representative of the laptops/clamshell world.

Finally, every "container manager", mainly used in the server domain,
can potentially benefit from the proposed interface (e.g.
Kubernetes).

The point here is that we have many different instances of user-space
run-times which know a lot more about the "user-space contexts" than
what we can aim to figure out by just working in kernel-space.

What we propose is a simple, best-effort and generic interface to feed
some of this information to kernel-space, thus supporting and
integrating already available policies and mechanisms.

> > This posting is an expression of the feedbacks collected so far and
> > the main goal for us are:
> > 1) validate once more the soundness of a scheduler-driven run-time
> > power-performance control which is based on information collected
> > from informed run-time
> > 2) get an agreement on whether the current interface can be considered
> > sufficiently "mainline friendly" to have a chance to get merged
> > 3) rework/refactor what is required if point 2 is not (yet) satisfied
>
> My definition of "mainline friendly" may be different from a someone
> else's one, but I usually want to know two things:
> 1. What problem exactly is at hand.

Feed "context aware" information about tasks requirements from
"informed run-times" to kernel-space to integrate/improve existing
decision policies for OPPs selections and tasks placement.

> 2. What alternative ways of addressing it have been considered and

We initially considered and evaluated what was possible to achieve by
just using existing APIs.
For example, we considered different combinations of:

- tuning task-affinity: which sounds too much like scheduling from
user-space and does not have biasing on OPPs selection.

- tuning tasks-priorities: which is a concept mainly devoted to
partitioning of the available bandwidth among RUNNABLE tasks within
the same CPU.

- tuning 'cpusets' and/or 'cpu' controllers: which can be used to bias
task placement but still it sounds like scheduling from user-space
and they are missing the biasing on OPPs selection.

None of these interfaces was completely satisfying, mainly because
using them this way seemed to abuse them for a different scope.

Since the main goals are to bias OPP selection and tasks placement
based on application context, what we identified _initially_ was a
new CGroup based interface to tag tasks with a "boost" value.
That proposal [1] has been considered not suitable for a proper
kernel integration and thus, discussing with PeterZ, Tejun and PaulT
we identified a different proposal [2] which is what this series
implements.

> why the particular one proposed has been chosen over the other ones.

The current proposal has been chosen because:

1) it satisfies the main goal to have a simple interface which allows
"informed run-times" (like Android but not limited to it) to feed
"context aware" information related to user-space applications.

2) it allows using this information to bias existing policies for
both "OPP selection" (presented in this series) as well as "task
placement" (as an extension on top of this series).

3) it extends the existing CPU controller, which is already devoted to
controlling the available CPU bandwidth, thus allowing for a consistent
view on how this resource is allocated to tasks.

4) it does not enforce by default any new/different behaviors (for
example on OPP selection) but it just opens possibilities for finer
tuning whenever necessary.

5) it has almost negligible run-time overhead, mainly defined by the
complexity of a couple of RBTree operations for each task
wakeup/suspend.

> At the moment I don't feel like I have enough information in both aspects.

Hope the previous points cast some light on both aspects.

> For example, if you said "Android wants to do XYZ because of ABC and
> that's how we want to make that possible, and it also could be done in
> the other GHJ ways, but they are not attractive and here's why etc"
> that would help quite a bit from my POV.

The main issue with the other solutions we evaluated so far is that they
are missing a clean and simple interface to express "context awareness"
at a task group level.

CGroups is the Linux framework devoted to the collection and tracking
of task group properties. What we propose leverages this concept by
extending it just as much as required to support the dual goal of
biasing "OPPs selection" and "tasks placement" without really
requiring to re-implement these concepts in user-space.

Do you see other possible solutions?

> > It's worth to notice that these bits are completely independent from
> > EAS. OPP biasing (i.e. capping/boosting) is a feature which stand by
> > itself and it can be quite useful in many different scenarios where
> > EAS is not used at all. A simple example is making schedutil to behave
> > concurrently like the powersave governor for certain tasks and the
> > performance governor for other tasks.
>
> That's fine in theory, but honestly an interface like this will be a
> maintenance burden and adding it just because it may be useful to
> somebody sounds not serious enough.

Actually, it is already useful to "someone". Google is using something
similar on Pixel devices and in the future it will likely be adopted
by other smartphones.

Here we are just trying to push it mainline to make it available also
to all the other potential clients I've described before.

> IOW, I'd like to be able to say "This is going to be used by user
> space X to do A and that's how etc" if somebody asks me about that
> which honestly I can't at this point.

In that case, again I think we have a strong case for "this is going
to be used by".

> > As a final remark, this series is going to be a discussion topic in
> > the upcoming OSPM summit [3]. It would be nice if we can get there
> > with a sufficient knowledge of the main goals and the current status.
>
> I'm not sure what you mean here, sorry.

Just that I like this discussion and I would like to get some sort of
initial agreement at least on basic concepts, requirements and
use-cases before OSPM.

That would allow us to be more active on the technical details side
during the summit and, hopefully, come to the definition of a roadmap
detailing the required steps to get a suitable interface merged,
whether it is the one proposed by this series or another achieving the
same goals.

> > However, please let's keep discussing here about all the possible
> > concerns which can be raised about this proposal.
>
> OK
>
> Thanks,
> Rafael

[1] https://lkml.org/lkml/2016/10/27/503
[2] https://lkml.org/lkml/2016/11/25/342

--
#include <best/regards.h>

Patrick Bellasi

2017-03-20 22:51:43

by Rafael J. Wysocki

Subject: Re: [RFC v3 0/5] Add capacity capping support to the CPU controller

On Thu, Mar 16, 2017 at 4:15 AM, Joel Fernandes <[email protected]> wrote:
> Hi Rafael,

Hi,

> On Wed, Mar 15, 2017 at 6:04 PM, Rafael J. Wysocki <[email protected]> wrote:
>> On Wed, Mar 15, 2017 at 1:59 PM, Patrick Bellasi
>>>> Do you have any practical examples of that, like for example what exactly
>>>> Android is going to use this for?
>>>
>>> In general, every "informed run-time" usually know quite a lot about
>>> tasks requirements and how they impact the user experience.
>>>
>>> In Android for example tasks are classified depending on their _current_
>>> role. We can distinguish for example between:
>>>
>>> - TOP_APP: which are tasks currently affecting the UI, i.e. part of
>>> the app currently in foreground
>>> - BACKGROUND: which are tasks not directly impacting the user
>>> experience
>>>
>>> Given these information it could make sense to adopt different
>>> service/optimization policy for different tasks.
>>> For example, we can be interested in
>>> giving maximum responsiveness to TOP_APP tasks while we still want to
>>> be able to save as much energy as possible for the BACKGROUND tasks.
>>>
>>> That's where the proposal in this series (partially) comes on hand.
>>
>> A question: Does "responsiveness" translate directly to "capacity" somehow?
>>
>> Moreover, how exactly is "responsiveness" defined?
>
> Responsiveness is basically how quickly the UI is responding to user
> interaction after doing its computation, application-logic and
> rendering. Android apps have 2 important threads, the main thread (or
> UI thread) which does all the work and computation for the app, and a
> Render thread which does the rendering and submission of frames to
> display pipeline for further composition and display.
>
> We wish to bias towards performance rather than energy for this work since
> this is front-facing to the user and we don't care much about
> energy for these tasks at this point. What's most critical is
> completion as quickly as possible so the user experience doesn't
> suffer from a performance issue that is noticeable.
>
> One metric to define this is "Jank" where we drop frames and aren't
> able to render on time. One of the reasons this can happen is that the
> main thread (UI thread) took longer than expected for some
> computation. Whatever the interface - we'd just like to bias the
> scheduling and frequency guidance to be more concerned with
> performance and less with energy. And use this information for both
> frequency selection and task placement. 'What we need' is also app
> dependent since every app has its own main thread and is free to
> compute whatever it needs. So Android can't estimate this - but we do
> know that this app is user facing so in broad terms the interface is
> used to say please don't sacrifice performance for these top-apps -
> without accurately defining what these performance needs really are
> because we don't know it.
> For YouTube app for example, the complexity of the video decoding and
> the frame rate are very variable depending on the encoding scheme and
> the video being played. The flushing of the frames through the display
> pipeline is also variable (frame rate depends on the video being
> decoded), so this work is variable and we can't say for sure in
> definitive terms how much capacity we need.
>
> What we can do with Patrick's work is take the worst case
> based on measurements and specify, say, that we need at least this much
> capacity regardless of what load-tracking thinks we need, and then we
> can scale frequency accordingly. This is the use case for the minimum
> capacity in his clamping patch. This is still not perfect in terms of
> defining something accurately because we don't even know how much we
> need, but at least in broad terms we have some way of telling the
> governor to maintain at least X capacity.

First off, it all seems to depend a good deal on what your
expectations regarding the in-kernel performance scaling are.

You seem to be expecting it to decide whether or not to sacrifice some
performance for energy savings, but it can't do that really, simply
because it has no guidance on that. It doesn't know how much
performance (or capacity) it can trade for a given amount of energy,
for example.

What it can do and what I expect it to be doing is to avoid
maintaining excess capacity (maintaining capacity is expensive in
general and a clear waste if the capacity is not actually used).

For instance, if you take the schedutil governor, it doesn't do
anything really fancy. It just attempts to set a frequency sufficient
to run the given workload without slowing it down artificially, but
not much higher than that, and that's not based on any arcane
energy-vs-performance considerations. It's based on an (arguably
vague) idea about how fast should be sufficient.

So if you want to say "please don't sacrifice performance for these
top-apps" to it, chances are it will not understand what you are
asking it for. :-)

It only may take the minimum capacity limit for a task as a correction
to its idea about how fast is sufficient in this particular case (and
energy doesn't even enter the picture at this point). Now, of course,
its idea about what should be sufficient may be entirely incorrect for
some reason, but then the question really is: why? And whether or not
it can be fixed without supplying corrections from user space in a
very direct way.

What you are saying generally indicates that you see under-provisioned
tasks and that's rather not because the kernel tries to sacrifice
performance for energy. Maybe the CPU utilization is under-estimated
by schedutil or the scheduler doesn't give enough time to these
particular tasks for some reason. In any case, having a way to set a
limit from user space may allow you to work around these issues quite
bluntly and is not a solution. And even if the underlying problems
are solved, the user space interface will stay there and will have to
be maintained going forward.

Also when you set a minimum frequency limit from user space, you may
easily over-provision the task and that would defeat the purpose of
what the kernel tries to achieve.

> For the clamping of maximum capacity, there are usecases like
> background tasks like Patrick said, but also usecases where we don't
> want to run at max frequency even though load-tracking thinks that we
> need to. For example, there are case where for foreground camera
> tasks, where we want to provide sustainable performance without
> entering thermal throttling, so the capping will help there.

Fair enough.

To me, that case is more compelling than the previous one, but again
I'm not sure if the ability to set a specific capacity limit may fit
the bill entirely. You need to know what limit to set in the first
place (and that may depend on multiple factors in principle) and then
you may need to adjust it over time and so on.

>>> What we propose is a "standard" interface to collect sensible
>>> information from "informed run-times" which can be used to:
>>>
>>> a) classify tasks according to the main optimization goals:
>>> performance boosting vs energy saving
>>>
>>> b) support a more dynamic tuning of kernel side behaviors, mainly
>>> OPPs selection and tasks placement
>>>
>>> Regarding this last point, this series specifically represents a
>>> proposal for the integration with schedutil. The main usages we are
>>> looking for in Android are:
>>>
>>> a) Boosting the OPP selected for certain critical tasks, with the goal
>>> to speed-up their completion regardless of (potential) energy impacts.
>>> A kind-of "race-to-idle" policy for certain tasks.
>>
>> It looks like this could be addressed by adding a "this task should
>> race to idle" flag too.
>
> But he said 'kind-of' race-to-idle. Racing to idle all the time for
> ex. at max frequency will be wasteful of energy so although we don't
> care about energy much for top-apps, we do care a bit.

You actually don't know whether or not it will be wasteful and there
may even be differences from workload to workload on the same system
in that respect.

>>
>>> b) Capping the OPP selection for certain non critical tasks, which is
>>> a major concerns especially for RT tasks in mobile context, but
>>> it also apply to FAIR tasks representing background activities.
>>
>> Well, is the information on how much CPU capacity assign to those
>> tasks really there in user space? What's the source of it if so?
>
> I believe this is just a matter of tuning and modeling for what is
> needed. For ex. to prevent thermal throttling as I mentioned and also
> to ensure background activities aren't running at highest frequency
> and consuming excessive energy (since racing to idle at higher
> frequency is more expensive in energy than running slower to idle, since
> we run at higher voltages at higher frequency and the slope of the
> perf/W curve is steeper - P = C * V^2 * F). So the V component being
> higher just drains more power quadratically, which is of no use to
> background tasks - in fact in some tests, we're just as happy with
> setting them at much lower frequencies than what load-tracking thinks
> is needed.
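
As a rough illustration of the dynamic power relation quoted above, the
following sketch computes the energy needed to retire a fixed amount of
work at a given operating point. It models switching power only, and the
function names and all numeric values used with it are assumptions made
up for illustration, not measurements.

```c
#include <assert.h>

/* Dynamic (switching) power only: P = C * V^2 * f. Leakage is ignored. */
static double dynamic_power_w(double c_eff, double volt, double freq_hz)
{
    return c_eff * volt * volt * freq_hz;
}

/*
 * Energy to retire a fixed number of cycles at one operating point:
 * E = P * t, with t = cycles / f, hence E = C * V^2 * cycles.
 */
static double energy_j(double c_eff, double volt, double freq_hz, double cycles)
{
    assert(freq_hz > 0.0);
    return dynamic_power_w(c_eff, volt, freq_hz) * (cycles / freq_hz);
}
```

With made-up numbers (C = 1e-9 F, 1e9 cycles), 1.0 V at 2 GHz costs about
1.0 J while 0.8 V at 1 GHz costs about 0.64 J, at the price of twice the
run time. Static power, which this sketch ignores, accumulates with run
time, so the balance can differ from one system and workload to another.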

As I said, I actually can see a need to go lower than what performance
scaling thinks, because the way it tries to estimate the sufficient
capacity is by checking how much utilization is there for the
currently provided capacity and adjusting if necessary. OTOH, there
are applications aggressive enough to be able to utilize *any*
capacity provided to them.

>>>> I gather that there is some experience with the current EAS implementation
>>>> there, so I wonder how this work is related to that.
>>>
>>> You're right. We started developing a task boosting strategy a couple of
>>> years ago. The first implementation we did is what is currently in use
>>> by the EAS version used on Pixel smartphones.
>>>
>>> Since the beginning our attitude has always been "mainline first".
>>> However, we found it extremely valuable to prove both the interface's
>>> design and the feature's benefits on real devices. That's why we keep
>>> backporting these bits to different Android kernels.
>>>
>>> Google, whose primary representatives are in CC, is also quite focused
>>> on using mainline solutions for their current and future solutions.
>>> That's why, after the release of the Pixel devices end of last year,
>>> we refreshed and posted the proposal on LKML [1] and collected a first
>>> round of valuable feedback at LPC [2].
>>
>> Thanks for the info, but my question was more about how it was related
>> from the technical angle. IOW, there surely is some experience
>> related to how user space can deal with energy problems and I would
>> expect that experience to be an important factor in designing a kernel
>> interface for that user space, so I wonder if any particular needs of
>> the Android user space are addressed here.
>>
>> I'm not intimately familiar with Android, so I guess I would like to
>> be educated somewhat on that. :-)
>
> Hope this sheds some light into the Android side of things a bit.

Yes, it does, thanks!

Best regards,
Rafael

2017-03-21 11:10:46

by Patrick Bellasi

Subject: Re: [RFC v3 0/5] Add capacity capping support to the CPU controller

On 20-Mar 23:51, Rafael J. Wysocki wrote:
> On Thu, Mar 16, 2017 at 4:15 AM, Joel Fernandes <[email protected]> wrote:
> > Hi Rafael,
>
> Hi,
>
> > On Wed, Mar 15, 2017 at 6:04 PM, Rafael J. Wysocki <[email protected]> wrote:
> >> On Wed, Mar 15, 2017 at 1:59 PM, Patrick Bellasi
> >>>> Do you have any practical examples of that, like for example what exactly
> >>>> Android is going to use this for?
> >>>
> >>> In general, every "informed run-time" usually know quite a lot about
> >>> tasks requirements and how they impact the user experience.
> >>>
> >>> In Android for example tasks are classified depending on their _current_
> >>> role. We can distinguish for example between:
> >>>
> >>> - TOP_APP: which are tasks currently affecting the UI, i.e. part of
> >>> the app currently in foreground
> >>> - BACKGROUND: which are tasks not directly impacting the user
> >>> experience
> >>>
> >>> Given these information it could make sense to adopt different
> >>> service/optimization policy for different tasks.
> >>> For example, we can be interested in
> >>> giving maximum responsiveness to TOP_APP tasks while we still want to
> >>> be able to save as much energy as possible for the BACKGROUND tasks.
> >>>
> >>> That's where the proposal in this series (partially) comes on hand.
> >>
> >> A question: Does "responsiveness" translate directly to "capacity" somehow?
> >>
> >> Moreover, how exactly is "responsiveness" defined?
> >
> > Responsiveness is basically how quickly the UI is responding to user
> > interaction after doing its computation, application-logic and
> > rendering. Android apps have 2 important threads, the main thread (or
> > UI thread) which does all the work and computation for the app, and a
> > Render thread which does the rendering and submission of frames to
> > display pipeline for further composition and display.
> >
> > We wish to bias towards performance rather than energy for this work since
> > this is front-facing to the user and we don't care much about
> > energy for these tasks at this point. What's most critical is
> > completion as quickly as possible so the user experience doesn't
> > suffer from a performance issue that is noticeable.
> >
> > One metric to define this is "Jank" where we drop frames and aren't
> > able to render on time. One of the reasons this can happen is that the
> > main thread (UI thread) took longer than expected for some
> > computation. Whatever the interface - we'd just like to bias the
> > scheduling and frequency guidance to be more concerned with
> > performance and less with energy. And use this information for both
> > frequency selection and task placement. 'What we need' is also app
> > dependent since every app has its own main thread and is free to
> > compute whatever it needs. So Android can't estimate this - but we do
> > know that this app is user facing so in broad terms the interface is
> > used to say please don't sacrifice performance for these top-apps -
> > without accurately defining what these performance needs really are
> > because we don't know it.
> > For YouTube app for example, the complexity of the video decoding and
> > the frame rate are very variable depending on the encoding scheme and
> > the video being played. The flushing of the frames through the display
> > pipeline is also variable (frame rate depends on the video being
> > decoded), so this work is variable and we can't say for sure in
> > definitive terms how much capacity we need.
> >
> > What we can do with Patrick's work is take the worst case
> > based on measurements and specify, say, that we need at least this much
> > capacity regardless of what load-tracking thinks we need, and then we
> > can scale frequency accordingly. This is the use case for the minimum
> > capacity in his clamping patch. This is still not perfect in terms of
> > defining something accurately because we don't even know how much we
> > need, but at least in broad terms we have some way of telling the
> > governor to maintain at least X capacity.
>
> First off, it all seems to depend a good deal on what your
> expectations regarding the in-kernel performance scaling are.
>
> You seem to be expecting it to decide whether or not to sacrifice some
> performance for energy savings, but it can't do that really, simply
> because it has no guidance on that. It doesn't know how much
> performance (or capacity) it can trade for a given amount of energy,
> for example.

That's true, right now. But at ARM we have been working for a couple of
years to refine the concept of an energy model which improves the
scheduler's knowledge about the energy-vs-performance trade-off.

> What it can do and what I expect it to be doing is to avoid
> maintaining excess capacity (maintaining capacity is expensive in
> general and a clear waste if the capacity is not actually used).
>
> For instance, if you take the schedutil governor, it doesn't do
> anything really fancy. It just attempts to set a frequency sufficient
> to run the given workload without slowing it down artificially, but
> not much higher than that, and that's not based on any arcane
> energy-vs-performance considerations. It's based on an (arguably
> vague) idea about how fast should be sufficient.
>
> So if you want to say "please don't sacrifice performance for these
> top-apps" to it, chances are it will not understand what you are
> asking it for. :-)

Actually, this series is the foundation of a more complete
solution, already in use on Pixel phones.

While this proposal focuses just on "OPP biasing", some additional
bits (not yet posted to keep things simple) exploit the Energy Model
information to provide support for "task placement biasing".

Those bits address also the concept of:

how much energy am I willing to sacrifice to get a certain speedup?

> It only may take the minimum capacity limit for a task as a correction
> to its idea about how fast is sufficient in this particular case (and
> energy doesn't even enter the picture at this point). Now, of course,
> its idea about what should be sufficient may be entirely incorrect for
> some reason, but then the question really is: why? And whether or not
> it can be fixed without supplying corrections from user space in a
> very direct way.

- Why is the estimation incorrect?

Because, looking at CFS tasks for example, PELT is a "running
estimator". Its view about how much capacity a task needs changes
continuously over time. In short, it is missing an aggregation and
consolidation mechanism which would allow it to better exploit
information on a task's past activations.
We have a proposal to possibly fix that and we will post it soon.
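
To make the "running estimator" point concrete, here is a minimal
floating-point sketch of a PELT-like geometric average: one sample per
1024us period, with a decay factor y chosen so that y^32 = 0.5. The
helper names are hypothetical, and the kernel itself uses fixed-point
arithmetic, so this is only an approximation of the behavior.

```c
#include <assert.h>

/* Decay factor per 1024us period, chosen so that PELT_Y^32 = 0.5. */
#define PELT_Y 0.978572062

/* Hypothetical helper: advance the estimate by one period. */
static double pelt_step(double util, int running)
{
    double contrib = running ? 1024.0 * (1.0 - PELT_Y) : 0.0;

    return util * PELT_Y + contrib;
}

/* Hypothetical helper: advance by several periods in the same state. */
static double pelt_run(double util, int running, int periods)
{
    assert(periods >= 0);
    for (int i = 0; i < periods; i++)
        util = pelt_step(util, running);
    return util;
}
```

Starting from zero, a task that runs for 32 periods is estimated at
about 512. If it then sleeps for 32 periods the estimate decays to about
256. The value seen by the governor therefore depends on when it samples
the signal, not only on what the task needed during its last activation.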

However, it can still be that for a certain task you want to add some
"safety margin" to accommodate possible workload variations.
That's required even if you have perfect knowledge of the
requirements of a task, built entirely in kernel
space and based on past activations.
If your task is that important, you don't want to give it "just
enough". You need to know how much more to give it, and this
information can come only from user-space, where someone with more
information can use a properly defined API to feed it to the
scheduler using a per-task interface.

- Can it be fixed without corrections from user-space?

Not completely, more details hereafter.

> What you are saying generally indicates that you see under-provisioned
> tasks and that's rather not because the kernel tries to sacrifice
> performance for energy. Maybe the CPU utilization is under-estimated
> by schedutil or the scheduler doesn't give enough time to these
> particular tasks for some reason. In any case, having a way to set a
> limit from user space may allow you to work around these issues quite
> bluntly and is not a solution. And even if the underlying problems
> are solved, the user space interface will stay there and will have to
> be maintained going forward.

I don't agree on that point, mainly because I don't see that as a
workaround. In your view, it seems that everything can be solved
entirely in kernel space. In my view, instead, what we are after is a
properly defined interface where kernel-space and user-space can
potentially close a control loop where:
a) user-space, which has much more a-priori information about task
requirements, can feed some constraints to kernel-space.
b) kernel-space, which has optimized and efficient mechanisms, enforces
these constraints on a per-task basis.

After all, this is not a new concept in OS design: we already have
different interfaces which allow tuning scheduler behaviors on a
per-task basis. What we are missing right now is a similar _per-task
interface_ to bias OPP selection and a slightly improved/alternative
way to bias task placement _without_ doing scheduling decisions in
user-space.

Here is a graphical representation of these concepts:

    +-------------+    +-------------+    +-------------+
    | App1 Tasks  ++   | App2 Tasks  ++   | App3 Tasks  ++
    |             ||   |             ||   |             ||
    +--------------|   +--------------|   +--------------|
     +-------------+    +-------------+    +-------------+
            |                  |                  |
+----------------------------------------------------------+
|                                                          |
|      +--------------------------------------------+      |
|      |  +-------------------------------------+   |      |
|      |  |     Run-Time Optimized Services     |   |      |
|      |  |       (e.g. execution model)        |   |      |
|      |  +-------------------------------------+   |      |
|      |                                            |      |
|      |     Informed Run-Time Resource Manager     |      |
|      |  (Android, ChromeOS, Kubernetes, etc...)   |      |
|      +------------------------------------------^-+      |
|        |                                        |        |
|        |Constraints                             |        |
|        |(OPP and Task Placement biasing)        |        |
|        |                                        |        |
|        |                             Monitoring |        |
|      +-v------------------------------------------+      |
|      |                Linux Kernel                |      |
|      |        (Scheduler, schedutil, ...)         |      |
|      +--------------------------------------------+      |
|                                                          |
| Closed control and optimization loop                     |
+----------------------------------------------------------+

What is important to notice is that there is a middleware, in between
the kernel and the applications. This is a special kind of user-space
where it is still safe for the kernel to delegate some "decisions".

> Also when you set a minimum frequency limit from user space, you may
> easily over-provision the task and that would defeat the purpose of
> what the kernel tries to achieve.

No, if an "informed user-space" wants to over-provision a task it's
because it has already decided that it makes sense to limit the kernel
energy optimization for that specific class of tasks.
It is not necessarily kernel business to know why, it is just required
to do its best within the provided constraints.

> > For the clamping of maximum capacity, there are usecases like
> > background tasks like Patrick said, but also usecases where we don't
> > want to run at max frequency even though load-tracking thinks that we
> > need to. For example, there are case where for foreground camera
> > tasks, where we want to provide sustainable performance without
> > entering thermal throttling, so the capping will help there.
>
> Fair enough.
>
> To me, that case is more compelling than the previous one, but again
> I'm not sure if the ability to set a specific capacity limit may fit
> the bill entirely. You need to know what limit to set in the first
> place (and that may depend on multiple factors in principle) and then
> you may need to adjust it over time and so on.

Exactly and again, the informed run-time knows which limits to set,
on which tasks and when to change, update or tune them.

> >>> What we propose is a "standard" interface to collect sensible
> >>> information from "informed run-times" which can be used to:
> >>>
> >>> a) classify tasks according to the main optimization goals:
> >>> performance boosting vs energy saving
> >>>
> >>> b) support a more dynamic tuning of kernel side behaviors, mainly
> >>> OPPs selection and tasks placement
> >>>
> >>> Regarding this last point, this series specifically represents a
> >>> proposal for the integration with schedutil. The main usages we are
> >>> looking for in Android are:
> >>>
> >>> a) Boosting the OPP selected for certain critical tasks, with the goal
> >>> to speed-up their completion regardless of (potential) energy impacts.
> >>> A kind-of "race-to-idle" policy for certain tasks.
> >>
> >> It looks like this could be addressed by adding a "this task should
> >> race to idle" flag too.
> >
> > But he said 'kind-of' race-to-idle. Racing to idle all the time for
> > ex. at max frequency will be wasteful of energy so although we don't
> > care about energy much for top-apps, we do care a bit.
>
> You actually don't know whether or not it will be wasteful and there
> may even be differences from workload to workload on the same system
> in that respect.

The workload dependencies are solved by the "informed run-time",
that's why what we are proposing is a per-task interface.
Moreover, notice that most of the optimization can still be targeted
to services provided by the "informed run-time". Thus the dependencies
on the actual applications are kind-of limited and still they can be
factored in by properly defined interfaces exposed by the "informed
run-time".

> >>> b) Capping the OPP selection for certain non critical tasks, which is
> >>> a major concern especially for RT tasks in mobile context, but
> >>> it also applies to FAIR tasks representing background activities.
> >>
> >> Well, is the information on how much CPU capacity to assign to those
> >> tasks really there in user space? What's the source of it if so?
> >
> > I believe this is just a matter of tuning and modeling for what is
> > needed. For ex. to prevent thermal throttling as I mentioned and also
> > to ensure background activities aren't running at highest frequency
> > and consuming excessive energy (since racing to idle at higher
> > frequency is more expensive in energy than running slower to idle,
> > since we run at higher voltages at higher frequency and the slope of
> > the perf/W curve is steeper - p = c * V^2 * F). So the V component
> > being higher just drains more power quadratically, which is of no use
> > to background tasks - in fact in some tests, we're just as happy with
> > setting them at much lower frequencies than what load-tracking thinks
> > is needed.
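
To make the arithmetic in the paragraph above concrete, here is a minimal
sketch of the dynamic-power relation p = c * V^2 * F applied to a fixed
amount of work. The OPP table, the capacitance value and the function names
are all hypothetical, and the model deliberately ignores static (leakage)
and idle power, so it only illustrates the quoted argument and does not
settle whether racing to idle is wasteful on a given platform.

```python
# Hypothetical OPP table: (frequency in Hz, voltage in V). Not real hardware data.
OPPS = [(500e6, 0.70), (1000e6, 0.85), (2000e6, 1.10)]
C_EFF = 1e-9  # assumed effective switched capacitance, in farads


def dynamic_power(freq_hz, volt):
    """Dynamic power from p = c * V^2 * F, in watts."""
    return C_EFF * volt ** 2 * freq_hz


def energy_for_work(cycles, freq_hz, volt):
    """Energy in joules to retire `cycles` cycles: power times busy time."""
    busy_time_s = cycles / freq_hz
    return dynamic_power(freq_hz, volt) * busy_time_s


WORK_CYCLES = 1e9  # a fixed amount of background work (assumed)

for freq_hz, volt in OPPS:
    joules = energy_for_work(WORK_CYCLES, freq_hz, volt)
    print(f"{freq_hz / 1e6:6.0f} MHz @ {volt:.2f} V: {joules:.3f} J")
```

Since the busy time scales as 1/F, the dynamic energy for a fixed number of
cycles reduces to c * V^2 * cycles, so in this simplified model only the
voltage of the chosen OPP matters.
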
>
> As I said, I actually can see a need to go lower than what performance
> scaling thinks, because the way it tries to estimate the sufficient
> capacity is by checking how much utilization is there for the
> currently provided capacity and adjusting if necessary. OTOH, there
> are applications aggressive enough to be able to utilize *any*
> capacity provided to them.

Here you are not considering the control role exercised by the
middleware layer. Apps cannot really do whatever they want; they get
only what the "informed run-time" considers sufficient for them.

IOW, they live in a "managed user-space".

> >>>> I gather that there is some experience with the current EAS implementation
> >>>> there, so I wonder how this work is related to that.
> >>>
> >>> You're right. We started developing a task boosting strategy a couple of
> >>> years ago. The first implementation we did is what is currently in use
> >>> by the EAS version used on Pixel smartphones.
> >>>
> >>> Since the beginning our attitude has always been "mainline first".
> >>> However, we found it extremely valuable to prove both the interface's
> >>> design and the feature's benefits on real devices. That's why we keep
> >>> backporting these bits on different Android kernels.
> >>>
> >>> Google, whose primary representatives are in CC, is also quite focused
> >>> on using mainline solutions for their current and future solutions.
> >>> That's why, after the release of the Pixel devices end of last year,
> >>> we refreshed and posted the proposal on LKML [1] and collected a first
> >>> round of valuable feedback at LPC [2].
> >>
> >> Thanks for the info, but my question was more about how it was related
> >> from the technical angle. IOW, there surely is some experience
> >> related to how user space can deal with energy problems and I would
> >> expect that experience to be an important factor in designing a kernel
> >> interface for that user space, so I wonder if any particular needs of
> >> the Android user space are addressed here.
> >>
> >> I'm not intimately familiar with Android, so I guess I would like to
> >> be educated somewhat on that. :-)
> >
> > Hope this sheds some light into the Android side of things a bit.
>
> Yes, it does, thanks!

Interesting discussion, thanks! ;-)

> Best regards,
> Rafael

--
#include <best/regards.h>

Patrick Bellasi

2017-03-24 23:52:15

by Rafael J. Wysocki

Subject: Re: [RFC v3 0/5] Add capacity capping support to the CPU controller

On Tue, Mar 21, 2017 at 12:01 PM, Patrick Bellasi wrote:
> On 20-Mar 23:51, Rafael J. Wysocki wrote:

[cut]

>> So if you want to say "please don't sacrifice performance for these
>> top-apps" to it, chances are it will not understand what you are
>> asking it for. :-)
>
> Actually, this series is the foundation of a more complete
> solution, already in use on Pixel phones.
>
> While this proposal focuses just on "OPP biasing", some additional
> bits (not yet posted to keep things simple) exploit the Energy Model
> information to provide support for "task placement biasing".
>
> Those bits also address the question:
>
> how much energy do I want to sacrifice to get a certain speedup?

Well, OK, but this reads somewhat like "you can't appreciate that
fully, because you don't know the whole picture". :-)

Which very well may be the case and which is why I'm asking all of
these questions about the motivation etc.: I want to know the whole
picture, because I need context to make up my mind about this
particular part of it in a reasonable way.

[cut]

>> What you are saying generally indicates that you see under-provisioned
>> tasks and that's rather not because the kernel tries to sacrifice
>> performance for energy. Maybe the CPU utilization is under-estimated
>> by schedutil or the scheduler doesn't give enough time to these
>> particular tasks for some reason. In any case, having a way to set a
>> limit from user space may allow you to work around these issues quite
>> bluntly and is not a solution. And even if the underlying problems
>> are solved, the user space interface will stay there and will have to
>> be maintained going forward.
>
> I don't agree on that point, mainly because I don't see that as a
> workaround. In your view, it seems that everything can be solved
> entirely in kernel space.

Now, I haven't said that and it doesn't really reflect my view.

What I actually had in mind was that the particular problems mentioned
by Joel might very well be consequences of what the kernel did even
though it shouldn't be doing that. If so, then fixing the kernel may
eliminate the problems in question and there may be nothing left on
the table to address with the minimum capacity limit.

> In my view instead what we are after is a
> properly defined interface where kernel-space and user-space can
> potentially close a control loop where:
> a) user-space, which has much more a-priori information about task
> requirements, can feed some constraints to kernel-space.
> b) kernel-space, which has optimized and efficient mechanisms, enforces
> these constraints on a per-task basis.
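
A rough sketch of the a)/b) split above: user space supplies a
(capacity_min, capacity_max) pair per task, and the kernel side aggregates
the pairs of the tasks runnable on a CPU and clamps the utilization it
feeds to frequency selection. The aggregation policy (largest value wins
for both limits), the function names and the example numbers are
assumptions made for illustration, and the 25% headroom in next_freq() is
only modeled on what schedutil does; none of this is taken from the
patches themselves.

```python
SCHED_CAPACITY_SCALE = 1024  # full-scale CPU capacity


def cpu_clamps(runnable_tasks):
    """Aggregate per-task (cap_min, cap_max) pairs into one pair for a CPU.

    Assumed policy: the largest value among runnable tasks wins for both
    limits, so a boosted task is not held back by a capped neighbour.
    """
    if not runnable_tasks:
        return 0, SCHED_CAPACITY_SCALE  # no constraints: full range allowed
    cap_min = max(task[0] for task in runnable_tasks)
    cap_max = max(task[1] for task in runnable_tasks)
    return cap_min, cap_max


def clamped_util(util, runnable_tasks):
    """Clamp the tracked utilization into the CPU's active constraints."""
    cap_min, cap_max = cpu_clamps(runnable_tasks)
    return min(max(util, cap_min), cap_max)


def next_freq(util, max_freq_khz):
    """Map utilization to a frequency with 25% headroom, capped at the max."""
    wanted = int(1.25 * max_freq_khz * util / SCHED_CAPACITY_SCALE)
    return min(max_freq_khz, wanted)


boosted = (512, 1024)    # hypothetical top-app task: at least half capacity
background = (0, 300)    # hypothetical background task: capped low

print(clamped_util(200, [boosted]))       # boosted up to the minimum
print(clamped_util(900, [background]))    # capped down to the maximum
print(next_freq(clamped_util(200, [boosted]), 2000000))
```

With this policy a capped background task stops limiting the CPU as soon
as a boosted task becomes runnable next to it, which is one possible
reading of "per-task" constraints; other aggregation rules are equally
conceivable.
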

I can agree in principle that *some* kind of interface between the
kernel and user space would be good to have in this area, but I'm not
quite sure about what that interface should look like.

It seems that what needs to be passed is information on what user
space regards as a reasonable energy-for-performance tradeoff,
per-task or overall.

I'm not convinced about the suitability of min/max capacity for this
purpose in general.

> After all, this is not a new concept in OS design: we already have
> different interfaces which allow tuning scheduler behaviors on a
> per-task basis. What we are missing right now is a similar _per-task
> interface_ to bias OPP selection and a slightly improved/alternative
> way to bias task placement _without_ doing scheduling decisions in
> user-space.
>
> Here is a graphical representation of these concepts:
>
>   +-------------+  +-------------+  +-------------+
>   | App1 Tasks  ++ | App2 Tasks  ++ | App3 Tasks  ++
>   |             || |             || |             ||
>   +--------------| +--------------| +--------------|
>    +-------------+  +-------------+  +-------------+
>          |                |                |
>   +----------------------------------------------------------+
>   |                                                          |
>   |      +--------------------------------------------+      |
>   |      |  +-------------------------------------+   |      |
>   |      |  |     Run-Time Optimized Services     |   |      |
>   |      |  |       (e.g. execution model)        |   |      |
>   |      |  +-------------------------------------+   |      |
>   |      |                                            |      |
>   |      |     Informed Run-Time Resource Manager     |      |
>   |      |  (Android, ChromeOS, Kubernetes, etc...)   |      |
>   |      +------------------------------------------^-+      |
>   |        |                                        |        |
>   |        |Constraints                             |        |
>   |        |(OPP and Task Placement biasing)        |        |
>   |        |                                        |        |
>   |        |                             Monitoring |        |
>   |      +-v------------------------------------------+      |
>   |      |                Linux Kernel                |      |
>   |      |        (Scheduler, schedutil, ...)         |      |
>   |      +--------------------------------------------+      |
>   |                                                          |
>   | Closed control and optimization loop                     |
>   +----------------------------------------------------------+
>
> What is important to notice is that there is a middleware, in between
> the kernel and the applications. This is a special kind of user-space
> where it is still safe for the kernel to delegate some "decisions".

So having spent a good part of the last 10 years on writing kernel
code that, among other things, talks to these middlewares (like
autosleep and the support for wakelocks for an obvious example), I'm
quite aware of all that and also quite familiar with the diagram
above.

And while I don't want to start a discussion about whether or not
these middlewares are really as smart as the claims go, let me share a
personal opinion here. In my experience, they usually tend to be
quite well-informed about the applications shipped along with them,
but not so much about stuff installed by users later, which sometimes
ruins the party like a motorcycle gang dropping in without invitation.

>> Also when you set a minimum frequency limit from user space, you may
>> easily over-provision the task and that would defeat the purpose of
>> what the kernel tries to achieve.
>
> No, if an "informed user-space" wants to over-provision a task it's
> because it has already decided that it makes sense to limit the kernel
> energy optimization for that specific class of tasks.
> It is not necessarily kernel business to know why, it is just required
> to do its best within the provided constraints.

My point is that if user space sets the limit to over-provision a
task, then having the kernel do the whole work to prevent that from
happening is rather pointless.

[cut]

>
>> >>> b) Capping the OPP selection for certain non critical tasks, which is
>> >>> a major concern especially for RT tasks in mobile context, but
>> >>> it also applies to FAIR tasks representing background activities.
>> >>
>> >> Well, is the information on how much CPU capacity to assign to those
>> >> tasks really there in user space? What's the source of it if so?
>> >
>> > I believe this is just a matter of tuning and modeling for what is
>> > needed. For ex. to prevent thermal throttling as I mentioned and also
>> > to ensure background activities aren't running at highest frequency
>> > and consuming excessive energy (since racing to idle at higher
>> > frequency is more expensive in energy than running slower to idle,
>> > since we run at higher voltages at higher frequency and the slope of
>> > the perf/W curve is steeper - p = c * V^2 * F). So the V component
>> > being higher just drains more power quadratically, which is of no use
>> > to background tasks - in fact in some tests, we're just as happy with
>> > setting them at much lower frequencies than what load-tracking thinks
>> > is needed.
>>
>> As I said, I actually can see a need to go lower than what performance
>> scaling thinks, because the way it tries to estimate the sufficient
>> capacity is by checking how much utilization is there for the
>> currently provided capacity and adjusting if necessary. OTOH, there
>> are applications aggressive enough to be able to utilize *any*
>> capacity provided to them.
>
> Here you are not considering the control role exercised by the
> middleware layer.

Indeed. I was describing what happened without it. :-)

[cut]

>
> Interesting discussion, thanks! ;-)

Yup, thanks!

Take care,
Rafael