LinuxLists.cc - Re: [PATCH RESEND v9 05/10] sched: make scale

2015-02-19 16:52:55

Subject: Re: [PATCH RESEND v9 05/10] sched: make scale_rt invariant with frequency

On Thu, Jan 15, 2015 at 11:09:25AM +0100, Vincent Guittot wrote:
> The average running time of RT tasks is used to estimate the remaining compute
> capacity for CFS tasks. This remaining capacity is the original capacity scaled
> down by a factor (aka scale_rt_capacity). This estimation of available capacity
> must also be invariant with frequency scaling.
>
> A frequency scaling factor is applied on the running time of the RT tasks for
> computing scale_rt_capacity.
>
> In sched_rt_avg_update, we scale the RT execution time like below:
> rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT
>
> Then, scale_rt_capacity can be summarized by:
> scale_rt_capacity = SCHED_CAPACITY_SCALE -
> ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / period)
>
> We can optimize by removing right and left shift in the computation of rq->rt_avg
> and scale_rt_capacity

So far so good..

> The call to arch_scale_frequency_capacity in the rt scheduling path might be
> a concern for RT folks because I'm not sure whether we can rely on
> arch_scale_freq_capacity to be short and efficient ?

No, that is, arch_scale_frequency_capacity() _must_ be short and
efficient, event for the fair class, its called in very hot paths.

I think we've talked about this before; this function should basically
only return a cached value, which is periodically updated through some
means.

But lets see, I've yet to see an actual implementation of it; and its
got that sd argument, curious what you're going to do with that.

> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 17 +++++------------
> kernel/sched/sched.h | 4 +++-
> 2 files changed, 8 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a5039da..b37c27b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c

> @@ -5801,19 +5801,12 @@ static unsigned long scale_rt_capacity(int cpu)
>
> total = sched_avg_period() + delta;
>
> - if (unlikely(total < avg)) {
> - /* Ensures that capacity won't end up being negative */
> - available = 0;
> - } else {
> - available = total - avg;
> - }
> + used = div_u64(avg, total);
>
> - if (unlikely((s64)total < SCHED_CAPACITY_SCALE))
> - total = SCHED_CAPACITY_SCALE;
> + if (likely(used < SCHED_CAPACITY_SCALE))
> + return SCHED_CAPACITY_SCALE - used;
>
> - total >>= SCHED_CAPACITY_SHIFT;
> -
> - return div_u64(available, total);
> + return 1;
> }

This makes a lot of changes not commented on in the Changelog.

I think it makes sense, but this has always been horrible.

2015-02-19 17:17:38

by Morten Rasmussen

[permalink] [raw]

Subject: Re: [PATCH RESEND v9 05/10] sched: make scale_rt invariant with frequency

On Thu, Feb 19, 2015 at 04:52:41PM +0000, Peter Zijlstra wrote:
> On Thu, Jan 15, 2015 at 11:09:25AM +0100, Vincent Guittot wrote:
> > The average running time of RT tasks is used to estimate the remaining compute
> > capacity for CFS tasks. This remaining capacity is the original capacity scaled
> > down by a factor (aka scale_rt_capacity). This estimation of available capacity
> > must also be invariant with frequency scaling.
> >
> > A frequency scaling factor is applied on the running time of the RT tasks for
> > computing scale_rt_capacity.
> >
> > In sched_rt_avg_update, we scale the RT execution time like below:
> > rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT
> >
> > Then, scale_rt_capacity can be summarized by:
> > scale_rt_capacity = SCHED_CAPACITY_SCALE -
> > ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / period)
> >
> > We can optimize by removing right and left shift in the computation of rq->rt_avg
> > and scale_rt_capacity
>
> So far so good..
>
> > The call to arch_scale_frequency_capacity in the rt scheduling path might be
> > a concern for RT folks because I'm not sure whether we can rely on
> > arch_scale_freq_capacity to be short and efficient ?
>
> No, that is, arch_scale_frequency_capacity() _must_ be short and
> efficient, event for the fair class, its called in very hot paths.

... and very frequently too.

> I think we've talked about this before; this function should basically
> only return a cached value, which is periodically updated through some
> means.

Agreed. I think it is reasonable to assume that the arch code
implementing arch_scale_freq_capacity() does it's best to make it fast
for the particular architecture. Since the scaling factor to be returned
by the function may be obtained in different ways for different
architectures the caching should be done on the arch side.

> But lets see, I've yet to see an actual implementation of it; and its
> got that sd argument, curious what you're going to do with that.

So we do have an RFC implementation for ARM already which I posted in
December and is also included in the rather large RFC posting I did some
weeks ago. That one basically reads two atomic variables and returns the
ratio between the two. I have yet to benchmark how horribly expensive it
is though. The sd argument is ignored. We might actually not need it at
all?

2015-02-24 10:22:19

by Vincent Guittot

[permalink] [raw]

Subject: Re: [PATCH RESEND v9 05/10] sched: make scale_rt invariant with frequency

On 19 February 2015 at 18:18, Morten Rasmussen <[email protected]> wrote:
> On Thu, Feb 19, 2015 at 04:52:41PM +0000, Peter Zijlstra wrote:
>> On Thu, Jan 15, 2015 at 11:09:25AM +0100, Vincent Guittot wrote:
>> > The average running time of RT tasks is used to estimate the remaining compute
>> > capacity for CFS tasks. This remaining capacity is the original capacity scaled
>> > down by a factor (aka scale_rt_capacity). This estimation of available capacity
>> > must also be invariant with frequency scaling.
>> >
>> > A frequency scaling factor is applied on the running time of the RT tasks for
>> > computing scale_rt_capacity.
>> >
>> > In sched_rt_avg_update, we scale the RT execution time like below:
>> > rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT
>> >
>> > Then, scale_rt_capacity can be summarized by:
>> > scale_rt_capacity = SCHED_CAPACITY_SCALE -
>> > ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / period)
>> >
>> > We can optimize by removing right and left shift in the computation of rq->rt_avg
>> > and scale_rt_capacity
>>
>> So far so good..
>>
>> > The call to arch_scale_frequency_capacity in the rt scheduling path might be
>> > a concern for RT folks because I'm not sure whether we can rely on
>> > arch_scale_freq_capacity to be short and efficient ?
>>
>> No, that is, arch_scale_frequency_capacity() _must_ be short and
>> efficient, event for the fair class, its called in very hot paths.
>
> ... and very frequently too.
>
>> I think we've talked about this before; this function should basically
>> only return a cached value, which is periodically updated through some
>> means.
>
> Agreed. I think it is reasonable to assume that the arch code
> implementing arch_scale_freq_capacity() does it's best to make it fast
> for the particular architecture. Since the scaling factor to be returned
> by the function may be obtained in different ways for different
> architectures the caching should be done on the arch side.
>
>> But lets see, I've yet to see an actual implementation of it; and its
>> got that sd argument, curious what you're going to do with that.
>
> So we do have an RFC implementation for ARM already which I posted in
> December and is also included in the rather large RFC posting I did some
> weeks ago. That one basically reads two atomic variables and returns the
> ratio between the two. I have yet to benchmark how horribly expensive it
> is though. The sd argument is ignored. We might actually not need it at
> all?

For consistency across all arch_scale_xx_capacity, i would prefer to
keep the same prototype interface (struct sched_domain *sd, int cpu)
even if it's not used ofr now

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2015-02-24 11:33:35

by Dietmar Eggemann

[permalink] [raw]

Subject: Re: [PATCH RESEND v9 05/10] sched: make scale_rt invariant with frequency

On 24/02/15 10:21, Vincent Guittot wrote:
> On 19 February 2015 at 18:18, Morten Rasmussen <[email protected]> wrote:
>> On Thu, Feb 19, 2015 at 04:52:41PM +0000, Peter Zijlstra wrote:
>>> On Thu, Jan 15, 2015 at 11:09:25AM +0100, Vincent Guittot wrote:

[...]

>> Agreed. I think it is reasonable to assume that the arch code
>> implementing arch_scale_freq_capacity() does it's best to make it fast
>> for the particular architecture. Since the scaling factor to be returned
>> by the function may be obtained in different ways for different
>> architectures the caching should be done on the arch side.
>>
>>> But lets see, I've yet to see an actual implementation of it; and its
>>> got that sd argument, curious what you're going to do with that.
>>
>> So we do have an RFC implementation for ARM already which I posted in
>> December and is also included in the rather large RFC posting I did some
>> weeks ago. That one basically reads two atomic variables and returns the
>> ratio between the two. I have yet to benchmark how horribly expensive it
>> is though. The sd argument is ignored. We might actually not need it at
>> all?
>
> For consistency across all arch_scale_xx_capacity, i would prefer to
> keep the same prototype interface (struct sched_domain *sd, int cpu)
> even if it's not used ofr now

Agreed.

Once we call arch_scale_xx_capacity [xx = freq, cpu] in the PELT code
(i.e. w/ sd = NULL) we have to make sure that
default_scale_cpu_capacity() can be called w/ sd = NULL too for
platforms which are not providing their own arch_scale_cpu_capacity()
function.

It's already part of '[RFCv3 PATCH 00/48] sched: Energy cost model for
energy-aware scheduling'

https://lkml.org/lkml/2015/2/4/573

[...]