2014-01-06 13:41:53

by Vincent Guittot

Subject: Re: [RFC] sched: CPU topology try

On 23 December 2013 18:22, Dietmar Eggemann <[email protected]> wrote:
> Hi Vincent,
>
>
> On 18/12/13 14:13, Vincent Guittot wrote:
>>
>> This patch applies on top of the two patches [1][2] that have been
>> proposed by
>> Peter for creating a new way to initialize sched_domain. It includes some
>> minor
>> compilation fixes and a trial of using this new method on ARM platform.
>> [1] https://lkml.org/lkml/2013/11/5/239
>> [2] https://lkml.org/lkml/2013/11/5/449
>
>
> I came up w/ a similar implementation proposal for an arch specific
> interface for scheduler domain set-up a couple of days ago:
>
> [1] https://lkml.org/lkml/2013/12/13/182
>
> I had the following requirements in mind:
>
> 1) The arch should not be able to fine tune individual scheduler behaviour,
> i.e. get rid of the arch specific SD_FOO_INIT macros.
>
> 2) Unify the set-up code for conventional and NUMA scheduler domains.
>
> 3) The arch is able to specify additional scheduler domain level, other than
> SMT, MC, BOOK, and CPU.
>
> 4) Allow to integrate the provision of additional topology related data
> (e.g. energy information) to the scheduler.
>
> Moreover, I think now that:
>
> 5) Something like the existing default set-up via default_topology[] is
> needed to avoid code duplication for archs not interested in (3) or (4).

Hi Dietmar,

I agree. This default array is available in Peter's patch, and my patch
overwrites the default array only if it wants to add more/new levels.
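
For readers who don't have the patches at hand, a minimal sketch of what such
an arch override can look like. The shape below follows the form this
interface eventually took in arch/arm/kernel/topology.c; the exact struct
layout in the RFC under discussion may differ, so treat the field layout as
illustrative:

  /* Sketch: arch-specific topology table overriding the default one.
   * cpu_corepower_mask picks the cpus sharing a power domain inside a
   * cluster; the GMC level carries the extra SD_SHARE_POWERDOMAIN flag. */
  static inline int cpu_corepower_flags(void)
  {
          return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
  }

  static struct sched_domain_topology_level arm_topology[] = {
  #ifdef CONFIG_SCHED_MC
          { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
          { cpu_coregroup_mask, cpu_core_flags,      SD_INIT_NAME(MC)  },
  #endif
          { cpu_cpu_mask, SD_INIT_NAME(DIE) },
          { NULL, },
  };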

[snip]

>>
>> CPU2:
>>  domain 0: span 2-3 level: SMT
>>   flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>>   groups: 0 1
>>  domain 1: span 2-7 level: MC
>>   flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>>   groups: 2-7 4-5 6-7
>>  domain 2: span 0-7 level: MC
>>   flags: SD_SHARE_PKG_RESOURCES
>>   groups: 2-7 0-1
>>  domain 3: span 0-15 level: CPU
>>   flags:
>>   groups: 0-7 8-15
>>
>> In this case, we have an additional sched_domain MC level for this
>> subset (2-7) of cores, so we can trigger some load balancing in this
>> subset before doing it on the complete cluster (which is the last level
>> of cache in my example).
>
>
> I think the weakest point right now is the condition in sd_init() where we
> convert the topology flags into scheduler behaviour. We not only introduce a
> very tight coupling between topology flags and scheduler domain level but
> also we need to follow a certain order in the initialization. This bit needs
> more thinking.

IMHO, these settings will disappear sooner or later, as an example the
idle/busy _idx are going to be removed by Alex's patch.

>
>
>>
>> We can add more levels that will describe other dependency/independency
>> like
>> the frequency scaling dependency and as a result the final sched_domain
>> topology will have additional levels (if they have not been removed during
>> the degenerate sequence)
>>
>> My concern is about the configuration of the table that is used to create
>> the sched_domains. Some levels are "duplicated" with different flag
>> configurations, which makes the table not easily readable, and we must
>> also take care of the order, because a parent has to gather all the cpus
>> of its children. So we must choose which capabilities will be a subset of
>> the others. The order is almost straightforward when we describe 1 or 2
>> kinds of capabilities (package resource sharing and power sharing), but
>> it can become complex if we want to add more.
>
>
> I'm not sure if the idea to create a dedicated sched_domain level for every
> topology flag representing a specific functionality will scale. From the

It's up to the arch to decide how many levels it wants to add, whether a
dedicated level is needed, or whether one level can gather several
features/flags. IMHO, having sub-structs for energy information, like
what we have for the cpu/group capacity, will not prevent us from having
a first, quick topology tree description.

> perspective of energy-aware scheduling we need e.g. energy costs (P and C
> state) which can only be populated towards the scheduler via an additional
> sub-struct and additional function arch_sd_energy() like depicted in
> Morten's email:
>
> [2] lkml.org/lkml/2013/11/14/102
>

[snip]

>> +
>> +static int __init arm_sched_topology(void)
>> +{
>> + sched_domain_topology = arm_topology;
>
>
> return missing

good catch
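
For completeness, the fixed-up hunk would presumably end up as something like
the sketch below; how it gets registered (e.g. via early_initcall()) is an
assumption about the RFC rather than quoted code:

  static int __init arm_sched_topology(void)
  {
          sched_domain_topology = arm_topology;
          return 0;       /* the missing return Dietmar points out above */
  }
  early_initcall(arm_sched_topology);   /* assumed registration point */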

Thanks

Vincent


2014-01-06 16:31:54

by Peter Zijlstra

Subject: Re: [RFC] sched: CPU topology try

On Mon, Jan 06, 2014 at 02:41:31PM +0100, Vincent Guittot wrote:
> IMHO, these settings will disappear sooner or later, as an example the
> idle/busy _idx are going to be removed by Alex's patch.

Well I'm still entirely unconvinced by them..

removing the cpu_load array makes sense, but I'm starting to doubt the
removal of the _idx things.. I think we want to retain them in some
form, it simply makes sense to look at longer term averages when looking
at larger CPU groups.

So maybe we can express the things in log_2(group-span) or so, but we
need a working replacement for the cpu_load array. Ideally some
expression involving the blocked load.

Its another one of those things I need to ponder more :-)

2014-01-07 08:32:26

by Vincent Guittot

Subject: Re: [RFC] sched: CPU topology try

On 6 January 2014 17:31, Peter Zijlstra <[email protected]> wrote:
> On Mon, Jan 06, 2014 at 02:41:31PM +0100, Vincent Guittot wrote:
>> IMHO, these settings will disappear sooner or later, as an example the
>> idle/busy _idx are going to be removed by Alex's patch.
>
> Well I'm still entirely unconvinced by them..
>
> removing the cpu_load array makes sense, but I'm starting to doubt the
> removal of the _idx things.. I think we want to retain them in some
> form, it simply makes sense to look at longer term averages when looking
> at larger CPU groups.
>
> So maybe we can express the things in log_2(group-span) or so, but we
> need a working replacement for the cpu_load array. Ideally some
> expression involving the blocked load.

Using the blocked load can surely give a benefit in load balancing,
because it gives a view of the potential load on a core, but it still
decays at the same speed as the runnable load average, so it doesn't
solve the issue of a longer-term average. One way is to have a runnable
average load with a longer time window.

>
> Its another one of those things I need to ponder more :-)

2014-01-07 13:22:52

by Peter Zijlstra

Subject: Re: [RFC] sched: CPU topology try

On Tue, Jan 07, 2014 at 09:32:04AM +0100, Vincent Guittot wrote:
> On 6 January 2014 17:31, Peter Zijlstra <[email protected]> wrote:
> > On Mon, Jan 06, 2014 at 02:41:31PM +0100, Vincent Guittot wrote:
> >> IMHO, these settings will disappear sooner or later, as an example the
> >> idle/busy _idx are going to be removed by Alex's patch.
> >
> > Well I'm still entirely unconvinced by them..
> >
> > removing the cpu_load array makes sense, but I'm starting to doubt the
> > removal of the _idx things.. I think we want to retain them in some
> > form, it simply makes sense to look at longer term averages when looking
> > at larger CPU groups.
> >
> > So maybe we can express the things in log_2(group-span) or so, but we
> > need a working replacement for the cpu_load array. Ideally some
> > expression involving the blocked load.
>
> Using the blocked load can surely give benefit in the load balance
> because it gives a view of potential load on a core but it still decay
> with the same speed than runnable load average so it doesn't solve the
> issue for longer term average. One way is to have a runnable average
> load with longer time window

Ah, another way of looking at it is that the avg without blocked
component is a 'now' picture. It is the load we are concerned with right
now.

The more blocked we add the further out we look; with the obvious limit
of the entire averaging period.

So the avg that is runnable is right now, t_0; the avg that is runnable +
blocked is t_0 + p, where p is the avg period over which we expect the
blocked contribution to appear.

So something like:

avg = runnable + p(i) * blocked; where p(i) \e [0,1]

could maybe be used to replace the cpu_load array and still represent
the concept of looking at a bigger picture for larger sets. Leaving open
the details of the map p.
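
As a rough illustration only (not code from the thread): per sched_domain
level the blend could look like the snippet below, where the p_scale[] map is
entirely made up and fixed point 1024 stands for 1.0:

  /* Illustrative sketch: blend blocked load in progressively with the
   * sched_domain level instead of indexing a cpu_load[] array. */
  static const unsigned int p_scale[] = { 0, 256, 512, 768, 1024 };

  static unsigned long level_load(unsigned long runnable_avg,
                                  unsigned long blocked_avg,
                                  unsigned int level)
  {
          unsigned int idx = level < 4 ? level : 4;

          /* avg = runnable + p(level) * blocked */
          return runnable_avg + (blocked_avg * p_scale[idx]) / 1024;
  }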

2014-01-07 14:11:35

by Peter Zijlstra

Subject: Re: [RFC] sched: CPU topology try

On Tue, Jan 07, 2014 at 02:22:20PM +0100, Peter Zijlstra wrote:

I just realized there's two different p's in there.

> Ah, another way of looking at it is that the avg without blocked
> component is a 'now' picture. It is the load we are concerned with right
> now.
>
> The more blocked we add the further out we look; with the obvious limit
> of the entire averaging period.
>
> So the avg that is runnable is right now, t_0; the avg that is runnable +
> blocked is t_0 + p, where p is the avg period over which we expect the
> blocked contribution to appear.

So the above p for period, is unrelated to the below p which is a
probability function.

> So something like:
>
> avg = runnable + p(i) * blocked; where p(i) \e [0,1]
>
> could maybe be used to replace the cpu_load array and still represent
> the concept of looking at a bigger picture for larger sets. Leaving open
> the details of the map p.

We probably want to assume task wakeup is constant over time, so p (our
probability function) should probably be an exponential distribution.
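
One possible formalisation of that (not spelled out in the mail): with a
constant wakeup rate lambda, the fraction of the blocked contribution expected
back within a look-ahead horizon t_i is the exponential CDF, giving per level i:

  p(t_i) = 1 - e^{-\lambda t_i}, \qquad
  \mathrm{avg}_i = \mathrm{runnable} + \bigl(1 - e^{-\lambda t_i}\bigr)\cdot\mathrm{blocked}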

2014-01-07 14:11:48

by Vincent Guittot

Subject: Re: [RFC] sched: CPU topology try

On 7 January 2014 14:22, Peter Zijlstra <[email protected]> wrote:
> On Tue, Jan 07, 2014 at 09:32:04AM +0100, Vincent Guittot wrote:
>> On 6 January 2014 17:31, Peter Zijlstra <[email protected]> wrote:
>> > On Mon, Jan 06, 2014 at 02:41:31PM +0100, Vincent Guittot wrote:
>> >> IMHO, these settings will disappear sooner or later, as an example the
>> >> idle/busy _idx are going to be removed by Alex's patch.
>> >
>> > Well I'm still entirely unconvinced by them..
>> >
>> > removing the cpu_load array makes sense, but I'm starting to doubt the
>> > removal of the _idx things.. I think we want to retain them in some
>> > form, it simply makes sense to look at longer term averages when looking
>> > at larger CPU groups.
>> >
>> > So maybe we can express the things in log_2(group-span) or so, but we
>> > need a working replacement for the cpu_load array. Ideally some
>> > expression involving the blocked load.
>>
>> Using the blocked load can surely give benefit in the load balance
>> because it gives a view of potential load on a core but it still decay
>> with the same speed than runnable load average so it doesn't solve the
>> issue for longer term average. One way is to have a runnable average
>> load with longer time window
>
> Ah, another way of looking at it is that the avg without blocked
> component is a 'now' picture. It is the load we are concerned with right
> now.
>
> The more blocked we add the further out we look; with the obvious limit
> of the entire averaging period.
>
> So the avg that is runnable is right now, t_0; the avg that is runnable +
> blocked is t_0 + p, where p is the avg period over which we expect the
> blocked contribution to appear.
>
> So something like:
>
> avg = runnable + p(i) * blocked; where p(i) \e [0,1]
>
> could maybe be used to replace the cpu_load array and still represent
> the concept of looking at a bigger picture for larger sets. Leaving open
> the details of the map p.

That needs to be studied more deeply, but it could be a way to get a
larger picture.

Another point is that we are using the runnable and blocked load
averages, which are the sums of the tasks' load_avg_contrib, but we are
not using the runnable_avg_sum of the cpus, which is not the 'now'
picture but an average of the past running time (without taking task
weight into account).
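
For reference, that per-cpu signal is (in 3.13-era terms) roughly the ratio
sketched below; it tracks how busy the cpu has been, independent of task
weights (the helper name is made up):

  /* Sketch: per-cpu running-time fraction from the rq-wide tracking,
   * scaled to 0..1024, ignoring task weights entirely. */
  static inline u32 rq_running_fraction(struct rq *rq)
  {
          return (rq->avg.runnable_avg_sum * 1024) /
                 (rq->avg.runnable_avg_period + 1);
  }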

Vincent

2014-01-07 15:37:19

by Morten Rasmussen

Subject: Re: [RFC] sched: CPU topology try

On Tue, Jan 07, 2014 at 02:11:22PM +0000, Vincent Guittot wrote:
> On 7 January 2014 14:22, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Jan 07, 2014 at 09:32:04AM +0100, Vincent Guittot wrote:
> >> On 6 January 2014 17:31, Peter Zijlstra <[email protected]> wrote:
> >> > On Mon, Jan 06, 2014 at 02:41:31PM +0100, Vincent Guittot wrote:
> >> >> IMHO, these settings will disappear sooner or later, as an example the
> >> >> idle/busy _idx are going to be removed by Alex's patch.
> >> >
> >> > Well I'm still entirely unconvinced by them..
> >> >
> >> > removing the cpu_load array makes sense, but I'm starting to doubt the
> >> > removal of the _idx things.. I think we want to retain them in some
> >> > form, it simply makes sense to look at longer term averages when looking
> >> > at larger CPU groups.
> >> >
> >> > So maybe we can express the things in log_2(group-span) or so, but we
> >> > need a working replacement for the cpu_load array. Ideally some
> >> > expression involving the blocked load.
> >>
> >> Using the blocked load can surely give benefit in the load balance
> >> because it gives a view of potential load on a core but it still decay
> >> with the same speed than runnable load average so it doesn't solve the
> >> issue for longer term average. One way is to have a runnable average
> >> load with longer time window

The blocked load discussion comes up again :)

I totally agree that blocked load would be useful, but only if we get
the priority problem sorted out. Blocked load is the sum of load_contrib
of blocked tasks, which means that a tiny high priority task can have a
massive contribution to the blocked load.

> >
> > Ah, another way of looking at it is that the avg without blocked
> > component is a 'now' picture. It is the load we are concerned with right
> > now.
> >
> > The more blocked we add the further out we look; with the obvious limit
> > of the entire averaging period.
> >
> > So the avg that is runnable is right now, t_0; the avg that is runnable +
> > blocked is t_0 + p, where p is the avg period over which we expect the
> > blocked contribution to appear.
> >
> > So something like:
> >
> > avg = runnable + p(i) * blocked; where p(i) \e [0,1]
> >
> > could maybe be used to replace the cpu_load array and still represent
> > the concept of looking at a bigger picture for larger sets. Leaving open
> > the details of the map p.

Figuring out p is the difficult bit. AFAIK, with blocked load in its
current form we don't have any clue when a task will reappear.

>
> That needs to be studied more deeply but that could be a way to have a
> larger picture

Agree.

>
> Another point is that we are using runnable and blocked load average
> which are the sum of load_avg_contrib of tasks but we are not using
> the runnable_avg_sum of the cpus which is not the now picture but a
> average of the past running time (without taking into account task
> weight)

Yes. The rq runnable_avg_sum is an excellent longer term load indicator.
It can't be compared with the runnable and blocked load though. The
other alternative that I can think of is to introduce an unweighted
alternative to blocked load. That is, sum of load_contrib/priority.
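
A sketch of what that could look like, using the 3.13-era per-entity fields;
the helper name and the NICE_0_LOAD scaling are assumptions, not code from
any posted patch:

  /* Sketch: unweighted analogue of a task's load_avg_contrib -- scale by
   * NICE_0_LOAD instead of the task weight so nice levels drop out. */
  static inline unsigned long task_contrib_unweighted(struct sched_entity *se)
  {
          u32 contrib;

          contrib = se->avg.runnable_avg_sum * scale_load_down(NICE_0_LOAD);
          contrib /= (se->avg.runnable_avg_period + 1);
          return contrib;
  }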

Morten

2014-01-07 15:42:03

by Morten Rasmussen

Subject: Re: [RFC] sched: CPU topology try

On Tue, Jan 07, 2014 at 02:10:59PM +0000, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 02:22:20PM +0100, Peter Zijlstra wrote:
>
> I just realized there's two different p's in there.
>
> > Ah, another way of looking at it is that the avg without blocked
> > component is a 'now' picture. It is the load we are concerned with right
> > now.
> >
> > The more blocked we add the further out we look; with the obvious limit
> > of the entire averaging period.
> >
> > So the avg that is runnable is right now, t_0; the avg that is runnable +
> > blocked is t_0 + p, where p is the avg period over which we expect the
> > blocked contribution to appear.
>
> So the above p for period, is unrelated to the below p which is a
> probability function.
>
> > So something like:
> >
> > avg = runnable + p(i) * blocked; where p(i) \e [0,1]
> >
> > could maybe be used to replace the cpu_load array and still represent
> > the concept of looking at a bigger picture for larger sets. Leaving open
> > the details of the map p.
>
> We probably want to assume task wakeup is constant over time, so p (our
> probability function) should probably be an exponential distribution.

Ah, makes more sense now.

You propose that we don't actually try to keep track of which tasks
might wake up when, but just factor in more and more of the blocked load
depending on how conservative a load estimate we want?

I think that could work if we sort out the priority scaling issue that I
mentioned before.

2014-01-07 20:50:22

by Peter Zijlstra

Subject: Re: [RFC] sched: CPU topology try

On Tue, Jan 07, 2014 at 03:41:54PM +0000, Morten Rasmussen wrote:
> I think that could work if we sort of the priority scaling issue that I
> mentioned before.

We talked a bit about this on IRC a month or so ago, right? My memories
from that are that your main complaint is that we don't detect the
overload scenario right.

That is; the point at which we should start caring about SMP-nice is
when all our CPUs are fully occupied, because up to that point we're
under utilized and work preservation mandates we utilize idle time.

Currently we detect overload by sg.nr_running >= sg.capacity, which can
be very misleading because while a cpu might have a task running 'now'
it might be 99% idle.

At which point I argued we should change the capacity thing anyhow. Ever
since the runnable_avg patch set I've been arguing to change that into
an actual utilization test.

So I think that if we measure overload by something like >95% utilization
on the entire group the load scaling again makes perfect sense.
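
A sketch of what such a test might look like (illustrative only; 'util' here
stands for an unweighted group utilization signal that does not exist in this
form yet, and the threshold is a parameter). With threshold_pct at 95 this is
the >95% test above; Alex's reply below argues for 100:

  /* Sketch: overload test based on tracked utilization rather than
   * sg.nr_running >= sg.capacity. 'capacity' is the value of 'util'
   * that would mean the group is 100% busy. */
  static inline bool sg_overloaded(unsigned long util, unsigned long capacity,
                                   unsigned int threshold_pct)
  {
          return util * 100 >= capacity * threshold_pct;
  }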

Given the 3-task {A,B,C} workload where A and B are niced, we'd want it
to land on a symmetric dual-CPU system as {A,B}+{C}, assuming they're
all while(1) loops :-).

The harder case is where all 3 tasks are of equal weight; in which case
fairness would mandate we (slowly) rotate the tasks such that they all
get 2/3 time -- we also horribly fail at this :-)

2014-01-08 08:32:42

by Alex Shi

Subject: Re: [RFC] sched: CPU topology try

On 01/08/2014 04:49 AM, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 03:41:54PM +0000, Morten Rasmussen wrote:
>> I think that could work if we sort of the priority scaling issue that I
>> mentioned before.
>
> We talked a bit about this on IRC a month or so ago, right? My memories
> from that are that your main complaint is that we don't detect the
> overload scenario right.
>
> That is; the point at which we should start caring about SMP-nice is
> when all our CPUs are fully occupied, because up to that point we're
> under utilized and work preservation mandates we utilize idle time.
>
> Currently we detect overload by sg.nr_running >= sg.capacity, which can
> be very misleading because while a cpu might have a task running 'now'
> it might be 99% idle.
>
> At which point I argued we should change the capacity thing anyhow. Ever
> since the runnable_avg patch set I've been arguing to change that into
> an actual utilization test.
>
> So I think that if we measure overload by something like >95% utilization
> on the entire group the load scaling again makes perfect sense.

In my old power-aware scheduling patchset, I had tried values from 95 to
99. But all those values lead to imbalance when we test while(1)-like
cases: e.g. in a group of 24 LCPUs, 24 * 5% > 1, i.e. the per-cpu headroom
adds up to more than one whole cpu, so a fully loaded group still doesn't
look overloaded. So I finally used 100% as the overload indicator, and in
testing 100% works well to find overload since few system services are
involved. :)
>
> Given the 3 task {A,B,C} workload where A and B are niced, to land on a
> symmetric dual CPU system like: {A,B}+{C}, assuming they're all while(1)
> loops :-).
>
> The harder case is where all 3 tasks are of equal weight; in which case
> fairness would mandate we (slowly) rotate the tasks such that they all
> get 2/3 time -- we also horribly fail at this :-)
>


--
Thanks
Alex

2014-01-08 08:37:40

by Peter Zijlstra

Subject: Re: [RFC] sched: CPU topology try

On Wed, Jan 08, 2014 at 04:32:18PM +0800, Alex Shi wrote:
> In my old power aware scheduling patchset, I had tried the 95 to 99. But
> all those values will lead imbalance when we test while(1) like cases.
> like in a 24LCPUs groups, 24*5% > 1. So, finally use 100% as overload
> indicator. And in testing 100% works well to find overload since few
> system service involved. :)

Ah indeed, so 100% it is ;-)

2014-01-08 08:38:12

by Alex Shi

Subject: Re: [RFC] sched: CPU topology try

On 01/07/2014 11:37 PM, Morten Rasmussen wrote:
>>> > >
>>> > > So something like:
>>> > >
>>> > > avg = runnable + p(i) * blocked; where p(i) \e [0,1]
>>> > >
>>> > > could maybe be used to replace the cpu_load array and still represent
>>> > > the concept of looking at a bigger picture for larger sets. Leaving open
>>> > > the details of the map p.
> Figuring out p is the difficult bit. AFAIK, with blocked load in its
> current form we don't have any clue when a task will reappear.

Yes, that's why we cannot find a suitable way to take the blocked load
into account in load balancing.


--
Thanks
Alex

2014-01-08 12:35:37

by Morten Rasmussen

Subject: Re: [RFC] sched: CPU topology try

On Tue, Jan 07, 2014 at 08:49:51PM +0000, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 03:41:54PM +0000, Morten Rasmussen wrote:
> > I think that could work if we sort of the priority scaling issue that I
> > mentioned before.
>
> We talked a bit about this on IRC a month or so ago, right? My memories
> from that are that your main complaint is that we don't detect the
> overload scenario right.
>
> That is; the point at which we should start caring about SMP-nice is
> when all our CPUs are fully occupied, because up to that point we're
> under utilized and work preservation mandates we utilize idle time.

Yes. I think I stated the problem differently, but I think we talk about
the same thing. Basically, priority-scaling in task load_contrib means
that runnable_load_avg and blocked_load_avg are poor indicators of cpu
load (available idle time). Priority scaling only makes sense when the
system is fully utilized. When that is not the case, it just gives us a
potentially very inaccurate picture of the load (available idle time).

Pretty much what you just said :-)

> Currently we detect overload by sg.nr_running >= sg.capacity, which can
> be very misleading because while a cpu might have a task running 'now'
> it might be 99% idle.
>
> At which point I argued we should change the capacity thing anyhow. Ever
> since the runnable_avg patch set I've been arguing to change that into
> an actual utilization test.
>
> So I think that if we measure overload by something like >95% utilization
> on the entire group the load scaling again makes perfect sense.

I agree that it makes more sense to change the overload test to be based
on some tracked load. How about the non-overloaded case? Load balancing
would have to be based on unweighted task loads in that case?

>
> Given the 3 task {A,B,C} workload where A and B are niced, to land on a
> symmetric dual CPU system like: {A,B}+{C}, assuming they're all while(1)
> loops :-).
>
> The harder case is where all 3 tasks are of equal weight; in which case
> fairness would mandate we (slowly) rotate the tasks such that they all
> get 2/3 time -- we also horribly fail at this :-)

I have encountered that one a number of times. All the middleware noise
in Android sometimes gives that effect.

I'm not sure if the NUMA guy would like a rotating scheduler though :-)

2014-01-08 12:43:28

by Peter Zijlstra

Subject: Re: [RFC] sched: CPU topology try

On Wed, Jan 08, 2014 at 12:35:34PM +0000, Morten Rasmussen wrote:
> > The harder case is where all 3 tasks are of equal weight; in which case
> > fairness would mandate we (slowly) rotate the tasks such that they all
> > get 2/3 time -- we also horribly fail at this :-)
>
> I have encountered that one a number of times. All the middleware noise
> in Android sometimes give that effect.

You've got a typo there: s/middleware/muddleware/ :-)

> I'm not sure if the NUMA guy would like rotating scheduler though :-)

Hurmph ;-) But yes, the N+1 tasks on an N cpu system is rotten; any
static solution gets 2 tasks that run at 50%, and any dynamic solution
gets the migration overhead issue.

So while the dynamic solution would indeed allow each task to (on
average) receive N/(N+1) time -- a vast improvement over the 50% thing --
it doesn't come without downsides.
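
Worked out for the dual-CPU case mentioned earlier in the thread:

  \text{fair share} = \frac{N}{N+1}\bigg|_{N=2} = \tfrac{2}{3}, \qquad
  \text{static placement } \{A,B\}+\{C\}:\; A = B = \tfrac{1}{2},\ C = 1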

2014-01-08 12:46:01

by Peter Zijlstra

Subject: Re: [RFC] sched: CPU topology try

On Wed, Jan 08, 2014 at 12:35:34PM +0000, Morten Rasmussen wrote:
> > Currently we detect overload by sg.nr_running >= sg.capacity, which can
> > be very misleading because while a cpu might have a task running 'now'
> > it might be 99% idle.
> >
> > At which point I argued we should change the capacity thing anyhow. Ever
> > since the runnable_avg patch set I've been arguing to change that into
> > an actual utilization test.
> >
> > So I think that if we measure overload by something like >95% utilization
> > on the entire group the load scaling again makes perfect sense.
>
> I agree that it make more sense to change the overload test to be based
> on some tracked load. How about the non-overloaded case? Load balancing
> would have to be based on unweighted task loads in that case?

Yeah, until we're overloaded our goal is to minimize idle time.

2014-01-08 12:52:31

by Morten Rasmussen

Subject: Re: [RFC] sched: CPU topology try

On Wed, Jan 08, 2014 at 08:37:16AM +0000, Peter Zijlstra wrote:
> On Wed, Jan 08, 2014 at 04:32:18PM +0800, Alex Shi wrote:
> > In my old power aware scheduling patchset, I had tried the 95 to 99. But
> > all those values will lead imbalance when we test while(1) like cases.
> > like in a 24LCPUs groups, 24*5% > 1. So, finally use 100% as overload
> > indicator. And in testing 100% works well to find overload since few
> > system service involved. :)
>
> Ah indeed, so 100% it is ;-)

If I remember correctly, Alex used the rq runnable_avg_sum (in rq->avg)
for this. It is the most obvious choice, but it takes ages to reach
100%.

#define LOAD_AVG_MAX_N 345

Worst case it takes 345 ms from when the system becomes fully utilized
after a long period of idle until the rq runnable_avg_sum reaches 100%.

An unweighted version of cfs_rq->runnable_load_avg and blocked_load_avg
wouldn't have that delay.
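
For reference, a back-of-envelope restatement using the 3.13-era constants
(decay factor y with y^32 = 1/2, one period of 1024 us treated as ~1 ms):

  y^{32} = \tfrac{1}{2}, \qquad 1~\text{period} = 1024~\mu\text{s} \approx 1~\text{ms},
  \qquad 345~\text{periods} \approx 345~\text{ms}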

Also, if we are changing the load balance behavior when all cpus are
fully utilized we may need to think about cases where the load is
hovering around the saturation threshold. But I don't think that is
important yet.

2014-01-08 13:04:45

by Peter Zijlstra

Subject: Re: [RFC] sched: CPU topology try

On Wed, Jan 08, 2014 at 12:52:28PM +0000, Morten Rasmussen wrote:

> If I remember correctly, Alex used the rq runnable_avg_sum (in rq->avg)
> for this. It is the most obvious choice, but it takes ages to reach
> 100%.
>
> #define LOAD_AVG_MAX_N 345
>
> Worst case it takes 345 ms from the system is becomes fully utilized
> after a long period of idle until the rq runnable_avg_sum reaches 100%.
>
> An unweigthed version of cfs_rq->runnable_load_avg and blocked_load_avg
> wouldn't have that delay.

Right.. not sure we want to involve blocked load on the utilization
metric, but who knows maybe that does make sense.

But yes, we need unweighted runnable_avg.

> Also, if we are changing the load balance behavior when all cpus are
> fully utilized

We already have this tipping point. See all the has_capacity bits. But
yes, it'd get more involved I suppose.

> we may need to think about cases where the load is
> hovering around the saturation threshold. But I don't think that is
> important yet.

Yah.. I'm going to wait until we have a fail case that can give us
some guidance before really pondering this though :-)

2014-01-08 13:27:41

by Morten Rasmussen

Subject: Re: [RFC] sched: CPU topology try

On Wed, Jan 08, 2014 at 12:45:34PM +0000, Peter Zijlstra wrote:
> On Wed, Jan 08, 2014 at 12:35:34PM +0000, Morten Rasmussen wrote:
> > > Currently we detect overload by sg.nr_running >= sg.capacity, which can
> > > be very misleading because while a cpu might have a task running 'now'
> > > it might be 99% idle.
> > >
> > > At which point I argued we should change the capacity thing anyhow. Ever
> > > since the runnable_avg patch set I've been arguing to change that into
> > > an actual utilization test.
> > >
> > > So I think that if we measure overload by something like >95% utilization
> > > on the entire group the load scaling again makes perfect sense.
> >
> > I agree that it make more sense to change the overload test to be based
> > on some tracked load. How about the non-overloaded case? Load balancing
> > would have to be based on unweighted task loads in that case?
>
> Yeah, until we're overloaded our goal is to minimize idle time.

I would say, make the most of the available cpu cycles. Minimizing idle
time is not always the right thing to do when considering power
awareness.

If we know the actual load of the tasks, we may be able to consolidate
them on fewer cpus and save power by idling cpus. In that case the idle
time (total) is unchanged (unless the P-state is changed). Somewhat
similar to the video use-case running on 1, 2, and 4 cpu that I reposted
yesterday.

2014-01-08 13:33:34

by Peter Zijlstra

Subject: Re: [RFC] sched: CPU topology try

On Wed, Jan 08, 2014 at 01:27:39PM +0000, Morten Rasmussen wrote:
> On Wed, Jan 08, 2014 at 12:45:34PM +0000, Peter Zijlstra wrote:
> > On Wed, Jan 08, 2014 at 12:35:34PM +0000, Morten Rasmussen wrote:
> > > > Currently we detect overload by sg.nr_running >= sg.capacity, which can
> > > > be very misleading because while a cpu might have a task running 'now'
> > > > it might be 99% idle.
> > > >
> > > > At which point I argued we should change the capacity thing anyhow. Ever
> > > > since the runnable_avg patch set I've been arguing to change that into
> > > > an actual utilization test.
> > > >
> > > > So I think that if we measure overload by something like >95% utilization
> > > > on the entire group the load scaling again makes perfect sense.
> > >
> > > I agree that it make more sense to change the overload test to be based
> > > on some tracked load. How about the non-overloaded case? Load balancing
> > > would have to be based on unweighted task loads in that case?
> >
> > Yeah, until we're overloaded our goal is to minimize idle time.
>
> I would say, make the most of the available cpu cycles. Minimizing idle
> time is not always the right thing to do when considering power
> awareness.
>
> If we know the actual load of the tasks, we may be able to consolidate

I think we must start to be careful with the word 'load'; I think you
meant to say 'utilization'.

> them on fewer cpus and save power by idling cpus. In that case the idle
> time (total) is unchanged (unless the P-state is changed). Somewhat
> similar to the video use-case running on 1, 2, and 4 cpu that I reposted
> yesterday.

But fair enough.. It's idle time when you consider CPUs to always run at
max frequency, but clearly I must stop thinking about CPUs like that :-)

2014-01-08 13:33:32

by Morten Rasmussen

Subject: Re: [RFC] sched: CPU topology try

On Wed, Jan 08, 2014 at 01:04:07PM +0000, Peter Zijlstra wrote:
> On Wed, Jan 08, 2014 at 12:52:28PM +0000, Morten Rasmussen wrote:
>
> > If I remember correctly, Alex used the rq runnable_avg_sum (in rq->avg)
> > for this. It is the most obvious choice, but it takes ages to reach
> > 100%.
> >
> > #define LOAD_AVG_MAX_N 345
> >
> > Worst case it takes 345 ms from the system is becomes fully utilized
> > after a long period of idle until the rq runnable_avg_sum reaches 100%.
> >
> > An unweigthed version of cfs_rq->runnable_load_avg and blocked_load_avg
> > wouldn't have that delay.
>
> Right.. not sure we want to involve blocked load on the utilization
> metric, but who knows maybe that does make sense.
>
> But yes, we need unweighted runnable_avg.

I'm not sure about the blocked load either.

>
> > Also, if we are changing the load balance behavior when all cpus are
> > fully utilized
>
> We already have this tipping point. See all the has_capacity bits. But
> yes, it'd get more involved I suppose.

I'll have a look.

2014-01-08 13:45:21

by Morten Rasmussen

Subject: Re: [RFC] sched: CPU topology try

On Wed, Jan 08, 2014 at 01:32:57PM +0000, Peter Zijlstra wrote:
> On Wed, Jan 08, 2014 at 01:27:39PM +0000, Morten Rasmussen wrote:
> > On Wed, Jan 08, 2014 at 12:45:34PM +0000, Peter Zijlstra wrote:
> > > On Wed, Jan 08, 2014 at 12:35:34PM +0000, Morten Rasmussen wrote:
> > > > > Currently we detect overload by sg.nr_running >= sg.capacity, which can
> > > > > be very misleading because while a cpu might have a task running 'now'
> > > > > it might be 99% idle.
> > > > >
> > > > > At which point I argued we should change the capacity thing anyhow. Ever
> > > > > since the runnable_avg patch set I've been arguing to change that into
> > > > > an actual utilization test.
> > > > >
> > > > > So I think that if we measure overload by something like >95% utilization
> > > > > on the entire group the load scaling again makes perfect sense.
> > > >
> > > > I agree that it make more sense to change the overload test to be based
> > > > on some tracked load. How about the non-overloaded case? Load balancing
> > > > would have to be based on unweighted task loads in that case?
> > >
> > > Yeah, until we're overloaded our goal is to minimize idle time.
> >
> > I would say, make the most of the available cpu cycles. Minimizing idle
> > time is not always the right thing to do when considering power
> > awareness.
> >
> > If we know the actual load of the tasks, we may be able to consolidate
>
> I think we must start to be careful with the word load, I think you
> meant to say utilization.

Indeed, I meant utilization.

>
> > them on fewer cpus and save power by idling cpus. In that case the idle
> > time (total) is unchanged (unless the P-state is changed). Somewhat
> > similar to the video use-case running on 1, 2, and 4 cpu that I reposted
> > yesterday.
>
> But fair enough.. Its idle time when you consider CPUs to always run at
> max frequency, but clearly I must stop thinking about CPUs like that :-)

Yes, it opens a whole new world of problems to be solved :-)