2014-01-01 05:04:04

by Preeti U Murthy

Subject: Re: [RFC] sched: CPU topology try

Hi Vincent,

On 12/18/2013 06:43 PM, Vincent Guittot wrote:
> This patch applies on top of the two patches [1][2] that have been proposed by
> Peter for creating a new way to initialize sched_domain. It includes some minor
> compilation fixes and a trial of using this new method on ARM platform.
> [1] https://lkml.org/lkml/2013/11/5/239
> [2] https://lkml.org/lkml/2013/11/5/449
>
> Based on the results of these tests, my feelings about this new way to init the
> sched_domain are mixed.
>
> The good point is that I have been able to create the same sched_domain
> topologies as before and even more complex ones (where a subset of the cores
> in a cluster share their powergating capabilities). I have described various
> topology results below.
>
> I use a system that is made of a dual cluster of quad cores with hyperthreading
> for my examples.
>
> If one cluster (0-7) can powergate its cores independently but not the other
> cluster (8-15) we have the following topology, which is equal to what I had
> previously:
>
> CPU0:
> domain 0: span 0-1 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 0 1
> domain 1: span 0-7 level: MC
> flags: SD_SHARE_PKG_RESOURCES
> groups: 0-1 2-3 4-5 6-7
> domain 2: span 0-15 level: CPU
> flags:
> groups: 0-7 8-15
>
> CPU8
> domain 0: span 8-9 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 8 9
> domain 1: span 8-15 level: MC
> flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 8-9 10-11 12-13 14-15
> domain 2: span 0-15 level CPU
> flags:
> groups: 8-15 0-7
>
> We can even describe some more complex topologies if a subset (2-7) of the
> cluster can't powergate independently:
>
> CPU0:
> domain 0: span 0-1 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 0 1
> domain 1: span 0-7 level: MC
> flags: SD_SHARE_PKG_RESOURCES
> groups: 0-1 2-7
> domain 2: span 0-15 level: CPU
> flags:
> groups: 0-7 8-15
>
> CPU2:
> domain 0: span 2-3 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 2 3
> domain 1: span 2-7 level: MC
> flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 2-3 4-5 6-7
> domain 2: span 0-7 level: MC
> flags: SD_SHARE_PKG_RESOURCES
> groups: 2-7 0-1
> domain 3: span 0-15 level: CPU
> flags:
> groups: 0-7 8-15
>
> In this case, we have an additional sched_domain MC level for this subset (2-7)
> of cores so we can trigger some load balancing in this subset before doing that
> on the complete cluster (which is the last level of cache in my example)
>
> We can add more levels that will describe other dependencies/independencies like
> the frequency scaling dependency and as a result the final sched_domain
> topology will have additional levels (if they have not been removed during
> the degenerate sequence)

The design looks good to me. In my opinion information like P-states and
C-states dependency can be kept separate from the topology levels, it
might get too complicated unless the information is tightly coupled to
the topology.

>
> My concern is about the configuration of the table that is used to create the
> sched_domain. Some levels are "duplicated" with different flags configuration

I do not feel this is a problem since the levels are not duplicated,
rather they have different properties within them which is best
represented by flags like you have introduced in this patch.
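
To make the "duplicated levels" concern concrete, the kind of per-arch table being
discussed would look roughly like the sketch below. This is illustrative only: the
struct, field and function names (cpu_corepower_mask/flags in particular) are
assumptions and are not taken verbatim from Peter's patches. It is a kernel-internal
snippet and relies on the usual sched/cpumask headers.

static inline int cpu_smt_flags(void)
{
	return SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES |
	       SD_SHARE_POWERDOMAIN;
}

static inline int cpu_corepower_flags(void)
{
	return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
}

static inline int cpu_core_flags(void)
{
	return SD_SHARE_PKG_RESOURCES;
}

static struct sched_domain_topology_level arm_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
	/* two MC-like levels with different flags: the powergating subset
	 * first, then the full cluster sharing the package resources */
	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
	{ cpu_cpu_mask, NULL, SD_INIT_NAME(CPU) },
	{ NULL, },
};

Here the order matters: cpu_corepower_mask must always return a subset of
cpu_coregroup_mask so that each parent level covers all cpus of its child.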

> which makes the table not easily readable, and we must also take care of the
> order because parents have to gather all the cpus of their children. So we must
> choose which capabilities will be a subset of the other one. The order is

The sched domain levels which have SD_SHARE_POWERDOMAIN set are expected
to have cpus which are a subset of the cpus that this domain would have
included had this flag not been set. In addition to this, every higher
domain, irrespective of SD_SHARE_POWERDOMAIN being set, will include all
cpus of the lower domains. As far as I see, this patch does not change
these assumptions. Hence I am unable to imagine a scenario where the
parent might not include all cpus of its child domains. Do you have
such a scenario in mind which can arise due to this patch?
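
As a rough sketch of the containment property being described, using the kernel's
cpumask helpers (the check itself is illustrative, not part of the patch, and relies
on the usual sched/cpumask headers):

static bool sd_spans_are_nested(struct sched_domain *sd)
{
	/* every parent's span must contain all cpus of its child,
	 * whether or not SD_SHARE_POWERDOMAIN is set at either level */
	for (; sd && sd->parent; sd = sd->parent) {
		if (!cpumask_subset(sched_domain_span(sd),
				    sched_domain_span(sd->parent)))
			return false;
	}
	return true;
}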

Thanks

Regards
Preeti U Murthy


2014-01-06 16:34:09

by Peter Zijlstra

Subject: Re: [RFC] sched: CPU topology try

On Wed, Jan 01, 2014 at 10:30:33AM +0530, Preeti U Murthy wrote:
> The design looks good to me. In my opinion information like P-states and
> C-states dependency can be kept separate from the topology levels, it
> might get too complicated unless the information is tightly coupled to
> the topology.

I'm not entirely convinced we can keep them separated, the moment we
have multiple CPUs sharing a P or C state we need somewhere to manage
the shared state and the domain tree seems like the most natural place
for this.

Now it might well be both P and C states operate at 'natural' domains
which we already have so it might be 'easy'.
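
For illustration only, a purely hypothetical sketch of what "managing the shared
state in the domain tree" could look like; none of these types or fields exist in
the current code:

struct sd_power_shared {
	raw_spinlock_t	lock;		/* serializes updates from member cpus */
	unsigned int	pstate_req;	/* aggregated P-state request for the domain */
	unsigned int	cstate_hint;	/* deepest C-state allowed domain-wide */
};

/* one instance would hang off the sched_domain level whose span
 * matches the hardware P/C-state domain */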

2014-01-06 16:37:20

by Arjan van de Ven

Subject: Re: [RFC] sched: CPU topology try

On 1/6/2014 8:33 AM, Peter Zijlstra wrote:
> On Wed, Jan 01, 2014 at 10:30:33AM +0530, Preeti U Murthy wrote:
>> The design looks good to me. In my opinion information like P-states and
>> C-states dependency can be kept separate from the topology levels, it
>> might get too complicated unless the information is tightly coupled to
>> the topology.
>
> I'm not entirely convinced we can keep them separated, the moment we
> have multiple CPUs sharing a P or C state we need somewhere to manage
> the shared state and the domain tree seems like the most natural place
> for this.
>
> Now it might well be both P and C states operate at 'natural' domains
> which we already have so it might be 'easy'.

more than that though.. P and C state sharing is mostly hidden from the OS
(because the OS does not have the ability to do this); e.g. there are things
that do "if THIS cpu goes idle, the OTHER cpu's P state changes automatically".

that's not just on x86, the ARM guys (iirc at least the latest snapdragon) are going in that
direction as well.....

for those systems, the OS really should just make local decisions and let the hardware
cope with hardware grouping.
>

2014-01-06 16:49:08

by Peter Zijlstra

Subject: Re: [RFC] sched: CPU topology try

On Mon, Jan 06, 2014 at 08:37:13AM -0800, Arjan van de Ven wrote:
> On 1/6/2014 8:33 AM, Peter Zijlstra wrote:
> >On Wed, Jan 01, 2014 at 10:30:33AM +0530, Preeti U Murthy wrote:
> >>The design looks good to me. In my opinion information like P-states and
> >>C-states dependency can be kept separate from the topology levels, it
> >>might get too complicated unless the information is tightly coupled to
> >>the topology.
> >
> >I'm not entirely convinced we can keep them separated, the moment we
> >have multiple CPUs sharing a P or C state we need somewhere to manage
> >the shared state and the domain tree seems like the most natural place
> >for this.
> >
> >Now it might well be both P and C states operate at 'natural' domains
> >which we already have so it might be 'easy'.
>
> more than that though.. P and C state sharing is mostly hidden from the OS
> (because the OS does not have the ability to do this); e.g. there are things
> that do "if THIS cpu goes idle, the OTHER cpu's P state changes automatically".
>
> that's not just on x86, the ARM guys (iirc at least the latest snapdragon) are going in that
> direction as well.....
>
> for those systems, the OS really should just make local decisions and let the hardware
> cope with hardware grouping.

AFAICT this is a chicken-egg problem, the OS never did anything useful
with it so the hardware guys are now trying to do something with it, but
this also means that if we cannot predict what the hardware will do
under certain circumstances the OS really cannot do anything smart
anymore.

So yes, for certain hardware we'll just have to give up and not do
anything.

That said, some hardware still does allow us to do something and for
those we do need some of this.

Maybe if the OS becomes smart enough the hardware guys will give us some
control again, who knows.

So yes, I'm entirely fine saying that some chips are fucked and we can't
do anything sane with them.. Fine they get to sort things themselves.

2014-01-06 16:55:27

by Peter Zijlstra

Subject: Re: [RFC] sched: CPU topology try

On Mon, Jan 06, 2014 at 05:48:38PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 06, 2014 at 08:37:13AM -0800, Arjan van de Ven wrote:
> > On 1/6/2014 8:33 AM, Peter Zijlstra wrote:
> > >On Wed, Jan 01, 2014 at 10:30:33AM +0530, Preeti U Murthy wrote:
> > >>The design looks good to me. In my opinion information like P-states and
> > >>C-states dependency can be kept separate from the topology levels, it
> > >>might get too complicated unless the information is tightly coupled to
> > >>the topology.
> > >
> > >I'm not entirely convinced we can keep them separated, the moment we
> > >have multiple CPUs sharing a P or C state we need somewhere to manage
> > >the shared state and the domain tree seems like the most natural place
> > >for this.
> > >
> > >Now it might well be both P and C states operate at 'natural' domains
> > >which we already have so it might be 'easy'.
> >
> > more than that though.. P and C state sharing is mostly hidden from the OS
> > (because the OS does not have the ability to do this); e.g. there are things
> > that do "if THIS cpu goes idle, the OTHER cpu's P state changes automatically".
> >
> > that's not just on x86, the ARM guys (iirc at least the latest snapdragon) are going in that
> > direction as well.....
> >
> > for those systems, the OS really should just make local decisions and let the hardware
> > cope with hardware grouping.
>
> AFAICT this is a chicken-egg problem, the OS never did anything useful
> with it so the hardware guys are now trying to do something with it, but
> this also means that if we cannot predict what the hardware will do
> under certain circumstances the OS really cannot do anything smart
> anymore.
>
> So yes, for certain hardware we'll just have to give up and not do
> anything.
>
> That said, some hardware still does allow us to do something and for
> those we do need some of this.
>
> Maybe if the OS becomes smart enough the hardware guys will give us some
> control again, who knows.
>
> So yes, I'm entirely fine saying that some chips are fucked and we can't
> do anything sane with them.. Fine they get to sort things themselves.

That is; you're entirely unhelpful and I'm tempted to stop listening
to whatever you have to say on the subject.

Most of your emails are about how stuff cannot possibly work; without
saying how things can work.

The entire point of adding P and C state information to the scheduler is
so that we CAN do cross cpu decisions, but if you're saying we shouldn't
attempt because you can't say how the hardware will react anyway; fine
we'll ignore Intel hardware from now on.

So bloody stop saying what cannot work and start telling how we can make
useful cross cpu decisions.

2014-01-06 17:13:33

by Arjan van de Ven

Subject: Re: [RFC] sched: CPU topology try


>> AFAICT this is a chicken-egg problem, the OS never did anything useful
>> with it so the hardware guys are now trying to do something with it, but
>> this also means that if we cannot predict what the hardware will do
>> under certain circumstances the OS really cannot do anything smart
>> anymore.
>>
>> So yes, for certain hardware we'll just have to give up and not do
>> anything.
>>
>> That said, some hardware still does allow us to do something and for
>> those we do need some of this.
>>
>> Maybe if the OS becomes smart enough the hardware guys will give us some
>> control again, who knows.
>>
>> So yes, I'm entirely fine saying that some chips are fucked and we can't
>> do anything sane with them.. Fine they get to sort things themselves.
>
> That is; you're entirely unhelpful and I'm tempted to stop listening
> to whatever you have to say on the subject.
>
> Most of your emails are about how stuff cannot possibly work; without
> saying how things can work.
>
> The entire point of adding P and C state information to the scheduler is
> so that we CAN do cross cpu decisions, but if you're saying we shouldn't
> attempt because you can't say how the hardware will react anyway; fine
> we'll ignore Intel hardware from now on.

that's not what I'm trying to say.

if we as the OS want to help make such decisions, we also need to face the reality of what that means,
and see how we can get there.

let me give a simple but common example case, of a 2 core system where the cores share P state.
one task (A) is high priority/high utilization/whatever
(e.g. causes the OS to ask for high performance from the CPU if by itself)
the other task (B), on the 2nd core, is not that high priority/utilization/etc
(e.g. would cause the OS to ask for max power savings from the CPU if by itself)


time   core 0           core 1    what the combined probably should be
  0    task A           idle      max performance
  1    task A           task B    max performance
  2    idle (disk IO)   task B    least power
  3    task A           task B    max performance

e.g. a simple case of task A running, and task B coming in... but then task A blocks briefly,
on say disk IO or some mutex or whatever.

we as the OS will need to figure out how to get to the combined result, in a way that's relatively race free,
with two common races to take care of:
* knowing if another core is idle at any time is inherently racy.. it may wake up or go idle the next cycle
* in hardware modes where the OS controls everything, the P state registers tend to work in a "the last one
to write on any core controls them all" way; we need to make sure we don't fight ourselves here, and assign
one core to do this decision/communication to hardware on behalf of the whole domain (even if that
assignment may move around when the assigned core goes idle) rather than the various cores doing it
themselves asynchronously (a rough sketch of this idea follows below).
This tends to be harder than it seems if you also don't want to lose efficiency (e.g. no significant extra
wakeups from idle and also not missing opportunities to go to "least power" in the "time 2" scenario above)
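
To illustrate the "one core decides on behalf of the whole domain" idea, here is a
minimal sketch, assuming an OS-controlled, last-writer-wins shared P state register.
freq_domain_mask() and write_pstate_msr() are made-up names, and the snippet relies
on the usual percpu/cpumask/sched headers:

static DEFINE_PER_CPU(unsigned int, pstate_request);

static void update_domain_pstate(int this_cpu)
{
	const struct cpumask *dom = freq_domain_mask(this_cpu);	/* hypothetical */
	unsigned int combined = 0;
	int cpu;

	/* only one elected cpu per domain writes the shared register,
	 * so the cores don't fight each other's requests */
	if (this_cpu != cpumask_first(dom))
		return;

	for_each_cpu(cpu, dom) {
		/* inherently racy: an idle cpu may wake up right after we look */
		if (!idle_cpu(cpu))
			combined = max(combined, per_cpu(pstate_request, cpu));
	}

	write_pstate_msr(combined);	/* hypothetical hw interface */
}

A real version would also have to hand the writer role over when that cpu goes idle,
which is exactly where the extra-wakeup and missed "least power" concerns come from.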


x86 and modern ARM (snapdragon at least) do this kind of coordination in hardware/microcontroller (with an opt in for the OS to
do it itself on x86 and likely snapdragon) which means the race conditions are not really there.

2014-01-07 12:40:31

by Vincent Guittot

Subject: Re: [RFC] sched: CPU topology try

On 1 January 2014 06:00, Preeti U Murthy <[email protected]> wrote:
> Hi Vincent,
>
> On 12/18/2013 06:43 PM, Vincent Guittot wrote:
>> This patch applies on top of the two patches [1][2] that have been proposed by
>> Peter for creating a new way to initialize sched_domain. It includes some minor
>> compilation fixes and a trial of using this new method on ARM platform.
>> [1] https://lkml.org/lkml/2013/11/5/239
>> [2] https://lkml.org/lkml/2013/11/5/449
>>
>> Based on the results of these tests, my feelings about this new way to init the
>> sched_domain are mixed.
>>
>> The good point is that I have been able to create the same sched_domain
>> topologies as before and even more complex ones (where a subset of the cores
>> in a cluster share their powergating capabilities). I have described various
>> topology results below.
>>
>> I use a system that is made of a dual cluster of quad cores with hyperthreading
>> for my examples.
>>
>> If one cluster (0-7) can powergate its cores independently but not the other
>> cluster (8-15) we have the following topology, which is equal to what I had
>> previously:
>>
>> CPU0:
>> domain 0: span 0-1 level: SMT
>> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>> groups: 0 1
>> domain 1: span 0-7 level: MC
>> flags: SD_SHARE_PKG_RESOURCES
>> groups: 0-1 2-3 4-5 6-7
>> domain 2: span 0-15 level: CPU
>> flags:
>> groups: 0-7 8-15
>>
>> CPU8
>> domain 0: span 8-9 level: SMT
>> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>> groups: 8 9
>> domain 1: span 8-15 level: MC
>> flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>> groups: 8-9 10-11 12-13 14-15
>> domain 2: span 0-15 level CPU
>> flags:
>> groups: 8-15 0-7
>>
>> We can even describe some more complex topologies if a subset (2-7) of the
>> cluster can't powergate independently:
>>
>> CPU0:
>> domain 0: span 0-1 level: SMT
>> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>> groups: 0 1
>> domain 1: span 0-7 level: MC
>> flags: SD_SHARE_PKG_RESOURCES
>> groups: 0-1 2-7
>> domain 2: span 0-15 level: CPU
>> flags:
>> groups: 0-7 8-15
>>
>> CPU2:
>> domain 0: span 2-3 level: SMT
>> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>> groups: 2 3
>> domain 1: span 2-7 level: MC
>> flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>> groups: 2-3 4-5 6-7
>> domain 2: span 0-7 level: MC
>> flags: SD_SHARE_PKG_RESOURCES
>> groups: 2-7 0-1
>> domain 3: span 0-15 level: CPU
>> flags:
>> groups: 0-7 8-15
>>
>> In this case, we have an additional sched_domain MC level for this subset (2-7)
>> of cores so we can trigger some load balancing in this subset before doing that
>> on the complete cluster (which is the last level of cache in my example)
>>
>> We can add more levels that will describe other dependencies/independencies like
>> the frequency scaling dependency and as a result the final sched_domain
>> topology will have additional levels (if they have not been removed during
>> the degenerate sequence)
>
> The design looks good to me. In my opinion information like P-states and
> C-states dependency can be kept separate from the topology levels, it
> might get too complicated unless the information is tightly coupled to
> the topology.
>
>>
>> My concern is about the configuration of the table that is used to create the
>> sched_domain. Some levels are "duplicated" with different flags configuration
>
> I do not feel this is a problem since the levels are not duplicated,
> rather they have different properties within them which is best
> represented by flags like you have introduced in this patch.
>
>> which makes the table not easily readable, and we must also take care of the
>> order because parents have to gather all the cpus of their children. So we must
>> choose which capabilities will be a subset of the other one. The order is
>
> The sched domain levels which have SD_SHARE_POWERDOMAIN set are expected
> to have cpus which are a subset of the cpus that this domain would have
> included had this flag not been set. In addition to this, every higher
> domain, irrespective of SD_SHARE_POWERDOMAIN being set, will include all
> cpus of the lower domains. As far as I see, this patch does not change
> these assumptions. Hence I am unable to imagine a scenario where the
> parent might not include all cpus of its child domains. Do you have
> such a scenario in mind which can arise due to this patch?

My patch doesn't have this issue because I have added only 1 layer, which is
always a subset of the current cache-level topology, but if we add
another feature with another layer, we have to decide which feature
will be a subset of the other one.

Vincent

>
> Thanks
>
> Regards
> Preeti U Murthy
>