Hi Vincent, Peter,
On 12/18/2013 06:43 PM, Vincent Guittot wrote:
> This patch applies on top of the two patches [1][2] that have been proposed by
> Peter for creating a new way to initialize sched_domain. It includes some minor
> compilation fixes and a trial of using this new method on ARM platform.
> [1] https://lkml.org/lkml/2013/11/5/239
> [2] https://lkml.org/lkml/2013/11/5/449
>
> Based on the results of these tests, my feelings about this new way to init the
> sched_domain are mixed.
>
> The good point is that I have been able to create the same sched_domain
> topologies as before and even more complex ones (where a subset of the cores
> in a cluster share their powergating capabilities). I have described various
> topology results below.
>
> I use a system that is made of a dual cluster of quad cores with hyperthreading
> for my examples.
>
> If one cluster (0-7) can powergate its cores independently but the other
> cluster (8-15) cannot, we have the following topology, which is equal to
> what I had previously:
>
> CPU0:
> domain 0: span 0-1 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 0 1
> domain 1: span 0-7 level: MC
> flags: SD_SHARE_PKG_RESOURCES
> groups: 0-1 2-3 4-5 6-7
> domain 2: span 0-15 level: CPU
> flags:
> groups: 0-7 8-15
>
> CPU8:
> domain 0: span 8-9 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 8 9
> domain 1: span 8-15 level: MC
> flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 8-9 10-11 12-13 14-15
> domain 2: span 0-15 level: CPU
> flags:
> groups: 8-15 0-7
>
> We can even describe some more complex topologies if a subset (2-7) of the
> cluster can't powergate independently:
>
> CPU0:
> domain 0: span 0-1 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 0 1
> domain 1: span 0-7 level: MC
> flags: SD_SHARE_PKG_RESOURCES
> groups: 0-1 2-7
> domain 2: span 0-15 level: CPU
> flags:
> groups: 0-7 8-15
>
> CPU2:
> domain 0: span 2-3 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 2 3
> domain 1: span 2-7 level: MC
> flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 2-3 4-5 6-7
> domain 2: span 0-7 level: MC
> flags: SD_SHARE_PKG_RESOURCES
> groups: 2-7 0-1
> domain 3: span 0-15 level: CPU
> flags:
> groups: 0-7 8-15
>
> In this case, we have an additional sched_domain MC level for this subset (2-7)
> of cores, so we can trigger some load balancing within this subset before doing
> it on the complete cluster (which is the last level of cache in my example).
>
> We can add more levels that describe other dependencies or independencies,
> such as the frequency scaling dependency. As a result, the final sched_domain
> topology will have additional levels (if they have not been removed during
> the degenerate sequence).
>
> My concern is about the configuration of the table that is used to create the
> sched_domain. Some levels are "duplicated" with different flag configurations,
> which makes the table hard to read. We must also take care of the order,
> because each parent has to gather all the cpus of its children. So we must
> choose which capabilities will be a subset of the others. The order is
> almost straightforward when we describe 1 or 2 kinds of capabilities
> (package resource sharing and power sharing), but it can become complex if we
> want to add more.
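
To make the ordering constraint concrete, here is a minimal sketch. It is a
Python model rather than kernel C, and the Level class, the flag strings and
check_nesting() are hypothetical names chosen for illustration, not the
interface from Peter's patches. The spans are the ones CPU2 sees in the second
example above. The only thing it checks is that every level's span contains
the span of the level below it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    """One row of a hypothetical topology table, as seen from a single CPU."""
    name: str
    span: frozenset   # CPUs covered by this level
    flags: frozenset  # SD_* flag names, kept as plain strings for illustration


def cpus(first, last):
    """Inclusive CPU range, e.g. cpus(2, 7) covers CPUs 2 through 7."""
    return frozenset(range(first, last + 1))


def check_nesting(table):
    """Return the names of levels whose span does not contain the child's span."""
    bad = []
    for child, parent in zip(table, table[1:]):
        if not child.span <= parent.span:
            bad.append(parent.name)
    return bad


# CPU2's view: two MC rows that differ only in span and flags.
cpu2_table = [
    Level("SMT", cpus(2, 3), frozenset({"SD_SHARE_CPUPOWER",
                                        "SD_SHARE_PKG_RESOURCES",
                                        "SD_SHARE_POWERDOMAIN"})),
    Level("MC", cpus(2, 7), frozenset({"SD_SHARE_PKG_RESOURCES",
                                       "SD_SHARE_POWERDOMAIN"})),
    Level("MC", cpus(0, 7), frozenset({"SD_SHARE_PKG_RESOURCES"})),
    Level("CPU", cpus(0, 15), frozenset()),
]

# Swapping the two MC rows breaks the parent-contains-child rule.
swapped = [cpu2_table[0], cpu2_table[2], cpu2_table[1], cpu2_table[3]]
```

check_nesting(cpu2_table) reports nothing, while check_nesting(swapped) flags
the MC row that now sits above a wider child. With only two kinds of
capability the right order is easy to see; each extra capability adds rows
that have to be slotted in by hand.
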
What if we want to add arch-specific flags to the NUMA domain? Currently,
with Peter's patch (https://lkml.org/lkml/2013/11/5/239) and this patch,
the arch can modify the sd flags of the topology levels only up to, but not
including, the NUMA domain. The flags for the NUMA domain get initialized
in sd_init_numa(). Perhaps we need to call into the arch there to probe for
additional flags?
Thanks
Regards
Preeti U Murthy
>
> Regards
> Vincent
>
On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
> What if we want to add arch specific flags to the NUMA domain? Currently
> with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
> the arch can modify the sd flags of the topology levels till just before
> the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
> initialized. We need to perhaps call into arch here to probe for
> additional flags?
What are you thinking of? I was hoping all NUMA details were captured in
the distance table.
It's far easier to talk of specifics in this case.
On 01/07/2014 03:20 PM, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
>> What if we want to add arch specific flags to the NUMA domain? Currently
>> with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
>> the arch can modify the sd flags of the topology levels till just before
>> the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
>> initialized. We need to perhaps call into arch here to probe for
>> additional flags?
>
> What are you thinking of? I was hoping all NUMA details were captured in
> the distance table.
>
> Its far easier to talk of specifics in this case.
>
If the processor's cores can be power gated individually, then we save
very little power by consolidating all the load onto a single node in a
NUMA domain. Whether we run 6 cores on one node or 3 cores on each of two
nodes, power is drawn by 6 cores in all. So I was thinking that in this
circumstance we might want to set the SD_SHARE_POWERDOMAIN flag at the
NUMA domain and spread the load if that favours the workload.
Regards
Preeti U Murthy
On Tue, Jan 07, 2014 at 04:09:39PM +0530, Preeti U Murthy wrote:
> On 01/07/2014 03:20 PM, Peter Zijlstra wrote:
> > On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
> >> What if we want to add arch specific flags to the NUMA domain? Currently
> >> with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
> >> the arch can modify the sd flags of the topology levels till just before
> >> the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
> >> initialized. We need to perhaps call into arch here to probe for
> >> additional flags?
> >
> > What are you thinking of? I was hoping all NUMA details were captured in
> > the distance table.
> >
> > Its far easier to talk of specifics in this case.
> >
> If the processor can be core gated, then there is very little power
> savings that we could yield from consolidating all the load onto a
> single node in a NUMA domain. 6 cores on one node or 3 cores each on two
> nodes, the power is drawn by 6 cores in all. So I was thinking under
> this circumstance we might want to set the SD_SHARE_POWERDOMAIN flag at
> the NUMA domain and spread the load if it favours the workload.
So Intel has so far not said a lot of sensible things about power
management on their multi-socket platform.
And I've not heard anything at all from IBM on the POWER chips.
What I know from the Intel side is that package idle hardly saves
anything when compared to the DRAM power and the cost of having to do
remote memory accesses.
In other words, I'm not at all considering power aware scheduling for
NUMA systems until someone starts talking sense :-)
On Tue, Jan 07, 2014 at 10:39:39AM +0000, Preeti U Murthy wrote:
> On 01/07/2014 03:20 PM, Peter Zijlstra wrote:
> > On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
> >> What if we want to add arch specific flags to the NUMA domain? Currently
> >> with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
> >> the arch can modify the sd flags of the topology levels till just before
> >> the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
> >> initialized. We need to perhaps call into arch here to probe for
> >> additional flags?
> >
> > What are you thinking of? I was hoping all NUMA details were captured in
> > the distance table.
> >
> > Its far easier to talk of specifics in this case.
> >
> If the processor can be core gated, then there is very little power
> savings that we could yield from consolidating all the load onto a
> single node in a NUMA domain. 6 cores on one node or 3 cores each on two
> nodes, the power is drawn by 6 cores in all.
Not being a NUMA expert, I would have thought that load consolidation at
node level would nearly always save power even when cpus can be power
gated individually. The number of cpus awake is the same, but you only
need to power the caches, memory, and other node peripherals for one
node instead of two in your example. Wouldn't that save power?
Memory/cache intensive workloads might benefit from spreading at node
level though.
Am I missing something?
Morten
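
A toy model of the trade-off under discussion, with made-up numbers: CORE_W
and NODE_W are assumptions for illustration only, not measurements of any
platform, and total_power() is a hypothetical helper. It treats total power as
a per-core cost plus a fixed cost for every node that has at least one busy
core (caches, memory, node peripherals).

```python
CORE_W = 5.0   # assumed power of one busy core, in watts (illustrative only)
NODE_W = 20.0  # assumed fixed cost of keeping one node powered (illustrative only)


def total_power(busy_cores_per_node, core_w=CORE_W, node_w=NODE_W):
    """Power for a placement given as a list of busy-core counts, one per node."""
    cores = sum(busy_cores_per_node)
    active_nodes = sum(1 for n in busy_cores_per_node if n > 0)
    return cores * core_w + active_nodes * node_w


consolidated = total_power([6, 0])  # 6 cores on one node
spread = total_power([3, 3])        # 3 cores on each of two nodes
```

With node_w set to 0 the two placements cost the same, which is the "6 cores
in all" argument. Any non-zero per-node cost favours consolidation. The model
deliberately ignores remote memory accesses and memory contention, so it says
nothing about performance.
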
On 7 January 2014 11:39, Preeti U Murthy wrote:
> On 01/07/2014 03:20 PM, Peter Zijlstra wrote:
>> On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
>>> What if we want to add arch specific flags to the NUMA domain? Currently
>>> with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
>>> the arch can modify the sd flags of the topology levels till just before
>>> the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
>>> initialized. We need to perhaps call into arch here to probe for
>>> additional flags?
>>
>> What are you thinking of? I was hoping all NUMA details were captured in
>> the distance table.
>>
>> Its far easier to talk of specifics in this case.
>>
> If the processor can be core gated, then there is very little power
> savings that we could yield from consolidating all the load onto a
> single node in a NUMA domain. 6 cores on one node or 3 cores each on two
> nodes, the power is drawn by 6 cores in all. So I was thinking under
> this circumstance we might want to set the SD_SHARE_POWERDOMAIN flag at
> the NUMA domain and spread the load if it favours the workload.
Keeping tasks running on cores that are close to their memory (same
node) is the more power-efficient policy, isn't it? So it's probably
more about where to place the memory than about where to place the
tasks?
Vincent
>
> Regards
> Preeti U Murthy
>
On 01/07/2014 04:43 PM, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 04:09:39PM +0530, Preeti U Murthy wrote:
>> On 01/07/2014 03:20 PM, Peter Zijlstra wrote:
>>> On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
>>>> What if we want to add arch specific flags to the NUMA domain? Currently
>>>> with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
>>>> the arch can modify the sd flags of the topology levels till just before
>>>> the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
>>>> initialized. We need to perhaps call into arch here to probe for
>>>> additional flags?
>>>
>>> What are you thinking of? I was hoping all NUMA details were captured in
>>> the distance table.
>>>
>>> Its far easier to talk of specifics in this case.
>>>
>> If the processor can be core gated, then there is very little power
>> savings that we could yield from consolidating all the load onto a
>> single node in a NUMA domain. 6 cores on one node or 3 cores each on two
>> nodes, the power is drawn by 6 cores in all. So I was thinking under
>> this circumstance we might want to set the SD_SHARE_POWERDOMAIN flag at
>> the NUMA domain and spread the load if it favours the workload.
>
> So Intel has so far not said a lot of sensible things about power
> management on their multi-socket platform.
>
> And I've not heard anything at all from IBM on the POWER chips.
>
> What I know from the Intel side is that packet idle hardly saves
> anything when compared to the DRAM power and the cost of having to do
> remote memory accesses.
>
> In other words, I'm not at all considering power aware scheduling for
> NUMA systems until someone starts talking sense :-)
>
On Power8 systems, most of the cpuidle power management is done at the
core level. This is expected to yield good power savings without much
loss of performance, because the exit latency from these idle states is
small and so is the overhead of re-initializing the cores.
Idle power management at the node level, however, could hurt
performance even though it yields good power savings: re-initializing
the node can carry significant overhead, and the exit latency from such
idle states is large.
Therefore we would try to consolidate load onto cores, rather than onto
nodes, as far as possible, so as to leave as many cores idle as we can.
Even then, consolidation onto a core should stop at 3-4 threads. With 8
threads in a core, running just one thread would hardly do justice to
the core's resources, while running the core at full throttle would hurt
performance. Hence a fine balance could be struck by consolidating load
onto a minimum number of threads.
*Consolidating load onto cores and spreading the load across nodes* would
probably help memory-intensive workloads finish faster, thanks to less
contention on local node memory, and so get the cores to idle sooner.
Thanks
Regards
Preeti U Murthy
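
The placement policy described above can be sketched roughly as follows. This
is only an illustration of the idea, not how the kernel's load balancer works:
place_tasks() and its parameters are hypothetical, and the cap of 4 threads
per core is taken from the "3-4 threads" figure in the mail. Each core is
filled up to the cap before another core is used, and successive cores are
taken from the nodes in turn so that the load spreads across nodes.

```python
def place_tasks(n_tasks, n_nodes, cores_per_node, threads_cap=4):
    """Return {(node, core): task_count}.

    Cores are filled up to threads_cap, and each newly used core is taken
    from the next node in turn.
    """
    capacity = n_nodes * cores_per_node * threads_cap
    if n_tasks > capacity:
        raise ValueError("more tasks than the capped capacity")

    placement = {}
    remaining = n_tasks
    core_idx = 0
    while remaining > 0:
        node = core_idx % n_nodes    # alternate between nodes
        core = core_idx // n_nodes   # next unused core on that node
        take = min(threads_cap, remaining)
        placement[(node, core)] = take
        remaining -= take
        core_idx += 1
    return placement


# 10 tasks on 2 nodes with 4 cores each.
example = place_tasks(n_tasks=10, n_nodes=2, cores_per_node=4)
```

In this example only three of the eight cores are used, two on node 0 and one
on node 1, and the remaining five stay idle.
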
On 01/07/2014 06:01 PM, Vincent Guittot wrote:
> On 7 January 2014 11:39, Preeti U Murthy wrote:
>> On 01/07/2014 03:20 PM, Peter Zijlstra wrote:
>>> On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
>>>> What if we want to add arch specific flags to the NUMA domain? Currently
>>>> with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
>>>> the arch can modify the sd flags of the topology levels till just before
>>>> the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
>>>> initialized. We need to perhaps call into arch here to probe for
>>>> additional flags?
>>>
>>> What are you thinking of? I was hoping all NUMA details were captured in
>>> the distance table.
>>>
>>> Its far easier to talk of specifics in this case.
>>>
>> If the processor can be core gated, then there is very little power
>> savings that we could yield from consolidating all the load onto a
>> single node in a NUMA domain. 6 cores on one node or 3 cores each on two
>> nodes, the power is drawn by 6 cores in all. So I was thinking under
>> this circumstance we might want to set the SD_SHARE_POWERDOMAIN flag at
>> the NUMA domain and spread the load if it favours the workload.
>
> The policy of keeping the tasks running on cores that are close (same
> node) to the memory, is the more power efficient, isn't it ? so it's
> probably more about where to place the memory than about where to
> place the tasks ?
Yes, this is another point. One of the reasons we try to consolidate
load onto cores is that on Power8 systems most of the power management is
at the core level. Node-level cpuidle states are usually entered only on
fully idle systems, because of the overhead involved in exiting them,
as I mentioned in my other reply in this thread.
Another argument against node-level idle states, which could for
instance include flushing a large shared cache, is that if we
consolidate the load onto nodes we must also consolidate the memory pages
at the same time. Otherwise performance will be hurt far more by
re-fetching the flushed pages than it would be with core-level idle
management.
Core-level idle power management could include flushing the L2 cache,
which is still acceptable for performance because re-fetching pages into
this cache has relatively low overhead and, depending on the arch, the
power savings could be worth that overhead.
Thanks
Regards
Preeti U Murthy
>
> Vincent
>
>>
>> Regards
>> Preeti U Murthy
>>
>