2015-04-01 06:28:24

by Preeti U Murthy

Subject: Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs

On 03/31/2015 11:00 PM, Jason Low wrote:
> On Tue, 2015-03-31 at 14:28 +0530, Preeti U Murthy wrote:
>
>> Morten,
>
>> I am a bit confused about the problem you are pointing to.
>
>> I am unable to see the issue. What is it that I am missing ?
>
> Hi Preeti,
>
> Here is one of the potential issues that have been described from my
> understanding.
>
> Consider situations where there are just a few tasks to pull (for
> example, there's only 1 task to move).
>
> Before, if CPU 1 calls run_rebalance_domains(), we pull the task to
> CPU 1 (which is already awake) and run the task there.
>
> Now, we pull the task to some idle CPU 2 and wake up CPU 2 in order
> for the task to run. Meanwhile, CPU 1 may go idle instead of running
> the task, even though it was already awake.
>

Alright, I see. But it is one additional wake up, and the wake up will
be within the cluster. We will not wake up any CPU in the neighboring
cluster unless there are tasks to be pulled. So in the scenario
described we may wake a core out of a deep idle state, but never a
whole cluster. In terms of energy efficiency, this is not so bad a
scenario, is it?

Regards
Preeti U Murthy


2015-04-01 13:03:17

by Morten Rasmussen

Subject: Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs

Hi Preeti and Jason,

On Wed, Apr 01, 2015 at 07:28:03AM +0100, Preeti U Murthy wrote:
> On 03/31/2015 11:00 PM, Jason Low wrote:
> > On Tue, 2015-03-31 at 14:28 +0530, Preeti U Murthy wrote:
> >
> >> Morten,
> >
> >> I am a bit confused about the problem you are pointing to.
> >
> >> I am unable to see the issue. What is it that I am missing ?
> >
> > Hi Preeti,
> >
> > Here is one of the potential issues that have been described from my
> > understanding.
> >
> > Consider situations where there are just a few tasks to pull (for
> > example, there's only 1 task to move).
> >
> > Before, if CPU 1 calls run_rebalance_domains(), we pull the task to
> > CPU 1 (which is already awake) and run the task there.
> >
> > Now, we pull the task to some idle CPU 2 and wake up CPU 2 in order
> > for the task to run. Meanwhile, CPU 1 may go idle instead of running
> > the task, even though it was already awake.

Yes. This is the scenario I had in mind although I might have failed to
make it crystal clear in my earlier replies.

> Alright, I see. But it is one additional wake up, and the wake up will
> be within the cluster. We will not wake up any CPU in the neighboring
> cluster unless there are tasks to be pulled. So in the scenario
> described we may wake a core out of a deep idle state, but never a
> whole cluster. In terms of energy efficiency, this is not so bad a
> scenario, is it?

After Peter pointed out that it shouldn't happen across clusters due to
group_classify()/sg_capacity_factor() it isn't as bad as I initially
thought. It is still not an ideal solution I think. Wake-ups aren't nice
for battery-powered devices. Waking up a cpu in an already active
cluster may still imply powering up the core and bringing the L1 cache
into a usable state, but it isn't as bad as waking up a cluster. I would
prefer to avoid it if we can.

Thinking more about it, don't we also risk doing a lot of iterations
in nohz_idle_balance() that lead to nothing (pure overhead) in certain
corner cases? Say find_new_ilb() picks the last cpu in the cluster,
and we have one task for each cpu in the cluster but one cpu currently
has two. Don't we end up trying all the nohz-idle cpus before giving
up and balancing the balancer cpu itself? On big machines, going
through all of them could take a while, I think. No?

Morten

2015-04-02 00:57:11

by Jason Low

Subject: Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs

On Wed, 2015-04-01 at 14:03 +0100, Morten Rasmussen wrote:

Hi Morten,

> > Alright, I see. But it is one additional wake up, and the wake up will
> > be within the cluster. We will not wake up any CPU in the neighboring
> > cluster unless there are tasks to be pulled. So in the scenario
> > described we may wake a core out of a deep idle state, but never a
> > whole cluster. In terms of energy efficiency, this is not so bad a
> > scenario, is it?
>
> After Peter pointed out that it shouldn't happen across clusters due to
> group_classify()/sg_capacity_factor() it isn't as bad as I initially
> thought. It is still not an ideal solution I think. Wake-ups aren't nice
> for battery-powered devices. Waking up a cpu in an already active
> cluster may still imply powering up the core and bringing the L1 cache
> into a usable state, but it isn't as bad as waking up a cluster. I would
> prefer to avoid it if we can.

Right. I still think that the patch is justified if it addresses the 10
second latency issue, but if we could find a better solution, that would
be great :)

> Thinking more about it, don't we also risk doing a lot of iterations
> in nohz_idle_balance() that lead to nothing (pure overhead) in certain
> corner cases? Say find_new_ilb() picks the last cpu in the cluster,
> and we have one task for each cpu in the cluster but one cpu currently
> has two. Don't we end up trying all the nohz-idle cpus before giving
> up and balancing the balancer cpu itself? On big machines, going
> through all of them could take a while, I think. No?

Iterating through many CPUs could take a while, but since we only do
nohz_idle_balance() when the CPU is idle, and we exit the loop as soon
as need_resched() is set, we're only iterating when there is nothing
else that needs to run.

Also, we only attempt balancing when time_after_eq(jiffies,
rq->next_balance) is true, so much of the time we don't actually
traverse all the CPUs.
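
For reference, here is a rough sketch of the per-cpu loop in
nohz_idle_balance() as I read it in kernel/sched/fair.c (simplified,
so details may differ slightly from the actual code):

for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
	if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
		continue;

	/*
	 * If this cpu gets work to do, stop balancing on behalf of
	 * the other idle cpus; the next ilb owner picks it up.
	 */
	if (need_resched())
		break;

	rq = cpu_rq(balance_cpu);

	/* Only do the expensive part when the balance is due. */
	if (time_after_eq(jiffies, rq->next_balance)) {
		raw_spin_lock_irq(&rq->lock);
		update_rq_clock(rq);
		update_idle_cpu_load(rq);
		raw_spin_unlock_irq(&rq->lock);
		rebalance_domains(rq, CPU_IDLE);
	}
}

The need_resched() check bounds how long we keep going once there is
real work for this cpu, and the time_after_eq() check skips cpus whose
balance interval hasn't expired yet.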

So this may not be too big of an issue.

2015-04-02 03:22:54

by Preeti U Murthy

Subject: Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs

Hi Morten,

On 04/01/2015 06:33 PM, Morten Rasmussen wrote:
>> Alright, I see. But it is one additional wake up, and the wake up will
>> be within the cluster. We will not wake up any CPU in the neighboring
>> cluster unless there are tasks to be pulled. So in the scenario
>> described we may wake a core out of a deep idle state, but never a
>> whole cluster. In terms of energy efficiency, this is not so bad a
>> scenario, is it?
>
> After Peter pointed out that it shouldn't happen across clusters due to
> group_classify()/sg_capacity_factor() it isn't as bad as I initially
> thought. It is still not an ideal solution I think. Wake-ups aren't nice
> for battery-powered devices. Waking up a cpu in an already active
> cluster may still imply powering up the core and bringing the L1 cache
> into a usable state, but it isn't as bad as waking up a cluster. I would
> prefer to avoid it if we can.
>
> Thinking more about it, don't we also risk doing a lot of iterations
> in nohz_idle_balance() that lead to nothing (pure overhead) in certain
> corner cases? Say find_new_ilb() picks the last cpu in the cluster,
> and we have one task for each cpu in the cluster but one cpu currently
> has two. Don't we end up trying all the nohz-idle cpus before giving
> up and balancing the balancer cpu itself? On big machines, going
> through all of them could take a while, I think. No?

The balancer CPU will iterate as long as need_resched() is not set on
it. It may take a while, but if the balancer CPU has nothing else to
do, the iteration does not come at a cost to performance.

Besides this, load balancing itself is optimized in terms of who does
it and how often. The candidate CPUs for load balancing are the first
idle CPUs in a given group, so nohz idle load balancing may abort on
some of the idle CPUs. If the CPUs on our left have not managed to
pull tasks, we abort load balancing too. This saves time.
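
The "first idle cpu" rule comes from should_we_balance(); roughly, as
I recall it from fair.c (a simplified sketch, details may vary):

static int should_we_balance(struct lb_env *env)
{
	struct sched_group *sg = env->sd->groups;
	int cpu, balance_cpu = -1;

	/* Newly idle cpus are always allowed to balance. */
	if (env->idle == CPU_NEWLY_IDLE)
		return 1;

	/* Try to find the first idle cpu in the balance group. */
	for_each_cpu_and(cpu, sched_group_cpus(sg), env->cpus) {
		if (!idle_cpu(cpu))
			continue;
		balance_cpu = cpu;
		break;
	}

	/* No idle cpu: fall back to the group's balance cpu. */
	if (balance_cpu == -1)
		balance_cpu = group_balance_cpu(sg);

	/* Only that cpu balances at this domain and above. */
	return balance_cpu == env->cpu;
}

Every other idle cpu bails out of rebalance_domains() early for that
domain.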

Load balancing on bigger sched domains is spaced out in time too. The
min_interval is equal to sd_weight and the balance_interval can grow
as large as 2*sd_weight. This should ensure that load balancing across
large scheduling domains is not carried out too often. nohz idle load
balancing may therefore not go through the entire scheduling domain
hierarchy for each CPU. This will cut down on the time too.
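
For reference, those defaults come from sd_init() (sketched from
memory, with the surrounding fields elided):

	*sd = (struct sched_domain){
		/* sd_weight = number of cpus spanned by this level */
		.min_interval		= sd_weight,
		.max_interval		= 2*sd_weight,
		/* ... */
		.balance_interval	= sd_weight,
		/* ... */
	};

and, as I recall, load_balance() doubles sd->balance_interval on a
failed attempt only while it is below max_interval, so wider domains
are rebalanced less often.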

Regards
Preeti U Murthy
>
> Morten
>