2006-11-07 07:33:43

by Ingo Molnar

Subject: Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree


* [email protected] <[email protected]> wrote:

> +DECLARE_TASKLET(rebalance, &rebalance_domains, 0L);

i'm not sure i get the point of this whole do-rebalance-in-tasklet idea.
A tasklet is global to the system. The rebalance tick was per-CPU. This
is not an equivalent change at all. What am i missing?

Ingo


2006-11-07 17:44:39

by Christoph Lameter

Subject: Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

On Tue, 7 Nov 2006, Ingo Molnar wrote:

> i'm not sure i get the point of this whole do-rebalance-in-tasklet idea.
> A tasklet is global to the system. The rebalance tick was per-CPU. This
> is not an equivalent change at all. What am i missing?

A tasklet runs per cpu. In many ways it is equivalent to an interrupt
context, except that interrupts are enabled.


2006-11-07 17:53:39

by Suresh Siddha

Subject: Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

On Tue, Nov 07, 2006 at 09:44:11AM -0800, Christoph Lameter wrote:
> On Tue, 7 Nov 2006, Ingo Molnar wrote:
>
> > i'm not sure i get the point of this whole do-rebalance-in-tasklet idea.
> > A tasklet is global to the system. The rebalance tick was per-CPU. This
> > is not an equivalent change at all. What am i missing?
>
> A tasklet runs per cpu. In many ways it is equivalent to an interrupt
> context, except that interrupts are enabled.

Christoph, the DECLARE_TASKLET that you had at least needs to be per cpu..
Not sure if there are any other concerns.

thanks,
suresh

2006-11-07 17:56:17

by Christoph Lameter

Subject: Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

On Tue, 7 Nov 2006, Siddha, Suresh B wrote:

> Christoph, the DECLARE_TASKLET that you had at least needs to be per cpu..
> Not sure if there are any other concerns.

Nope. Tasklets are scheduled and executed per cpu. These are the former
bottom halves. See tasklet_schedule in kernel/softirq.c


2006-11-07 18:13:14

by Suresh Siddha

Subject: Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

On Tue, Nov 07, 2006 at 09:55:54AM -0800, Christoph Lameter wrote:
> On Tue, 7 Nov 2006, Siddha, Suresh B wrote:
>
> > Christoph, the DECLARE_TASKLET that you had at least needs to be per cpu..
> > Not sure if there are any other concerns.
>
> Nope. Tasklets are scheduled and executed per cpu. These are the former
> bottom halves. See tasklet_schedule in kernel/softirq.c

You must be referring to __tasklet_schedule() ?

tasklet_schedule doesn't schedule if there is already one scheduled.

no?

thanks,
suresh
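
(For reference, the guard Suresh is pointing at looks roughly like this in
include/linux/interrupt.h of that era; paraphrased from memory, details may
differ slightly:)

/* The tasklet is only queued if it is not already marked as scheduled, so a
 * second tasklet_schedule() while the first one is still pending is a no-op. */
static inline void tasklet_schedule(struct tasklet_struct *t)
{
	if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
		__tasklet_schedule(t);	/* queue on this CPU's tasklet list
					 * and raise TASKLET_SOFTIRQ */
}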

2006-11-07 18:19:10

by Christoph Lameter

Subject: Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

On Tue, 7 Nov 2006, Siddha, Suresh B wrote:

> tasklet_schedule doesn't schedule if there is already one scheduled.

Right.

/* Tasklets --- multithreaded analogue of BHs.

Main feature differing them of generic softirqs: tasklet
is running only on one CPU simultaneously.

Main feature differing them of BHs: different tasklets
may be run simultaneously on different CPUs.

Properties:
* If tasklet_schedule() is called, then tasklet is guaranteed
to be executed on some cpu at least once after this.

^^^^ Crap. Not equivalent. Must be guaranteed to run on the same cpu.

* If the tasklet is already scheduled, but its excecution is still not
started, it will be executed only once.
* If this tasklet is already running on another CPU (or schedule is called
from tasklet itself), it is rescheduled for later.
* Tasklet is strictly serialized wrt itself, but not
wrt another tasklets. If client needs some intertask synchronization,
he makes it with spinlocks.
*/

2006-11-07 19:17:44

by Christoph Lameter

Subject: Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

Tasklets are scheduled on the same cpu that triggered the tasklet. They
are only moved to another processor if that processor goes down. So that
aspect is fine. We just need a tasklet struct per cpu.



Use a per cpu tasklet to schedule rebalancing

Turns out that tasklets have a flag that allows only one instance to run
across all processors. So we need a tasklet structure for each processor.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.19-rc4-mm2/kernel/sched.c
===================================================================
--- linux-2.6.19-rc4-mm2.orig/kernel/sched.c 2006-11-06 13:58:38.000000000 -0600
+++ linux-2.6.19-rc4-mm2/kernel/sched.c 2006-11-07 13:05:56.236343144 -0600
@@ -2918,7 +2918,8 @@ static void rebalance_domains(unsigned l
this_rq->next_balance = next_balance;
}

-DECLARE_TASKLET(rebalance, &rebalance_domains, 0L);
+static DEFINE_PER_CPU(struct tasklet_struct, rebalance) =
+ { NULL, 0, ATOMIC_INIT(0), rebalance_domains, 0L };
#else
/*
* on UP we do not need to balance between CPUs:
@@ -3171,7 +3172,7 @@ void scheduler_tick(void)
#ifdef CONFIG_SMP
update_load(rq);
if (time_after_eq(jiffies, rq->next_balance))
- tasklet_schedule(&rebalance);
+ tasklet_schedule(&__get_cpu_var(rebalance));
#endif
}


2006-11-07 19:23:05

by Christoph Lameter

Subject: Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

On Tue, 7 Nov 2006, Siddha, Suresh B wrote:

> tasklet_schedule doesn't schedule if there is already one scheduled.

Correct. The effect of this is to only allow load balancing on one cpu at a
time. Subsequent attempts to schedule load balancing on other cpus are
ignored until the cpu that is load balancing has finished. Then the
others (hopefully) get a turn.

A pretty interesting unintended effect. It certainly solves the concurrent
load balancing scalability issues and would avoid the need to stagger
load balancing. Wonder how fair it would be?

2006-11-07 20:30:55

by Ingo Molnar

Subject: Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree


* Christoph Lameter <[email protected]> wrote:

> On Tue, 7 Nov 2006, Siddha, Suresh B wrote:
>
> > Christoph, the DECLARE_TASKLET that you had at least needs to be per
> > cpu.. Not sure if there are any other concerns.
>
> Nope. Tasklets are scheduled and executed per cpu. These are the former
> bottom halves. See tasklet_schedule in kernel/softirq.c

no, they are "task"-lets. Softirqs are the equivalent of former
bottom-halves.

a tasklet may run on any CPU but its execution is globally serialized -
it's not what we want to do in the scheduler.

Ingo

2006-11-07 20:32:46

by Ingo Molnar

Subject: Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree


* Christoph Lameter <[email protected]> wrote:

> Tasklets are scheduled on the same cpu that triggered the tasklet.
> They are only moved to another processor if that processor goes down. So
> that aspect is fine. We just need a tasklet struct per cpu.
>
> Use a per cpu tasklet to schedule rebalancing
>
> Turns out that tasklets have a flag that allows only one instance to
> run across all processors. So we need a tasklet structure for each
> processor.

Per-CPU tasklets are equivalent to softirqs, with extra complexity and
overhead ontop of it :-)

so please just introduce a rebalance softirq and attach the scheduling
rebalance tick to it. But i'd suggest to re-test on the 4096-CPU box,
maybe what 'fixed' your workload was the global serialization of the
tasklet. With a per-CPU softirq approach we are i think back to the same
situation that broke your system before.

Ingo
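
(A minimal sketch of the softirq approach suggested above, for illustration
only: SCHED_SOFTIRQ, run_rebalance_domains and init_sched_softirq are assumed
names rather than code from the thread, and open_softirq() in 2.6.19-era
kernels also carries a data argument.)

/* Hypothetical dedicated rebalance softirq; softirqs run per cpu, on the
 * cpu that raised them, with interrupts enabled. */
static void run_rebalance_domains(struct softirq_action *h)
{
	rebalance_domains(0L);		/* argument is the unused data word */
}

static void __init init_sched_softirq(void)
{
	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains, NULL);
}

/* and in scheduler_tick(), instead of tasklet_schedule(&rebalance): */
	if (time_after_eq(jiffies, rq->next_balance))
		raise_softirq(SCHED_SOFTIRQ);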

2006-11-07 20:59:15

by Christoph Lameter

Subject: Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

On Tue, 7 Nov 2006, Ingo Molnar wrote:

> Per-CPU tasklets are equivalent to softirqs, with extra complexity and
> overhead ontop of it :-)
>
> so please just introduce a rebalance softirq and attach the scheduling
> rebalance tick to it. But i'd suggest to re-test on the 4096-CPU box,
> maybe what 'fixed' your workload was the global serialization of the
> tasklet. With a per-CPU softirq approach we are i think back to the same
> situation that broke your system before.

What broke the system was the disabling of interrupts over long time
periods during load balancing.

2006-11-07 21:35:11

by Chen, Kenneth W

Subject: RE: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

Christoph Lameter wrote on Tuesday, November 07, 2006 12:59 PM
> On Tue, 7 Nov 2006, Ingo Molnar wrote:
>
> > Per-CPU tasklets are equivalent to softirqs, with extra complexity and
> > overhead ontop of it :-)
> >
> > so please just introduce a rebalance softirq and attach the scheduling
> > rebalance tick to it. But i'd suggest to re-test on the 4096-CPU box,
> > maybe what 'fixed' your workload was the global serialization of the
> > tasklet. With a per-CPU softirq approach we are i think back to the same
> > situation that broke your system before.
>
> What broke the system was the disabling of interrupts over long time
> periods during load balancing.


The previous global load balancing tasklet could be an interesting data point.
Do you see a lot of imbalance in the system with the global tasklet? Does it
take a prolonged interval to reach a balanced system from imbalance?

Conceptually, it doesn't make a lot of sense to serialize load balancing in the
system. But in practice, maybe it is good enough to cycle through each cpu
(I suppose it all depends on the user environment). This exercise certainly
makes me think about what the best algorithm is in terms of efficiency
and minimal latency to achieve maximum system throughput. Does the kernel really
have to do load balancing so aggressively in polling mode? Perhaps an event
driven mechanism is a better solution.

- Ken

2006-11-07 21:50:03

by Christoph Lameter

Subject: RE: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

On Tue, 7 Nov 2006, Chen, Kenneth W wrote:

> > What broke the system was the disabling of interrupts over long time
> > periods during load balancing.
> The previous global load balancing tasklet could be an interesting data point.

Yup seems also very interesting to me. We could drop the staggering code
f.e. if we would leave the patch as is. Maybe there are other ways to
optimize the code because we know that there are no concurrent
balance_tick() functions running.

> Do you see a lot of imbalance in the system with the global tasklet? Does it
> take a prolonged interval to reach a balanced system from imbalance?

I am rather surprised that I did not see any problems but I think we would
need some more testing. It seems that having only one load balance
running at one time speeds up load balancing in general since there is
less lock contention.

Timer interrupts are staggered already. If the tasklet finishes soon then
we will not have a problem and load balancing runs as before. If the
tasklet performance gets throttled by something then it is good to
serialize load balancing since we already have memory contention.
Performing concurrent scans of runqueues (as is now) would increase the
problem.

Before we saw multiple load balance actions ongoing at the same time which
caused node overload because multiple cache line refetches from remote
nodes occurred. I think it was mostly node 0 that got overloaded so that
load balancing performance dropped significantly.

> Conceptually, it doesn't make a lot of sense to serialize load balancing in the
> system. But in practice, maybe it is good enough to cycle through each cpu
> (I suppose it all depends on the user environment). This exercise certainly
> makes me think about what the best algorithm is in terms of efficiency
> and minimal latency to achieve maximum system throughput. Does the kernel really
> have to do load balancing so aggressively in polling mode? Perhaps an event
> driven mechanism is a better solution.

There is no other way to do load balancing since we have to calculate the
load over many processors and then figure out the one with the most load.

2006-11-10 06:18:18

by Chen, Kenneth W

Subject: RE: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

Christoph Lameter wrote on Tuesday, November 07, 2006 1:50 PM
> On Tue, 7 Nov 2006, Chen, Kenneth W wrote:
> > > What broke the system was the disabling of interrupts over long time
> > > periods during load balancing.
> > The previous global load balancing tasklet could be an interesting data point.
>
> Yup seems also very interesting to me. We could drop the staggering code
> f.e. if we would leave the patch as is. Maybe there are other ways to
> optimize the code because we know that there are no concurrent
> balance_tick() functions running.
>
> > Do you see a lot of imbalance in the system with the global tasklet? Does it
> > take a prolonged interval to reach a balanced system from imbalance?
>
> I am rather surprised that I did not see any problems but I think we would
> need some more testing. It seems that having only one load balance
> running at one time speeds up load balancing in general since there is
> less lock contention.


I ran the majority of micro-benchmarks from the LKP project with the global
load balance tasklet. (http://kernel-perf.sourceforge.net)

Result is here:
http://kernel-perf.sourceforge.net/sched/global-load-bal.txt

All results are within the noise range. The global tasklet does a fairly good
job, especially on context-switch intensive workloads like aim7, volanomark,
tbench, etc. Note all machines are non-NUMA platforms.

Based on the data, I think we should make the load balance tasklet one per numa
node instead of one per CPU.

- Ken

2006-11-10 18:50:34

by Christoph Lameter

Subject: RE: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

On Thu, 9 Nov 2006, Chen, Kenneth W wrote:

> All results are within the noise range. The global tasklet does a fairly
> good job, especially on context-switch intensive workloads like aim7,
> volanomark, tbench, etc. Note all machines are non-NUMA platforms.

> Based on the data, I think we should make the load balance tasklet one per numa
> node instead of one per CPU.

I have done some testing on NUMA with 8p / 4n and 256 p / 128n which seems
to indicate that this is doing very well. Having a global tasklet avoids
concurrent load balancing from multiple nodes which smoothes out the load
balancing load over all nodes in a NUMA system. It fixes the issues that
we saw with contention during concurrent load balancing.

On a 8p system:

I) Percent of ticks where load balancing was found to be required

II) Percent of ticks where we attempted load balancing
but we found that we need to try again due to load balancing
in progress elsewhere (This increases (I) since we found that
load balancing was required but we decided to defer. Tasklet
was not invoked).

I) II)
Boot: 70% ~1%
AIM7: 30% 2%
Idle: 50% <0.5%

256p:
I) II)
Boot: 80% 30%
AIM7: 90% 30%
Idle: 95% 30%

So on larger systems we have more attempts to concurrently execute the
tasklet, which also inflates I). I would expect that rate to rise
further on even larger systems, until we likely reach a point of continuous
load balancing somewhere in the system at 1024p (it will take some time to get
to such a system). It is likely preferable at that scale to still not have
concurrency. Concurrency may mean that we concurrently balance huge sched
domains, which causes long delays on multiple processors and may cause
contention between those load balancing threads that attempt to move tasks
from the same busy processor.

Suresh noted at the Intel OS forum yesterday that we may be able to avoid
running load balancing of the larger sched domains from all processors.
That would reduce the overhead of scheduling and reduce the percentage of
time that load balancing is active even on very large systems.


2006-11-10 21:39:48

by Ingo Molnar

Subject: Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree


* Christoph Lameter <[email protected]> wrote:

> I have done some testing on NUMA with 8p / 4n and 256 p / 128n which
> seems to indicate that this is doing very well. Having a global
> tasklet avoids concurrent load balancing from multiple nodes which
> smoothes out the load balancing load over all nodes in a NUMA system.
> It fixes the issues that we saw with contention during concurrent load
> balancing.

ok, that's what i suspected - what made the difference wasnt the fact
that it was moved out of irqs-off section, but that it was running
globally, instead of in parallel on every cpu. I have no conceptual
problem with single-threading the more invasive load-balancing bits.
(since it has to touch every runqueue anyway there's probably little
parallelism possible) But it's a scary change nevertheless, it
materially affects every SMP system's balancing characteristics.

Also, tasklets are a pretty bad choice for scalable stuff - it's
basically a shared resource between all CPUs and the tasklet-running
cacheline will bounce between all the CPUs. Also, a tasklet can be
'reactivated', which means that a CPU will do more rebalancing even though
it's /another/ CPU that wanted it.

So could you try another implementation perhaps, which would match the
'single thread of rebalancing' model better? For example, try a global
spinlock that is trylocked upon rebalance. If the trylock succeeds then
do the rebalance and unlock. If the trylock fails, just skip the
rebalance. (this roughly matches the tasklet logic but has lower
overhead and doesnt cause false, repeat rebalancing)

Ingo
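
(A minimal sketch of the trylock idea described above; the 'balancing' lock
name and the surrounding function are illustrative assumptions, not an actual
patch:)

/* One system-wide lock: whoever gets it does the rebalance, everyone else
 * simply skips this tick instead of queueing more work elsewhere. */
static DEFINE_SPINLOCK(balancing);

static void rebalance_tick(void)
{
	if (!spin_trylock(&balancing))
		return;			/* another cpu is rebalancing: skip */

	rebalance_domains(0L);		/* assumed signature, as in the tasklet variant */
	spin_unlock(&balancing);
}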

2006-11-10 21:42:44

by Ingo Molnar

Subject: Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree


* Christoph Lameter <[email protected]> wrote:

> On a 8p system:
>
> I) Percent of ticks where load balancing was found to be required
>
> II) Percent of ticks where we attempted load balancing
> but we found that we need to try again due to load balancing
> in progress elsewhere (This increases (I) since we found that
> load balancing was required but we decided to defer. Tasklet
> was not invoked).
>
> I) II)
> Boot: 70% ~1%
> AIM7: 30% 2%
> Idle: 50% <0.5%
>
> 256p:
> I) II)
> Boot: 80% 30%
> AIM7: 90% 30%
> Idle: 95% 30%

nice measurements and interesting results.

note that with a tasklet a 'retry' will often be done on the /same/ CPU
that was running the tasklet when we tried to schedule it. I.e. such a
'collision' will not only result in the 'loss' of the local rebalance
event, but also cause /another/ rebalance event on the remote CPU.

so a better model would be the trylock model i suggested in the previous
mail: to just lose the rebalance events upon collision and not cause
extra work on the remote CPU. I'd also suggest to keep the rebalancing
code under the irqs-off section, like it is currently - only do it
conditional on trylock success. Do you think that would work?

Ingo

2006-11-10 22:55:05

by Suresh Siddha

Subject: Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

On Fri, Nov 10, 2006 at 10:50:13AM -0800, Christoph Lameter wrote:
> Suresh noted at the Intel OS forum yesterday that we may be able to avoid
> running load balancing of the larger sched domains from all processors.
> That would reduce the overhead of scheduling and reduce the percentage of
> time that load balancing is active even on very large systems.

I brought this up at OLS with Nick too. What happens today is, at a particular
domain, each cpu in the sched group will do a load balance at the frequency
of balance_interval. The more cores and threads there are, the more cpus will be
in each sched group at the SMP and NUMA domains. And we end up spending
quite a bit of time doing load balancing in those domains.

One idea (old patch appended) is to make only one cpu in the sched group do
the load balance at that particular sched domain; this load will slowly
percolate down to the other cpus within that group (when they do load balancing
at lower domains).

Nick was suggesting another alternative, that we should increase the max
interval and probably make it dependent on the number of cpus that
share the group. Today for NUMA domain, we already set max_interval
based on online cpus. Perhaps we need to do that for other domains (SMP, MC)
too and increase the max interval on NUMA for big systems?

---

When busy, only the first cpu in the sched group will do the load balance.

--- linux-2.6.9/kernel/sched.c.1 2006-07-11 16:17:37.000000000 -0700
+++ linux-2.6.9/kernel/sched.c 2006-07-12 10:48:48.000000000 -0700
@@ -2329,12 +2329,15 @@
interval = 1;

if (j - sd->last_balance >= interval) {
- sd->last_balance += interval;
+ sd->last_balance = j;
if (load_balance(this_cpu, this_rq, sd, idle)) {
/* We've pulled tasks over so no longer idle */
idle = NOT_IDLE;
}
}
+
+ if (idle != IDLE && this_cpu != first_cpu(sd->span))
+ break;
}
}
#else

2006-11-11 01:01:55

by Christoph Lameter

Subject: Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

On Fri, 10 Nov 2006, Ingo Molnar wrote:

> ok, that's what i suspected - what made the difference wasnt the fact
> that it was moved out of irqs-off section, but that it was running
> globally, instead of in parallel on every cpu. I have no conceptual
> problem with single-threading the more invasive load-balancing bits.
> (since it has to touch every runqueue anyway there's probably little
> parallelism possible) But it's a scary change nevertheless, it
> materially affects every SMP system's balancing characteristics.

We saw multiple issues. The first we saw was interrupt holdoff related,
since IPIs took a long time to complete. The other was that multiple
load balance actions on multiple CPUs seemed to serialize on the locks,
each trying to move tasks off the same busy processor. So both had better be
addressed.

Load balancing for small domains runs faster, so there is less chance
of parallelism. It seems that the staggering of the timer interrupt is
sufficient on smaller systems to avoid concurrent load balancing
operations.

2006-11-11 01:14:12

by Christoph Lameter

Subject: Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

On Fri, 10 Nov 2006, Siddha, Suresh B wrote:

> Nick was suggesting another alternative, that we should increase the max
> interval and probably make it dependent on the number of cpus that
> share the group. Today for NUMA domain, we already set max_interval
> based on online cpus. Perhaps we need to do that for other domains (SMP, MC)
> too and increase the max interval on NUMA for big systems?

I have been trying to run with this patch. We increase the min interval so
that we can at least put one tick between each of the load balance actions
of each processor. The cpu_offset can be modified to:

/* Don't have all balancing operations going off at once: */
static inline unsigned long cpu_offset(int cpu)
{
	if (NR_CPUS < HZ)
		return jiffies + cpu * HZ / NR_CPUS;
	else
		return jiffies + cpu;
}

But exclusion in load balancing would take care of the above.




Index: linux-2.6.19-rc4-mm2/include/linux/topology.h
===================================================================
--- linux-2.6.19-rc4-mm2.orig/include/linux/topology.h 2006-11-07 16:43:43.084665449 -0600
+++ linux-2.6.19-rc4-mm2/include/linux/topology.h 2006-11-07 16:55:23.027245226 -0600
@@ -182,7 +182,7 @@
.parent = NULL, \
.child = NULL, \
.groups = NULL, \
- .min_interval = 64, \
+ .min_interval = max(64, jiffies_to_msec(num_online_cpus())),\
.max_interval = 64*num_online_cpus(), \
.busy_factor = 128, \
.imbalance_pct = 133, \

2006-11-11 02:05:30

by Chen, Kenneth W

Subject: RE: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

Christoph Lameter wrote on Friday, November 10, 2006 5:01 PM
> On Fri, 10 Nov 2006, Ingo Molnar wrote:
>
> > ok, that's what i suspected - what made the difference wasnt the fact
> > that it was moved out of irqs-off section, but that it was running
> > globally, instead of in parallel on every cpu. I have no conceptual
> > problem with single-threading the more invasive load-balancing bits.
> > (since it has to touch every runqueue anyway there's probably little
> > parallelism possible) But it's a scary change nevertheless, it
> > materially affects every SMP system's balancing characteristics.
>
> We saw multiple issues. The first we saw was interrupt holdoff related,
> since IPIs took a long time to complete. The other was that multiple
> load balance actions on multiple CPUs seemed to serialize on the locks,
> each trying to move tasks off the same busy processor. So both had better be
> addressed.

So designating only one CPU within a domain to do load balancing between groups
for that specific sched domain should in theory fix the 2nd problem you
identified. Did you get a chance to look at the patch Suresh posted?

2006-11-11 02:51:20

by Christoph Lameter

Subject: RE: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

On Fri, 10 Nov 2006, Chen, Kenneth W wrote:

> So designating only one CPU within a domain to do load balancing between groups
> for that specific sched domain should in theory fix the 2nd problem you
> identified. Did you get a chance to look at the patch Suresh posted?

Yes, I am still thinking about how this would work. This would mean that
the first cpu on a system would move busy processes to itself from all
1023 other processors? That does not sound appealing.

Processes in the allnodes domain would move to processor #0. The next
lower layer is the node groups. Not all of the groups at that layer
contain processor #0. So we would never move processes into those sched
domains?

2006-11-13 04:03:57

by Chen, Kenneth W

Subject: RE: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

Christoph Lameter wrote on Friday, November 10, 2006 6:51 PM
> On Fri, 10 Nov 2006, Chen, Kenneth W wrote:
>
> > So designate only one CPU within a domain to do load balance between groups
> > for that specific sched domain should in theory fix the 2nd problem you
> > identified. Did you get a chance to look at the patch Suresh posted?
>
> Yes, I am still thinking about how this would work. This would mean that
> the first cpu on a system would move busy processes to itself from all
> 1023 other processors? That does not sound appealing.
>
> Processes in the allnodes domain would move to processor #0. The next
> lower layer is the node groups. Not all of the groups at that layer
> contain processor #0. So we would never move processes into those sched
> domains?

Not really, someone needs to put a bloody comment in that patch ;-)

The key is that load balancing scans from the lowest SMT domain and traverses
upwards to the numa allnodes domain. The CPU check and break is placed at
the end of the for loop so it is effectively shorting out the i+1 iteration.
And given the fact that sd->span in iteration (i) is the same as sd->groups
in iteration (i+1), the patch effectively designates the first CPU (hard-coded)
in a group within a domain to perform load balancing in that domain. All
other groups within a domain will still have at least one CPU assigned to
perform load balancing.

Though the original patch is not perfect:

(1) we should extend the logic to all rebalance ticks, not just the busy tick.
(2) we should initiate load balance within a domain only from the least
loaded group.
(3) the load scanning should be done only once per interval per domain.
Currently, it is scanning load for each CPU within a domain. Even with
the patch, the scanning is cut down to one per group. That is still
too much. A large system that has hundreds of groups in the numa allnodes
domain will end up scanning / calculating load over and over again. That
should be cut down as well.

Part of this problem probably stems from the fact that "load balance" is incapable
of performing l-d between an arbitrary pair of CPUs, and that it tightly ties the
load scan to the actual l-d action. And on top of that, l-d is really a pull
operation to the currently running CPU. All these limitations dictate that every
CPU somehow has to scan and pull. It is extremely inefficient on a large system.

IMO, we should decouple the load scan from the actual l-d action and, if possible,
make l-d capable of operating on an arbitrary pair of CPUs.

2006-11-13 05:45:32

by Christoph Lameter

Subject: RE: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

On Sun, 12 Nov 2006, Chen, Kenneth W wrote:

> The key is that load balancing scans from the lowest SMT domain and traverses
> upwards to the numa allnodes domain. The CPU check and break is placed at
> the end of the for loop so it is effectively shorting out the i+1 iteration.

It shorts out if the current cpu is not the first cpu of the span. But
at that point it has already done load balancing for the cpu that is not
the first cpu of the span!

> (1) we should extend the logic to all rebalance ticks, not just the busy tick.

Ok.

> (2) we should initiate load balance within a domain only from the least
> loaded group.

This would mean we would have to determine the least loaded group first.

> (3) the load scanning should be done only once per interval per domain.
> Currently, it is scanning load for each CPU within a domain. Even with
> the patch, the scanning is cut down to one per group. That is still
> too much. A large system that has hundreds of groups in the numa allnodes
> domain will end up scanning / calculating load over and over again. That
> should be cut down as well.

Yes, we want that. Maybe we could remember the load calculated in the
sched_group and use that for a certain time period? Load would only be
calculated once, and then all other processors would make their decisions
based on the cached load?
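
(A rough sketch of this "remember the group load for a while" idea; all names,
fields and the reuse interval below are hypothetical, not from any posted patch:)

/* Hypothetical cache of a sched group's aggregate load, so that only the
 * first cpu to look within an interval pays for the full scan. */
struct group_load_cache {
	unsigned long load;		/* last computed aggregate load */
	unsigned long valid_until;	/* jiffies until which it may be reused */
};

static unsigned long cached_group_load(struct sched_group *group,
				       struct group_load_cache *cache,
				       unsigned long (*scan)(struct sched_group *))
{
	if (time_before(jiffies, cache->valid_until))
		return cache->load;	/* reuse the recent value, no rescan */

	cache->load = scan(group);	/* expensive: walks the cpus in the group */
	cache->valid_until = jiffies + HZ / 100;	/* e.g. reuse for ~10ms */
	return cache->load;
}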

> Part of this problem probably stems from the fact that "load balance" is incapable
> of performing l-d between an arbitrary pair of CPUs, and that it tightly ties the
> load scan to the actual l-d action. And on top of that, l-d is really a pull
> operation to the currently running CPU. All these limitations dictate that every
> CPU somehow has to scan and pull. It is extremely inefficient on a large system.

Right. However, if we follow this line of thought then we will be
redesigning the load balancing logic.

2006-11-13 06:40:59

by Chen, Kenneth W

Subject: RE: + sched-use-tasklet-to-call-balancing.patch added to -mm tree

Christoph Lameter wrote on Sunday, November 12, 2006 9:45 PM
> > (2) we should initiate load balance within a domain only from the least
> > loaded group.
>
> This would mean we would have to determine the least loaded group first.

Well, find_busiest_group() scans every single bloody CPU in the system at
the highest sched_domain level. In fact, this function is capable of finding
the busiest group within a domain; it should be capable of determining the least
loaded group for free because it has already scanned every group within the domain.


> > Part of this problem probably stems from the fact that "load balance" is incapable
> > of performing l-d between an arbitrary pair of CPUs, and that it tightly ties the
> > load scan to the actual l-d action. And on top of that, l-d is really a pull
> > operation to the currently running CPU. All these limitations dictate that every
> > CPU somehow has to scan and pull. It is extremely inefficient on a large system.
>
> Right. However, if we follow this line of thought then we will be
> redesigning the load balancing logic.

It won't be a bad idea to redesign it ;-)

There are a number of other oddities in its design beside what was identified:

(1) several sched_groups are statically declared and they will reside on the
boot node. I would expect cross-node memory access to be expensive.
Every cpu will access these data structures repeatedly.

static struct sched_group sched_group_cpus[NR_CPUS];
static struct sched_group sched_group_core[NR_CPUS];
static struct sched_group sched_group_phys[NR_CPUS];

(2) load balance staggering. A number of people have pointed out that it is
overdone.

(3) The for_each_domain() loop in rebalance_tick() looks different from
idle_balance(): it will traverse all sched domains even if a lower
level domain succeeded in moving some tasks. I would expect us to either
break out of the for loop like idle_balance(), or somehow update the load
for the current CPU so it gets an accurate load value when doing l-d at the
next level. Currently, it is doing neither.

2006-11-14 01:29:59

by Suresh Siddha

Subject: [patch] sched domain: move sched group allocations to percpu area

Christoph, Can you please test this on your big systems and ACK it? Thanks.
---

Move the sched group allocations to percpu area. This will minimize cross
node memory references and also cleans up the sched groups allocation for
allnodes sched domain.

Signed-off-by: Suresh Siddha <[email protected]>
---

diff --git a/kernel/sched.c b/kernel/sched.c
index 3399701..088a63f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5511,28 +5511,27 @@ static int __init isolated_cpu_setup(cha
__setup ("isolcpus=", isolated_cpu_setup);

/*
- * init_sched_build_groups takes an array of groups, the cpumask we wish
- * to span, and a pointer to a function which identifies what group a CPU
- * belongs to. The return value of group_fn must be a valid index into the
- * groups[] array, and must be >= 0 and < NR_CPUS (due to the fact that we
- * keep track of groups covered with a cpumask_t).
+ * init_sched_build_groups takes the cpumask we wish to span, and a pointer
+ * to a function which identifies what group(along with sched group) a CPU
+ * belongs to. The return value of group_fn must be a >= 0 and < NR_CPUS
+ * (due to the fact that we keep track of groups covered with a cpumask_t).
*
* init_sched_build_groups will build a circular linked list of the groups
* covered by the given span, and will set each group's ->cpumask correctly,
* and ->cpu_power to 0.
*/
static void
-init_sched_build_groups(struct sched_group groups[], cpumask_t span,
- const cpumask_t *cpu_map,
- int (*group_fn)(int cpu, const cpumask_t *cpu_map))
+init_sched_build_groups(cpumask_t span, const cpumask_t *cpu_map,
+ int (*group_fn)(int cpu, const cpumask_t *cpu_map,
+ struct sched_group **sg))
{
struct sched_group *first = NULL, *last = NULL;
cpumask_t covered = CPU_MASK_NONE;
int i;

for_each_cpu_mask(i, span) {
- int group = group_fn(i, cpu_map);
- struct sched_group *sg = &groups[group];
+ struct sched_group *sg;
+ int group = group_fn(i, cpu_map, &sg);
int j;

if (cpu_isset(i, covered))
@@ -5542,7 +5541,7 @@ init_sched_build_groups(struct sched_gro
sg->cpu_power = 0;

for_each_cpu_mask(j, span) {
- if (group_fn(j, cpu_map) != group)
+ if (group_fn(j, cpu_map, NULL) != group)
continue;

cpu_set(j, covered);
@@ -6118,10 +6117,13 @@ int sched_smt_power_savings = 0, sched_m
*/
#ifdef CONFIG_SCHED_SMT
static DEFINE_PER_CPU(struct sched_domain, cpu_domains);
-static struct sched_group sched_group_cpus[NR_CPUS];
+static DEFINE_PER_CPU(struct sched_group, sched_group_cpus);

-static int cpu_to_cpu_group(int cpu, const cpumask_t *cpu_map)
+static int cpu_to_cpu_group(int cpu, const cpumask_t *cpu_map,
+ struct sched_group **sg)
{
+ if (sg)
+ *sg = &per_cpu(sched_group_cpus, cpu);
return cpu;
}
#endif
@@ -6131,39 +6133,52 @@ #endif
*/
#ifdef CONFIG_SCHED_MC
static DEFINE_PER_CPU(struct sched_domain, core_domains);
-static struct sched_group sched_group_core[NR_CPUS];
+static DEFINE_PER_CPU(struct sched_group, sched_group_core);
#endif

#if defined(CONFIG_SCHED_MC) && defined(CONFIG_SCHED_SMT)
-static int cpu_to_core_group(int cpu, const cpumask_t *cpu_map)
+static int cpu_to_core_group(int cpu, const cpumask_t *cpu_map,
+ struct sched_group **sg)
{
+ int group;
cpumask_t mask = cpu_sibling_map[cpu];
cpus_and(mask, mask, *cpu_map);
- return first_cpu(mask);
+ group = first_cpu(mask);
+ if (sg)
+ *sg = &per_cpu(sched_group_core, group);
+ return group;
}
#elif defined(CONFIG_SCHED_MC)
-static int cpu_to_core_group(int cpu, const cpumask_t *cpu_map)
+static int cpu_to_core_group(int cpu, const cpumask_t *cpu_map,
+ struct sched_group **sg)
{
+ if (sg)
+ *sg = &per_cpu(sched_group_core, cpu);
return cpu;
}
#endif

static DEFINE_PER_CPU(struct sched_domain, phys_domains);
-static struct sched_group sched_group_phys[NR_CPUS];
+static DEFINE_PER_CPU(struct sched_group, sched_group_phys);

-static int cpu_to_phys_group(int cpu, const cpumask_t *cpu_map)
+static int cpu_to_phys_group(int cpu, const cpumask_t *cpu_map,
+ struct sched_group **sg)
{
+ int group;
#ifdef CONFIG_SCHED_MC
cpumask_t mask = cpu_coregroup_map(cpu);
cpus_and(mask, mask, *cpu_map);
- return first_cpu(mask);
+ group = first_cpu(mask);
#elif defined(CONFIG_SCHED_SMT)
cpumask_t mask = cpu_sibling_map[cpu];
cpus_and(mask, mask, *cpu_map);
- return first_cpu(mask);
+ group = first_cpu(mask);
#else
- return cpu;
+ group = cpu;
#endif
+ if (sg)
+ *sg = &per_cpu(sched_group_phys, group);
+ return group;
}

#ifdef CONFIG_NUMA
@@ -6176,12 +6191,22 @@ static DEFINE_PER_CPU(struct sched_domai
static struct sched_group **sched_group_nodes_bycpu[NR_CPUS];

static DEFINE_PER_CPU(struct sched_domain, allnodes_domains);
-static struct sched_group *sched_group_allnodes_bycpu[NR_CPUS];
+static DEFINE_PER_CPU(struct sched_group, sched_group_allnodes);

-static int cpu_to_allnodes_group(int cpu, const cpumask_t *cpu_map)
+static int cpu_to_allnodes_group(int cpu, const cpumask_t *cpu_map,
+ struct sched_group **sg)
{
- return cpu_to_node(cpu);
+ cpumask_t nodemask = node_to_cpumask(cpu_to_node(cpu));
+ int group;
+
+ cpus_and(nodemask, nodemask, *cpu_map);
+ group = first_cpu(nodemask);
+
+ if (sg)
+ *sg = &per_cpu(sched_group_allnodes, group);
+ return group;
}
+
static void init_numa_sched_groups_power(struct sched_group *group_head)
{
struct sched_group *sg = group_head;
@@ -6217,16 +6242,9 @@ static void free_sched_groups(const cpum
int cpu, i;

for_each_cpu_mask(cpu, *cpu_map) {
- struct sched_group *sched_group_allnodes
- = sched_group_allnodes_bycpu[cpu];
struct sched_group **sched_group_nodes
= sched_group_nodes_bycpu[cpu];

- if (sched_group_allnodes) {
- kfree(sched_group_allnodes);
- sched_group_allnodes_bycpu[cpu] = NULL;
- }
-
if (!sched_group_nodes)
continue;

@@ -6320,7 +6338,7 @@ static int build_sched_domains(const cpu
struct sched_domain *sd;
#ifdef CONFIG_NUMA
struct sched_group **sched_group_nodes = NULL;
- struct sched_group *sched_group_allnodes = NULL;
+ int sd_allnodes = 0;

/*
* Allocate the per-node list of sched groups
@@ -6338,7 +6356,6 @@ #endif
* Set up domains for cpus specified by the cpu_map.
*/
for_each_cpu_mask(i, *cpu_map) {
- int group;
struct sched_domain *sd = NULL, *p;
cpumask_t nodemask = node_to_cpumask(cpu_to_node(i));

@@ -6347,26 +6364,12 @@ #endif
#ifdef CONFIG_NUMA
if (cpus_weight(*cpu_map)
> SD_NODES_PER_DOMAIN*cpus_weight(nodemask)) {
- if (!sched_group_allnodes) {
- sched_group_allnodes
- = kmalloc_node(sizeof(struct sched_group)
- * MAX_NUMNODES,
- GFP_KERNEL,
- cpu_to_node(i));
- if (!sched_group_allnodes) {
- printk(KERN_WARNING
- "Can not alloc allnodes sched group\n");
- goto error;
- }
- sched_group_allnodes_bycpu[i]
- = sched_group_allnodes;
- }
sd = &per_cpu(allnodes_domains, i);
*sd = SD_ALLNODES_INIT;
sd->span = *cpu_map;
- group = cpu_to_allnodes_group(i, cpu_map);
- sd->groups = &sched_group_allnodes[group];
+ cpu_to_allnodes_group(i, cpu_map, &sd->groups);
p = sd;
+ sd_allnodes = 1;
} else
p = NULL;

@@ -6381,36 +6384,33 @@ #endif

p = sd;
sd = &per_cpu(phys_domains, i);
- group = cpu_to_phys_group(i, cpu_map);
*sd = SD_CPU_INIT;
sd->span = nodemask;
sd->parent = p;
if (p)
p->child = sd;
- sd->groups = &sched_group_phys[group];
+ cpu_to_phys_group(i, cpu_map, &sd->groups);

#ifdef CONFIG_SCHED_MC
p = sd;
sd = &per_cpu(core_domains, i);
- group = cpu_to_core_group(i, cpu_map);
*sd = SD_MC_INIT;
sd->span = cpu_coregroup_map(i);
cpus_and(sd->span, sd->span, *cpu_map);
sd->parent = p;
p->child = sd;
- sd->groups = &sched_group_core[group];
+ cpu_to_core_group(i, cpu_map, &sd->groups);
#endif

#ifdef CONFIG_SCHED_SMT
p = sd;
sd = &per_cpu(cpu_domains, i);
- group = cpu_to_cpu_group(i, cpu_map);
*sd = SD_SIBLING_INIT;
sd->span = cpu_sibling_map[i];
cpus_and(sd->span, sd->span, *cpu_map);
sd->parent = p;
p->child = sd;
- sd->groups = &sched_group_cpus[group];
+ cpu_to_cpu_group(i, cpu_map, &sd->groups);
#endif
}

@@ -6422,8 +6422,7 @@ #ifdef CONFIG_SCHED_SMT
if (i != first_cpu(this_sibling_map))
continue;

- init_sched_build_groups(sched_group_cpus, this_sibling_map,
- cpu_map, &cpu_to_cpu_group);
+ init_sched_build_groups(this_sibling_map, cpu_map, &cpu_to_cpu_group);
}
#endif

@@ -6434,8 +6433,7 @@ #ifdef CONFIG_SCHED_MC
cpus_and(this_core_map, this_core_map, *cpu_map);
if (i != first_cpu(this_core_map))
continue;
- init_sched_build_groups(sched_group_core, this_core_map,
- cpu_map, &cpu_to_core_group);
+ init_sched_build_groups(this_core_map, cpu_map, &cpu_to_core_group);
}
#endif

@@ -6448,15 +6446,13 @@ #endif
if (cpus_empty(nodemask))
continue;

- init_sched_build_groups(sched_group_phys, nodemask,
- cpu_map, &cpu_to_phys_group);
+ init_sched_build_groups(nodemask, cpu_map, &cpu_to_phys_group);
}

#ifdef CONFIG_NUMA
/* Set up node groups */
- if (sched_group_allnodes)
- init_sched_build_groups(sched_group_allnodes, *cpu_map,
- cpu_map, &cpu_to_allnodes_group);
+ if (sd_allnodes)
+ init_sched_build_groups(*cpu_map, cpu_map, &cpu_to_allnodes_group);

for (i = 0; i < MAX_NUMNODES; i++) {
/* Set up node groups */
@@ -6548,10 +6544,10 @@ #ifdef CONFIG_NUMA
for (i = 0; i < MAX_NUMNODES; i++)
init_numa_sched_groups_power(sched_group_nodes[i]);

- if (sched_group_allnodes) {
- int group = cpu_to_allnodes_group(first_cpu(*cpu_map), cpu_map);
- struct sched_group *sg = &sched_group_allnodes[group];
+ if (sd_allnodes) {
+ struct sched_group *sg;

+ cpu_to_allnodes_group(first_cpu(*cpu_map), cpu_map, &sg);
init_numa_sched_groups_power(sg);
}
#endif

2006-11-14 08:24:40

by Ingo Molnar

Subject: Re: [patch] sched domain: move sched group allocations to percpu area


* Siddha, Suresh B <[email protected]> wrote:

> Move the sched group allocations to percpu area. This will minimize
> cross node memory references and also cleans up the sched groups
> allocation for allnodes sched domain.
>
> Signed-off-by: Suresh Siddha <[email protected]>

yeah, makes sense.

Acked-by: Ingo Molnar <[email protected]>

Ingo

2006-11-14 16:57:52

by Christoph Lameter

Subject: Re: [patch] sched domain: move sched group allocations to percpu area

On Mon, 13 Nov 2006, Siddha, Suresh B wrote:

> Christoph, Can you please test this on your big systems and ACK it? Thanks.

We also thought about fixing it this way. Thanks!

Acked-by: Christoph Lameter <[email protected]>