Andi,
Can you be more specific with "it doesn't load balance threads
aggressively enough"? Or what behavior of the base NUMA scheduler is
missing in the sched-domain scheduler especially for NUMA?
Jun
>-----Original Message-----
>From: Andi Kleen [mailto:[email protected]]
>Sent: Thursday, March 25, 2004 3:47 AM
>To: Rick Lindsley
>Cc: Andi Kleen; Ingo Molnar; [email protected]; [email protected];
>[email protected]; [email protected]; [email protected]; Nakajima, Jun;
>[email protected]; [email protected]; [email protected]
>Subject: Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
>
>On Thu, Mar 25, 2004 at 03:40:22AM -0800, Rick Lindsley wrote:
>> The main problem it has is that it performs quite badly on Opteron NUMA
>> e.g. in the OpenMP STREAM test (much worse than the normal scheduler)
>>
>> Andi, I've got some schedstat code which may help us to understand why.
>> I'll need to port it to Ingo's changes, but if I drop you a patch in a
>> day or two can you try your test on sched-domain/non-sched-domain,
>> collecting the stats?
>
>The openmp failure is already pretty well understood - it doesn't load
>balance threads aggressively enough over CPUs after startup.
>
>-Andi
On Thu, Mar 25, 2004 at 07:31:37AM -0800, Nakajima, Jun wrote:
> Andi,
>
> Can you be more specific with "it doesn't load balance threads
> aggressively enough"? Or what behavior of the base NUMA scheduler is
> missing in the sched-domain scheduler especially for NUMA?
It doesn't do load balancing in wake_up_forked_process() and is relatively
non-aggressive in balancing later. This leads to the multithreaded OpenMP
STREAM running its children first on the same node as the original process
and allocating memory there. Then later they run on a different node when
the balancing finally happens, but generate cross traffic to the old node,
instead of using the memory bandwidth of their local nodes.
The difference is very visible, even the 4-thread STREAM only sees the
bandwidth of a single node. With a more aggressive scheduler you get
4 times as much.
Admittedly it's a bit of a stupid benchmark, but it seems to be
representative of a lot of HPC codes.
-Andi
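[To make the failure mode Andi describes concrete, here is a minimal OpenMP
sketch of the pattern - not the actual STREAM source; the array size and the
triad-like loop are illustrative only.]

#include <stdlib.h>

#define N (8 * 1024 * 1024)	/* illustrative size, not STREAM's */

static double a[N], b[N], c[N];

int main(void)
{
	long i;

	/* First-touch initialization: each page is allocated on the node
	 * the touching thread happens to be running on.  If the freshly
	 * cloned workers are still sitting on the parent's node at this
	 * point, every page of a[], b[] and c[] lands on that one node. */
#pragma omp parallel for
	for (i = 0; i < N; i++) {
		a[i] = 1.0; b[i] = 2.0; c[i] = 0.0;
	}

	/* By the time the main loops run, the scheduler may have spread
	 * the threads across nodes, but the data stays where it was
	 * allocated - so the loop below is limited to one node's memory
	 * bandwidth plus interconnect traffic. */
#pragma omp parallel for
	for (i = 0; i < N; i++)
		c[i] = a[i] + 3.0 * b[i];	/* triad-like kernel */

	return 0;
}

[Built with an OpenMP-capable compiler; without OpenMP the pragmas are
ignored and the program simply runs single-threaded.]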
* Andi Kleen <[email protected]> wrote:
> It doesn't do load balancing in wake_up_forked_process() and is
> relatively non-aggressive in balancing later. This leads to the
> multithreaded OpenMP STREAM running its children first on the same node
> as the original process and allocating memory there. [...]
i believe the fix we want is to pre-balance the context at fork() time.
I've implemented this (which is basically just a reuse of
sched_balance_exec() in fork.c, and the related namespace cleanups),
could you give it a go:
http://redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc2-mm2-A5
another solution would be to add SD_BALANCE_FORK.
also, the best place to do fork() balancing is not at
wake_up_forked_process() time, but prior to doing the MM copy. This patch
does it there. At wakeup time we've already copied all the pagetables
and created tons of dirty cachelines.
Ingo
On Thu, 25 Mar 2004 20:09:45 +0100
Ingo Molnar <[email protected]> wrote:
> also, the best place to do fork() balancing is not at
> wake_up_forked_process() time, but prior to doing the MM copy. This patch
> does it there. At wakeup time we've already copied all the pagetables
> and created tons of dirty cachelines.
That won't help for threaded programs that use clone(). OpenMP is such a case.
-Andi
>> It doesn't do load balancing in wake_up_forked_process() and is
>> relatively non-aggressive in balancing later. This leads to the
>> multithreaded OpenMP STREAM running its children first on the same node
>> as the original process and allocating memory there. [...]
>
> i believe the fix we want is to pre-balance the context at fork() time.
> I've implemented this (which is basically just a reuse of
> sched_balance_exec() in fork.c, and the related namespace cleanups),
> could you give it a go:
>
> http://redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc2-mm2-A5
>
> another solution would be to add SD_BALANCE_FORK.
>
> also, the best place to do fork() balancing is not at
> wake_up_forked_process() time, but prior to doing the MM copy. This patch
> does it there. At wakeup time we've already copied all the pagetables
> and created tons of dirty cachelines.
How are you going to decide whether to rebalance at fork time or exec time?
Exec time balancing is a *lot* more efficient, it just doesn't work for
things that don't exec ... cloned threads would certainly be one case.
M.
* Andi Kleen <[email protected]> wrote:
> That won't help for threaded programs that use clone(). OpenMP is such
> a case.
yeah, agreed. Also, exec-balance, if applied to fork(), would migrate
the parent which is not what we want. We could perhaps migrate the
parent to the target CPU, copy the context, then migrate the parent back
to the original CPU ... but this sounds too complex.
Ingo
* Andi Kleen <[email protected]> wrote:
> That won't help for threaded programs that use clone(). OpenMP is such
> a case.
this patch:
redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc2-mm3-A4
does balancing at wake_up_forked_process()-time.
but it's a hard issue. Especially after fork() we do have a fair amount
of cache context, and migrating at this point can be bad for
performance.
Ingo
* Martin J. Bligh <[email protected]> wrote:
> Exec time balancing is a *lot* more efficient, it just doesn't work
> for things that don't exec ... cloned threads would certainly be one
> case.
yeah - exec-balancing is a clear thing. fork/clone time balancing is
a lot less clear.
Ingo
* Andi Kleen <[email protected]> wrote:
> It doesn't do load balancing in wake_up_forked_process() and is
> relatively non-aggressive in balancing later. This leads to the
> multithreaded OpenMP STREAM running its children first on the same node
> as the original process and allocating memory there. Then later they
> run on a different node when the balancing finally happens, but
> generate cross traffic to the old node, instead of using the memory
> bandwidth of their local nodes.
>
> The difference is very visible, even the 4-thread STREAM only sees the
> bandwidth of a single node. With a more aggressive scheduler you get 4
> times as much.
>
> Admittedly it's a bit of a stupid benchmark, but it seems to be
> representative of a lot of HPC codes.
There's no way the scheduler can figure out the scheduling and memory
use patterns of the new tasks in advance.
but userspace could give hints - e.g. a syscall that triggers a
rebalancing: sys_sched_load_balance(). This way userspace notifies the
scheduler that it is on 'zero ground' and that the scheduler can move it
to the least loaded cpu/node.
a variant of this is already possible, userspace can use setaffinity to
load-balance manually - but sched_load_balance() would be automatic.
Ingo
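[The manual setaffinity variant Ingo mentions looks roughly like the sketch
below: each worker pins itself to a CPU of its choosing before touching its
working set. The CPU numbering and thread count are made up, and the
three-argument sched_setaffinity() prototype is the later glibc wrapper;
older glibc versions used a different one.]

#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdio.h>

/* Each worker pins itself to "its" CPU before allocating or touching
 * memory, so first-touch pages end up on the right node without any
 * help from the scheduler's balancing. */
static void *worker(void *arg)
{
	int cpu = (int)(long)arg;
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	if (sched_setaffinity(0, sizeof(mask), &mask) < 0)
		perror("sched_setaffinity");

	/* ... allocate and first-touch this thread's arrays here ... */
	return NULL;
}

int main(void)
{
	pthread_t tid[4];
	long i;

	for (i = 0; i < 4; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	for (i = 0; i < 4; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

[Link with -lpthread.]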
There's no way the scheduler can figure out the scheduling and memory
use patterns of the new tasks in advance.
True. Four threads may want to stay on the same node because they are
sharing a lot of data and working on something in parallel, or they
may want to go to different nodes because the only thing they have in
common is a control structure that directs their (largely independent
but highly synchronized) efforts.
A while ago there was some effort at user-level page replication, which
meant you took a hit once but after that you'd effectively migrated a page
to your local memory. The longer you stayed put, the more local your
RSS got. I seem to recall some bugs or caveats, though. Anybody know
the state of that? It might take the burden off a scheduler using a
crystal ball and put it on a 20/20-hindsight VM system instead.
Rick
>> Exec time balancing is a *lot* more efficient, it just doesn't work
>> for things that don't exec ... cloned threads would certainly be one
>> case.
>
> yeah - exec-balancing is a clear thing. fork/clone time balancing is
> a lot less clear.
OK, well it *looks* to me from a quick look at your patch like
sched_balance_context will rebalance at both fork *and* exec time.
That seems like a bad plan, but maybe I'm misreading it.
Can we hold off on changing the fork/exec time balancing until we've
come to a plan as to what should actually be done with it? Unless we're
giving it some hint from userspace, it's frigging hard to be sure if
it's going to exec or not - and the vast majority of things do.
There was a really good reason why the code is currently set up that
way, it's not some random accident ;-)
Clone is a much more interesting case, though at the time, I consciously
decided NOT to do that, as we really mostly want threads on the same
node. The exception is the case where we have one app with lots of threads,
and nothing much else running on the system ... I tend to think of that
as an artificial benchmark situation, but maybe that's not fair. We
probably need to just do a more conservative version of the cross-node
rebalance at fork time.
M.
On Thursday 25 March 2004 15:59, Ingo Molnar wrote:
> * Andi Kleen <[email protected]> wrote:
> > It doesn't do load balancing in wake_up_forked_process() and is
> > relatively non-aggressive in balancing later. This leads to the
> > multithreaded OpenMP STREAM running its children first on the same node
> > as the original process and allocating memory there. Then later they
> > run on a different node when the balancing finally happens, but
> > generate cross traffic to the old node, instead of using the memory
> > bandwidth of their local nodes.
> >
> > The difference is very visible, even the 4-thread STREAM only sees the
> > bandwidth of a single node. With a more aggressive scheduler you get 4
> > times as much.
> >
> > Admittedly it's a bit of a stupid benchmark, but it seems to be
> > representative of a lot of HPC codes.
>
> There's no way the scheduler can figure out the scheduling and memory
> use patterns of the new tasks in advance.
>
> but userspace could give hints - e.g. a syscall that triggers a
> rebalancing: sys_sched_load_balance(). This way userspace notifies the
> scheduler that it is on 'zero ground' and that the scheduler can move it
> to the least loaded cpu/node.
>
> a variant of this is already possible, userspace can use setaffinity to
> load-balance manually - but sched_load_balance() would be automatic.
For Opteron, simply placing all cpus in the same sched domain may solve all of
this, since we will have the balancing frequency of the default scheduler. Is
there any reason this cannot be done for Opteron?
Also, I think Erich Focht had another patch which would allow much more
frequent node balancing if nr_cpus_node was 1.
> For Opteron, simply placing all cpus in the same sched domain may solve all of
> this, since we will have the balancing frequency of the default scheduler. Is
> there any reason this cannot be done for Opteron?
That seems like a good plan to me - they really don't want that cross-node
balancing. It might be cleaner to implement it by just tweaking the
cross-balance parameters for that system to have the same effect, but it
probably doesn't matter much (I'm thinking of some future case when they
decide to do multi-chip on die or SMT, so just keying off 1 cpu per node
doesn't really fix it).
M.
Andi Kleen wrote:
> On Thu, Mar 25, 2004 at 07:31:37AM -0800, Nakajima, Jun wrote:
>
>>Andi,
>>
>>Can you be more specific with "it doesn't load balance threads
>>aggressively enough"? Or what behavior of the base NUMA scheduler is
>>missing in the sched-domain scheduler especially for NUMA?
>
>
> It doesn't do load balancing in wake_up_forked_process() and is relatively
> non-aggressive in balancing later. This leads to the multithreaded OpenMP
> STREAM running its children first on the same node as the original process
> and allocating memory there. Then later they run on a different node when
> the balancing finally happens, but generate cross traffic to the old node,
> instead of using the memory bandwidth of their local nodes.
>
> The difference is very visible, even the 4-thread STREAM only sees the
> bandwidth of a single node. With a more aggressive scheduler you get
> 4 times as much.
>
> Admittedly it's a bit of a stupid benchmark, but it seems to be
> representative of a lot of HPC codes.
Hi Andi,
Sorry I keep telling you I'll work on this, but I never get
around to it. Mostly lack of hardware makes it difficult. I've
fixed a few bugs and some other workloads, so I keep hoping
that they will fix your problem :P
Your STREAM performance is really bad and I hope you don't
think I'm going to ignore it even if it is a bit stupid. Give
me a bit more time.
Of course, there is nothing fundamentally wrong with
sched-domains that is causing your problem. It can easily do
anything the old numa scheduler can do. It must be a bug or
some bad tuning somewhere.
Nick
On Thu, 25 Mar 2004 16:30:16 -0600
Andrew Theurer <[email protected]> wrote:
> For Opteron, simply placing all cpus in the same sched domain may solve all of
> this, since we will have the balancing frequency of the default scheduler. Is
> there any reason this cannot be done for Opteron?
Yes, that makes sense. I will try that
-Andi
On Thu, Mar 25, 2004 at 09:30:32PM +0100, Ingo Molnar wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > That won't help for threaded programs that use clone(). OpenMP is such
> > a case.
>
> this patch:
>
> redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc2-mm3-A4
>
> does balancing at wake_up_forked_process()-time.
>
> but it's a hard issue. Especially after fork() we do have a fair amount
> of cache context, and migrating at this point can be bad for
> performance.
I ported it by hand to the -mm4 scheduler now and tested it. While
it works marginally better than the standard -mm scheduler
(you get 1 1/2 times the bandwidth of one CPU instead of one) it's
still much worse than the optimum of nearly 4 CPUs achieved by
2.4 or the standard scheduler.
-Andi
I've got a web page up now on my home machine which shows data from
schedstats across the various flavors of 2.6.4 and 2.6.5-rc2 under
load from kernbench, SPECjbb, and SPECdet.
http://eaglet.rain.com/rick/linux/sched-domain/index.html
Two things that stand out are that sched-domains tends to call
load_balance() less frequently when it is idle and more frequently when
it is busy (as compared to the "standard" scheduler.) Another is that
even though it moves fewer tasks on average, the sched-domains code shows
about half of pull_task()'s work is coming from active_load_balance() ...
and that seems wrong. Could these be contributing to what you're seeing?
Rick
On Mon, 29 Mar 2004 02:20:58 -0800
Rick Lindsley <[email protected]> wrote:
> I've got a web page up now on my home machine which shows data from
> schedstats across the various flavors of 2.6.4 and 2.6.5-rc2 under
> load from kernbench, SPECjbb, and SPECdet.
>
> http://eaglet.rain.com/rick/linux/sched-domain/index.html
>
> Two things that stand out are that sched-domains tends to call
> load_balance() less frequently when it is idle and more frequently when
> it is busy (as compared to the "standard" scheduler.) Another is that
> even though it moves fewer tasks on average, the sched-domains code shows
> about half of pull_task()'s work is coming from active_load_balance() ...
> and that seems wrong. Could these be contributing to what you're seeing?
Sounds quite possible yes.
-Andi
Andi Kleen wrote:
> On Thu, Mar 25, 2004 at 09:30:32PM +0100, Ingo Molnar wrote:
>
>>* Andi Kleen <[email protected]> wrote:
>>
>>
>>>That won't help for threaded programs that use clone(). OpenMP is such
>>>a case.
>>
>>this patch:
>>
>> redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc2-mm3-A4
>>
>>does balancing at wake_up_forked_process()-time.
>>
>>but it's a hard issue. Especially after fork() we do have a fair amount
>>of cache context, and migrating at this point can be bad for
>>performance.
>
>
> I ported it by hand to the -mm4 scheduler now and tested it. While
> it works marginally better than the standard -mm scheduler
> (you get 1 1/2 times the bandwidth of one CPU instead of one) it's
> still much worse than the optimum of nearly 4 CPUs achieved by
> 2.4 or the standard scheduler.
>
OK there must be some pretty simple reason why this is happening.
I guess being OpenMP it is probably a bit complicated for you to
try your own scheduling in userspace using CPU affinities?
Otherwise could you trace what gets scheduled where for both
good and bad kernels? It should help us work out what is going
on.
I wonder if using one CPU from each quad of the NUMAQ would
give at all comparable behaviour...
If it isn't a big problem, could you test with -mm5 with the
generic sched domain? STREAM doesn't take long, does it?
I don't expect much difference, but the code is in flux while
Ingo and I try to sort things out.
On Mon, 29 Mar 2004 21:20:12 +1000
Nick Piggin <[email protected]> wrote:
> >
> > I ported it by hand to the -mm4 scheduler now and tested it. While
> > it works marginally better than the standard -mm scheduler
> > (you get 1 1/2 times the bandwidth of one CPU instead of one) it's
> > still much worse than the optimum of nearly 4 CPUs achieved by
> > 2.4 or the standard scheduler.
> >
>
Sorry ignore this report - I just found out I booted the wrong
kernel by mistake. Currently retesting, also with the proposed change
to only use a single scheduling domain.
-Andi
Rick Lindsley wrote:
> I've got a web page up now on my home machine which shows data from
> schedstats across the various flavors of 2.6.4 and 2.6.5-rc2 under
> load from kernbench, SPECjbb, and SPECdet.
>
> http://eaglet.rain.com/rick/linux/sched-domain/index.html
>
I can't see it
> Two things that stand out are that sched-domains tends to call
> load_balance() less frequently when it is idle and more frequently when
> it is busy (as compared to the "standard" scheduler.) Another is that
John Hawkes noticed problems here too. mm5 has a patch to
improve this for NUMA node balancing. No change on non-NUMA
though if that is what you were testing - we might need to
tune this a bit if it is hurting.
> even though it moves fewer tasks on average, the sched-domains code shows
> about half of pull_task()'s work is coming from active_load_balance() ...
Yeah this is wrong and shouldn't be happening. It would have been
due to a bug in the imbalance calculation which is now fixed.
* Andi Kleen <[email protected]> wrote:
> Sorry ignore this report - I just found out I booted the wrong kernel
> by mistake. Currently retesting, also with the proposed change to only
> use a single scheduling domain.
here are the items that are in the works:
redhat.com/~mingo/scheduler-patches/sched.patch
it's against 2.6.5-rc2-mm5. This patch also reduces the rate of active
balancing a bit.
Ingo
On Mon, 29 Mar 2004 13:46:35 +0200
Ingo Molnar <[email protected]> wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > Sorry ignore this report - I just found out I booted the wrong kernel
> > by mistake. Currently retesting, also with the proposed change to only
> > use a single scheduling domain.
>
> here are the items that are in the works:
>
> redhat.com/~mingo/scheduler-patches/sched.patch
I'm trying to, but -mm5 doesn't work at all on the 4 way machine.
It goes through the full boot up sequence, but then never opens a login
on the console and sshd also doesn't work.
Andrew, maybe that's related to your tty fixes?
-Andi
On Mon, 29 Mar 2004 09:03:01 +0200
Andi Kleen <[email protected]> wrote:
>
> I'm trying to, but -mm5 doesn't work at all on the 4 way machine.
> It goes through the full boot up sequence, but then never opens a login
> on the console and sshd also doesn't work.
>
> Andrew, maybe that's related to your tty fixes?
Reverting the two makes login work again
-Andi
Rick Lindsley wrote:
> I've got a web page up now on my home machine which shows data from
> schedstats across the various flavors of 2.6.4 and 2.6.5-rc2 under
> load from kernbench, SPECjbb, and SPECdet.
>
> http://eaglet.rain.com/rick/linux/sched-domain/index.html
>
I can't see it
Ack, sorry, wrong path. Hazards of typing at 3am .. should've used cut 'n'
paste ...
http://eaglet.rain.com/rick/linux/results/sched-domain/index.html
Rick
On Mon, 29 Mar 2004 13:46:35 +0200
Ingo Molnar <[email protected]> wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > Sorry ignore this report - I just found out I booted the wrong kernel
> > by mistake. Currently retesting, also with the proposed change to only
> > use a single scheduling domain.
>
> here are the items that are in the works:
>
> redhat.com/~mingo/scheduler-patches/sched.patch
>
> it's against 2.6.5-rc2-mm5. This patch also reduces the rate of active
> balancing a bit.
I applied only this patch and it did slightly better than the normal -mm*:
1.5-2x CPU bandwidth, but still far short of the 3.7x-4x that mainline
and 2.4 reach.
-Andi
Andi Kleen wrote:
> On Mon, 29 Mar 2004 13:46:35 +0200
> Ingo Molnar <[email protected]> wrote:
>
>
>>* Andi Kleen <[email protected]> wrote:
>>
>>
>>>Sorry ignore this report - I just found out I booted the wrong kernel
>>>by mistake. Currently retesting, also with the proposed change to only
>>>use a single scheduling domain.
>>
>>here are the items that are in the works:
>>
>> redhat.com/~mingo/scheduler-patches/sched.patch
>>
>>it's against 2.6.5-rc2-mm5. This patch also reduces the rate of active
>>balancing a bit.
>
>
> I applied only this patch and it did slightly better than the normal -mm*:
> 1.5-2x CPU bandwidth, but still far short of the 3.7x-4x that mainline
> and 2.4 reach.
So both -mm5 and Ingo's sched.patch are much worse than
what 2.4 and 2.6 get?
Rick Lindsley wrote:
> Rick Lindsley wrote:
> > I've got a web page up now on my home machine which shows data from
> > schedstats across the various flavors of 2.6.4 and 2.6.5-rc2 under
> > load from kernbench, SPECjbb, and SPECdet.
> >
> > http://eaglet.rain.com/rick/linux/sched-domain/index.html
> >
>
> I can't see it
>
> Ack, sorry, wrong path. Hazards of typing at 3am .. should've used cut 'n'
> paste ...
>
> http://eaglet.rain.com/rick/linux/results/sched-domain/index.html
>
Hi Rick,
This looks very cool. Very comprehensive. Have you got any
plans to integrate it with sched_domains (so for example,
you can see stats for each domain)?
I will have to have a look at the code, it should be useful
for testing.
Thanks
Nick
This looks very cool. Very comprehensive. Have you got any
plans to integrate it with sched_domains (so for example,
you can see stats for each domain)?
Yes -- ideally we can add some stats to domains too, so we can tell
(for example) how often it is adjusting rebalance intervals, or how many
processes are moved as a result of each domain's policy, etc. Every time
I add another counter I cringe a bit, because we don't want to impose
overhead in the scheduler. But so far, using per-cpu data, utilizing
runqueue locking when it's in use, and accepting minor inaccuracies that
may result from the remaining cases, seems to be yielding a pretty good
picture of things without imposing a measurable load.
If you want to start using it yourself, I'm open to feedback. I have patches
for major releases at
http://oss.software.ibm.com/linux/patches/?patch_id=730
and a host of smaller releases (like rc2-mm5) at eaglet:
http://eaglet.rain.com/rick/linux/schedstat/
If you're feeling *really* lucky I have a handful of useful but often
ungeneralized tools I can share, like the ones that made that web
page.
Rick
On Tue, 30 Mar 2004 09:51:46 +1000
Nick Piggin <[email protected]> wrote:
> So both -mm5 and Ingo's sched.patch are much worse than
> what 2.4 and 2.6 get?
Yes (2.6 vanilla and 2.4-aa at that, i haven't tested 2.4-vanilla)
Ingo's sched.patch makes it a bit better (from 1x CPU to 1.5-1.7xCPU), but still
much worse than the max of 3.7x-4x CPU bandwidth.
-Andi
* Andi Kleen <[email protected]> wrote:
> > So both -mm5 and Ingo's sched.patch are much worse than
> > what 2.4 and 2.6 get?
>
> Yes (2.6 vanilla and 2.4-aa at that, i haven't tested 2.4-vanilla)
>
> Ingo's sched.patch makes it a bit better (from 1x CPU to 1.5-1.7xCPU),
> but still much worse than the max of 3.7x-4x CPU bandwidth.
Andi, could you please try the patch below - this will test whether this
has to do with the rate of balancing between NUMA nodes. The patch
itself is not correct (it way overbalances on NUMA), but it tests the
theory.
Ingo
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -627,7 +627,7 @@ struct sched_domain {
.parent = NULL, \
.groups = NULL, \
.min_interval = 8, \
- .max_interval = 256*fls(num_online_cpus()),\
+ .max_interval = 8, \
.busy_factor = 8, \
.imbalance_pct = 125, \
.cache_hot_time = (10*1000000), \
Andi Kleen wrote:
> On Tue, 30 Mar 2004 09:51:46 +1000
> Nick Piggin <[email protected]> wrote:
>
>
>
>>So both -mm5 and Ingo's sched.patch are much worse than
>>what 2.4 and 2.6 get?
>
>
> Yes (2.6 vanilla and 2.4-aa at that, i haven't tested 2.4-vanilla)
>
> Ingo's sched.patch makes it a bit better (from 1x CPU to 1.5-1.7xCPU), but still
> much worse than the max of 3.7x-4x CPU bandwidth.
>
So it is very likely to be a case of the threads running too
long on one CPU before being balanced off, and faulting in
most of their working memory from one node, right?
I think it is impossible for the scheduler to correctly
identify this and implement the behaviour that OpenMP wants
without causing regressions on more general workloads
(Assuming this is the problem).
We are not going to go back to the wild balancing that
numasched does (I have some benchmarks where sched-domains
reduces cross node task movement by several orders of
magnitude). So the other option is to do balance on clone
across NUMA nodes, and make it very sensitive to imbalance.
Or probably better to make it easy to balance off to an idle
CPU, but much more difficult to balance off to a busy CPU.
I suspect this would still be a regression for other tests
though where thread creation is more frequent, threads share
working set more often, or the number of threads > the number
of CPUs.
On Tue, 30 Mar 2004 08:40:15 +0200
Ingo Molnar <[email protected]> wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > > So both -mm5 and Ingo's sched.patch are much worse than
> > > what 2.4 and 2.6 get?
> >
> > Yes (2.6 vanilla and 2.4-aa at that, i haven't tested 2.4-vanilla)
> >
> > Ingo's sched.patch makes it a bit better (from 1x CPU to 1.5-1.7xCPU),
> > but still much worse than the max of 3.7x-4x CPU bandwidth.
>
> Andi, could you please try the patch below - this will test whether this
> has to do with the rate of balancing between NUMA nodes. The patch
> itself is not correct (it way overbalances on NUMA), but it tests the
> theory.
This works much better, but wildly varying (my tests go from 2.8xCPU to
~3.8x CPU for 4 CPUs. 2,3 CPU cases are ok). A bit more consistent
results would be better though.
-Andi
Andi Kleen wrote:
> On Tue, 30 Mar 2004 08:40:15 +0200
> Ingo Molnar <[email protected]> wrote:
>
>
>>* Andi Kleen <[email protected]> wrote:
>>
>>
>>>>So both -mm5 and Ingo's sched.patch are much worse than
>>>>what 2.4 and 2.6 get?
>>>
>>>Yes (2.6 vanilla and 2.4-aa at that, i haven't tested 2.4-vanilla)
>>>
>>>Ingo's sched.patch makes it a bit better (from 1x CPU to 1.5-1.7xCPU),
>>>but still much worse than the max of 3.7x-4x CPU bandwidth.
>>
>>Andi, could you please try the patch below - this will test whether this
>>has to do with the rate of balancing between NUMA nodes. The patch
>>itself is not correct (it way overbalances on NUMA), but it tests the
>>theory.
>
>
> This works much better, but wildly varying (my tests go from 2.8xCPU to
> ~3.8x CPU for 4 CPUs. 2,3 CPU cases are ok). A bit more consistent
> results would be better though.
>
Oh good, thanks Ingo. Andi you probably want to lower your minimum
balance time too then, and maybe try with an even lower maximum.
Maybe reduce cache_hot_time a bit too.
* Andi Kleen <[email protected]> wrote:
> > Andi, could you please try the patch below - this will test whether this
> > has to do with the rate of balancing between NUMA nodes. The patch
> > itself is not correct (it way overbalances on NUMA), but it tests the
> > theory.
>
> This works much better, but wildly varying (my tests go from 2.8xCPU
> to ~3.8x CPU for 4 CPUs. 2,3 CPU cases are ok). A bit more consistent
> results would be better though.
ok, could you try min_interval, max_interval and busy_factor all with a
value of 4, in sched.h's SD_NODE_INIT template? (again, only for testing
purposes.)
Ingo
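[Concretely, the test Ingo asks for means the three fields in sched.h's
SD_NODE_INIT template end up reading as below, on top of the earlier
max_interval test patch - a sketch of the edit, not a posted patch; the
surrounding fields stay as in the hunk quoted above.]

	.min_interval		= 4,			\
	.max_interval		= 4,			\
	.busy_factor		= 4,			\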
> We are not going to go back to the wild balancing that
> numasched does (I have some benchmarks where sched-domains
> reduces cross node task movement by several orders of
> magnitude).
Agreed, I think that'd be a fatal mistake ...
> So the other option is to do balance on clone
> across NUMA nodes, and make it very sensitive to imbalance.
> Or probably better to make it easy to balance off to an idle
> CPU, but much more difficult to balance off to a busy CPU.
I think that's correct, but we need to be careful. We really, really
do want to try to keep threads on the same node *if* we have enough
processes around to keep the machine busy. Because we don't balance
on fork, we do a reasonable job of that today, but we should probably
be more reluctant on rebalance than we are.
It's when we have fewer processes than nodes that we want to spread things
around. That's a difficult balance to strike (and exactly why I wimped
out on it originally ;-)).
M.
On Tue, 30 Mar 2004 17:03:42 +1000
Nick Piggin <[email protected]> wrote:
>
> So it is very likely to be a case of the threads running too
> long on one CPU before being balanced off, and faulting in
> most of their working memory from one node, right?
Yes.
> I think it is impossible for the scheduler to correctly
> identify this and implement the behaviour that OpenMP wants
> without causing regressions on more general workloads
> (Assuming this is the problem).
Regression on what workload? The 2.4 kernel, which did the
early balancing, didn't seem to have problems.
I have a NUMA API for an application to select memory placement
manually, but it's unrealistic to expect all applications to use it,
so the scheduler has to provide at least a reasonable default.
In general on Opteron you want to go as quickly as possible
to your target node. Keeping things on the local node and hoping
that threads won't need to be balanced off is probably a loss.
It is quite possible that other systems have different requirements,
but I doubt there is a "one size fits all" requirement and
doing a custom domain setup or similar would be fine for me.
(or at least if sched domain cannot be tuned for Opteron then
it would have failed its promise of being a configurable scheduler)
> I suspect this would still be a regression for other tests
> though where thread creation is more frequent, threads share
> working set more often, or the number of threads > the number
> of CPUs.
I can try such tests if they're not too time consuming to set up.
What did you have in mind?
-Andi
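[The NUMA API Andi refers to is the userspace interface that shipped as
numactl/libnuma; a minimal sketch of explicit placement with it might look
like the following. The node number and allocation size are made up, and the
calls used are the published libnuma ones rather than anything from this
thread.]

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	int node = 1;			/* hypothetical target node */
	size_t len = 64 << 20;		/* 64 MB working set */
	double *a;
	size_t i;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support on this kernel\n");
		return 1;
	}

	/* Run this task on the chosen node and bind its memory there,
	 * so the data stays local regardless of where the task was
	 * started or later migrated from. */
	numa_run_on_node(node);
	a = numa_alloc_onnode(len, node);
	if (!a)
		return 1;

	for (i = 0; i < len / sizeof(*a); i++)
		a[i] = 1.0;		/* pages come from 'node' */

	numa_free(a, len);
	return 0;
}

[Link with -lnuma.]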
Andi Kleen wrote:
> On Tue, 30 Mar 2004 17:03:42 +1000
> Nick Piggin <[email protected]> wrote:
>
>
>>So it is very likely to be a case of the threads running too
>>long on one CPU before being balanced off, and faulting in
>>most of their working memory from one node, right?
>
>
> Yes.
>
>
>>I think it is impossible for the scheduler to correctly
>>identify this and implement the behaviour that OpenMP wants
>>without causing regressions on more general workloads
>>(Assuming this is the problem).
>
>
> Regression on what workload? The 2.4 kernel, which did the
> early balancing, didn't seem to have problems.
>
No, but hopefully sched domains balancing will do
better than the old numasched.
> I have a NUMA API for an application to select memory placement
> manually, but it's unrealistic to expect all applications to use it,
> so the scheduler has to provide at least a reasonable default.
>
> In general on Opteron you want to go as quickly as possible
> to your target node. Keeping things on the local node and hoping
> that threads won't need to be balanced off is probably a loss.
> It is quite possible that other systems have different requirements,
> but I doubt there is a "one size fits all" requirement and
> doing a custom domain setup or similar would be fine for me.
It is the same situation with all NUMA, obviously Opteron's
1 CPU per node means it is sensitive to node imbalances.
> (or at least if sched domain cannot be tuned for Opteron then
> it would have failed its promise of being a configurable scheduler)
>
Well it seems like Ingo is on to something. Phew! :)
>
>>I suspect this would still be a regression for other tests
>>though where thread creation is more frequent, threads share
>>working set more often, or the number of threads > the number
>>of CPUs.
>
>
> I can try such tests if they're not too time consuming to set up.
> What did you have in mind?
>
Not really sure. I guess probably most things that use a
lot of threads, maybe java, a web server using per connection
threads (if there is such a thing).
On the other hand though, maybe it will be a good idea if it
is done carefully...
Ingo Molnar wrote:
> * Andi Kleen <[email protected]> wrote:
>
>
>>>Andi, could you please try the patch below - this will test whether this
>>>has to do with the rate of balancing between NUMA nodes. The patch
>>>itself is not correct (it way overbalances on NUMA), but it tests the
>>>theory.
>>
>>This works much better, but wildly varying (my tests go from 2.8xCPU
>>to ~3.8x CPU for 4 CPUs. 2,3 CPU cases are ok). A bit more consistent
>>results would be better though.
>
>
> ok, could you try min_interval, max_interval and busy_factor all with a
> value of 4, in sched.h's SD_NODE_INIT template? (again, only for testing
> purposes.)
>
(sorry, forget what I said then, I'll leave it to Ingo)
Martin J. Bligh wrote:
>>We are not going to go back to the wild balancing that
>>numasched does (I have some benchmarks where sched-domains
>>reduces cross node task movement by several orders of
>>magnitude).
>
>
> Agreed, I think that'd be a fatal mistake ...
>
>
>>So the other option is to do balance on clone
>>across NUMA nodes, and make it very sensitive to imbalance.
>>Or probably better to make it easy to balance off to an idle
>>CPU, but much more difficult to balance off to a busy CPU.
>
>
> I think that's correct, but we need to be careful. We really, really
> do want to try to keep threads on the same node *if* we have enough
> processes around to keep the machine busy. Because we don't balance
> on fork, we make a reasonable job of that today, but we should probably
> be more reluctant on rebalance than we are.
>
> It's when we have less processes than nodes that we want to spread things
> around. That's a difficult balance to strike (and exactly why I wimped
> out on it originally ;-)).
>
Well NUMA balance on exec is obviously the right thing to do.
Maybe balance on clone would be beneficial if we only balance onto
CPUs which are idle or very very imbalanced. Basically, if you are
very sure that it is going to be balanced off anyway, it is probably
better to do it at clone.
> Well NUMA balance on exec is obviously the right thing to do.
>
> Maybe balance on clone would be beneficial if we only balance onto
> CPUs which are idle or very very imbalanced. Basically, if you are
> very sure that it is going to be balanced off anyway, it is probably
> better to do it at clone.
Yup ... sounds utterly sensible. But I think we need to make the current
balance favour grouping threads together on the same CPU/node more first
if possible ;-)
M.
> Regression on what workload? The 2.4 kernel, which did the
> early balancing, didn't seem to have problems.
well the hard balance is between a program that just splits off one
thread and has those 2 threads working closely together (in which case
you want the 2 threads to be together on the same quad in a quad-like
setup) and a program that splits off a thread and has the 2 threads
working basically entirely independently.
Benchmarks are typically of the latter kind... but real world
applications ???? The ones I can think of using threads are of the
former kind.
* Andi Kleen <[email protected]> wrote:
> This works much better, but wildly varying (my tests go from 2.8xCPU
> to ~3.8x CPU for 4 CPUs. 2,3 CPU cases are ok). A bit more consistent
> results would be better though.
i'm resurrecting the balance-on-clone patch i sent a couple of days ago.
I found at least one bug in it that might explain why it didn't work back
then. (also, the scheduler back then was too aggressive at migrating
tasks back.) Stay tuned.
Ingo
* Nick Piggin <[email protected]> wrote:
> >This works much better, but wildly varying (my tests go from 2.8xCPU to
> >~3.8x CPU for 4 CPUs. 2,3 CPU cases are ok). A bit more consistent
> >results would be better though.
>
> Oh good, thanks Ingo. Andi you probably want to lower your minimum
> balance time too then, and maybe try with an even lower maximum. Maybe
> reduce cache_hot_time a bit too.
i don't think we want to balance with that high a frequency on NUMA
Opteron. These tunes were for testing only.
i'm dusting off the balance-on-clone patch right now, that should be the
correct solution. It is based on a find_idlest_cpu() function which
searches for the least loaded CPU and checks whether we can do passive
load-balancing to it. Ie. it's yet another balancing point in the
scheduler, _not_ some balancing logic change.
Ingo
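[The selection rule Ingo describes - look for the least loaded CPU and prefer
the local one on a tie, as an extra balancing point rather than a logic
change - is easy to model outside the kernel. The sketch below is such a
userspace model with made-up run-queue lengths; it is not the actual
sched-domains patch.]

#include <stdio.h>
#include <limits.h>

/* Model of the find_idlest_cpu() idea: scan per-CPU run-queue lengths
 * and pick the least loaded CPU, falling back to the local CPU when
 * there is a tie, so nothing is migrated for no gain. */
static int find_idlest_cpu(const unsigned int *nr_running, int ncpus, int this_cpu)
{
	unsigned int min_load = UINT_MAX;
	int idlest = this_cpu;
	int cpu;

	for (cpu = 0; cpu < ncpus; cpu++) {
		unsigned int load = nr_running[cpu];

		if (load < min_load ||
		    (load == min_load && cpu == this_cpu)) {
			min_load = load;
			idlest = cpu;
		}
	}
	return idlest;
}

int main(void)
{
	/* hypothetical run-queue lengths for a 4-CPU box */
	unsigned int rq_len[4] = { 3, 1, 0, 2 };

	printf("a task cloned on CPU 0 would be placed on CPU %d\n",
	       find_idlest_cpu(rq_len, 4, 0));
	return 0;
}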
On Tue, 30 Mar 2004 09:15:19 +0200
Ingo Molnar <[email protected]> wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > > Andi, could you please try the patch below - this will test whether this
> > > has to do with the rate of balancing between NUMA nodes. The patch
> > > itself is not correct (it way overbalances on NUMA), but it tests the
> > > theory.
> >
> > This works much better, but wildly varying (my tests go from 2.8xCPU
> > to ~3.8x CPU for 4 CPUs. 2,3 CPU cases are ok). A bit more consistent
> > results would be better though.
>
> ok, could you try min_interval, max_interval and busy_factor all with a
> value of 4, in sched.h's SD_NODE_INIT template? (again, only for testing
> purposes.)
I kept the old patch and made these changes. The results are much more
consistent now, 3+x CPU. I still get variations of ~2GB/s, but I had this
with older kernels too.
-Andi
Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
>
>
>>>This works much better, but wildly varying (my tests go from 2.8xCPU to
>>>~3.8x CPU for 4 CPUs. 2,3 CPU cases are ok). A bit more consistent
>>>results would be better though.
>>
>>Oh good, thanks Ingo. Andi you probably want to lower your minimum
>>balance time too then, and maybe try with an even lower maximum. Maybe
>>reduce cache_hot_time a bit too.
>
>
> i don't think we want to balance with that high a frequency on NUMA
> Opteron. These tunes were for testing only.
>
I guess not. Andi says he wants it more like UMA balancing though...
> i'm dusting off the balance-on-clone patch right now, that should be the
> correct solution. It is based on a find_idlest_cpu() function which
> searches for the least loaded CPU and checks whether we can do passive
> load-balancing to it. Ie. it's yet another balancing point in the
> scheduler, _not_ some balancing logic change.
>
Yep, as I said to Martin, I also agree this is probably good if it
is done carefully. I think we'll need to get a horde of thread
benchmarking people together before turning it on by default, of
course.
It seems Andi can now get equivalent results without it, so it
isn't a pressing issue.
* Nick Piggin <[email protected]> wrote:
> Maybe balance on clone would be beneficial if we only balance onto
> CPUs which are idle or very very imbalanced. Basically, if you are
> very sure that it is going to be balanced off anyway, it is probably
> better to do it at clone.
balancing threads/processes is not a problem, as long as it happens
within the rules of normal balancing.
ie. 'new context created' (on exec, fork or clone) is just an event that
impacts the load scenario, and which might trigger rebalancing.
_if_ the sharing between various contexts is very high and it's actually
faster to run them all single-threaded, then the application writer can
bind them to one CPU, via the affinity syscalls. But the scheduler
cannot know this in advance.
so the cleanest assumption, from the POV of the scheduler, is that
there's no sharing between contexts. Things become really simple once
this assumption is made.
and frankly, it's much easier to argue with application developers whose
application scales badly and thus the scheduler over-distributes it,
than with application developers whose application scales badly due to
the scheduler.
Ingo
* Andi Kleen <[email protected]> wrote:
> > ok, could you try min_interval, max_interval and busy_factor all with a
> > value of 4, in sched.h's SD_NODE_INIT template? (again, only for testing
> > purposes.)
>
> I kept the old patch and made these changes. The results are much more
> consistent now, 3+x CPU. I still get variations of ~2GB/s, but I had
> this with older kernels too.
great.
now, could you try the following patch, against vanilla -mm5:
redhat.com/~mingo/scheduler-patches/sched2.patch
this includes 'context balancing' and doesn't touch the NUMA async
balancing tunables. Do you get better performance than with stock -mm5?
Ingo
Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
>
>
>>Maybe balance on clone would be beneficial if we only balance onto
>>CPUs which are idle or very very imbalanced. Basically, if you are
>>very sure that it is going to be balanced off anyway, it is probably
>>better to do it at clone.
>
>
> balancing threads/processes is not a problem, as long as it happens
> within the rules of normal balancing.
>
> ie. 'new context created' (on exec, fork or clone) is just an event that
> impacts the load scenario, and which might trigger rebalancing.
>
> _if_ the sharing between various contexts is very high and it's actually
> faster to run them all single-threaded, then the application writer can
> bind them to one CPU, via the affinity syscalls. But the scheduler
> cannot know this in advance.
>
> so the cleanest assumption, from the POV of the scheduler, is that
> there's no sharing between contexts. Things become really simple once
> this assumption is made.
>
> and frankly, it's much easier to argue with application developers whose
> application scales badly and thus the scheduler over-distributes it,
> than with application developers whose application scales badly due to
> the scheduler.
>
You're probably mostly right, but I really don't know if I'd
start with the assumption that threads don't share anything.
I think they're very likely to share memory and cache.
Also, these additional system wide balance points don't come
for free if you attach them to common operations (as opposed
to the slow periodic balancing).
find_best_cpu needs to pull down NR_CPUS remote (and probably
hot&dirty) cachelines, which can get expensive, for an
operation that you are very likely to be better off *without*
if your threads do share any memory.
On Thursday 25 March 2004 23:28, Martin J. Bligh wrote:
> Can we hold off on changing the fork/exec time balancing until we've
> come to a plan as to what should actually be done with it? Unless we're
> giving it some hint from userspace, it's frigging hard to be sure if
> it's going to exec or not - and the vast majority of things do.
After more than a year (or two?) of discussions there's no better idea
yet than giving a userspace hint. Default should be to balance at
exec(), and maybe use a syscall for saying: balance all children a
particular process is going to fork/clone at creation time. Everybody
reached the insight that we can't foresee what's optimal, so there is
only one solution: control the behavior. Give the user a tool to
improve the performance. Just a small inheritable variable in the task
structure is enough. Whether you give the hint at or before run-time
or even at compile-time is not really the point...
I don't think it's worth waiting and hoping that somebody shows up with
a magic algorithm which balances every kind of job optimally.
> There was a really good reason why the code is currently set up that
> way, it's not some random accident ;-)
The current code isn't a result of a big optimization effort, it's the
result of stripping stuff down to something which was acceptable at
all in the 2.6 feature freeze phase such that we get at least _some_
NUMA scheduler infrastructure. It was clear right from the beginning
that it has to be extended to really become useful.
> Clone is a much more interesting case, though at the time, I consciously
> decided NOT to do that, as we really mostly want threads on the same
> node.
That is not true in the case of HPC applications. And if someone uses
OpenMP he is just doing that kind of stuff. I consider STREAM a good
benchmark because it shows exactly the problem of HPC applications:
they need a lot of memory bandwidth, they don't run in cache and the
tasks live really long. Spreading those tasks across the nodes gives
me more bandwidth per task and I accumulate the positive effect
because the tasks run for hours or days. It's a simple and clear case
where the scheduler should be improved.
Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7
are not relevant for HPC. In a compute center it actually doesn't
matter much whether some shell command returns 10% faster, it just
shouldn't disturb my super simulation code for which I bought an
expensive NUMA box.
Regards,
Erich
* Nick Piggin <[email protected]> wrote:
> You're probably mostly right, but I really don't know if I'd start
> with the assumption that threads don't share anything. I think they're
> very likely to share memory and cache.
it all depends on the workload i guess, but generally if the application
scales well then the threads only share data in a read-mostly manner -
hence we can balance at creation time.
if the application does not scale well then balancing too early cannot
make the app perform much worse.
things like JVMs tend to want good balancing - they really are userspace
simulations of separate contexts with little sharing and good overall
scalability of the architecture.
> Also, these additional system wide balance points don't come for free
> if you attach them to common operations (as opposed to the slow
> periodic balancing).
yes, definitely.
the implementation in sched2.patch does not take this into account yet.
There are a number of things we can do about the 500 CPUs case. Eg. only
do the balance search towards the next N nodes/cpus (tunable via a
domain parameter).
Ingo
Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
>
>
>>You're probably mostly right, but I really don't know if I'd start
>>with the assumption that threads don't share anything. I think they're
>>very likely to share memory and cache.
>
>
> it all depends on the workload i guess, but generally if the application
> scales well then the threads only share data in a read-mostly manner -
> hence we can balance at creation time.
>
> if the application does not scale well then balancing too early cannot
> make the app perform much worse.
>
> things like JVMs tend to want good balancing - they really are userspace
> simulations of separate contexts with little sharing and good overall
> scalability of the architecture.
>
Well, it will be interesting to see how it goes. Unfortunately
I don't have a single realistic benchmark. In fact the only
threaded one I have is volanomark.
>
>>Also, these additional system wide balance points don't come for free
>>if you attach them to common operations (as opposed to the slow
>>periodic balancing).
>
>
> yes, definitely.
>
> the implementation in sched2.patch does not take this into account yet.
> There are a number of things we can do about the 500 CPUs case. Eg. only
> do the balance search towards the next N nodes/cpus (tunable via a
> domain parameter).
Yeah I think we shouldn't worry too much about the 500 CPUs
case, because they will obviously end up using their own
domains. But it is possible this would hurt smaller CPU
counts too. Again, it means testing.
I think we should probably aim to have a usable and decent
default domain for 32, maybe 64 CPUs, and not worry about
larger numbers too much if it would hurt lower end performance.
(please use [email protected])
Erich Focht wrote:
>On Thursday 25 March 2004 23:28, Martin J. Bligh wrote:
>
>>Can we hold off on changing the fork/exec time balancing until we've
>>come to a plan as to what should actually be done with it? Unless we're
>>giving it some hint from userspace, it's frigging hard to be sure if
>>it's going to exec or not - and the vast majority of things do.
>>
>
>After more than a year (or two?) of discussions there's no better idea
>yet than giving a userspace hint. Default should be to balance at
>exec(), and maybe use a syscall for saying: balance all children a
>particular process is going to fork/clone at creation time. Everybody
>reached the insight that we can't foresee what's optimal, so there is
>only one solution: control the behavior. Give the user a tool to
>improve the performance. Just a small inheritable variable in the task
>structure is enough. Whether you give the hint at or before run-time
>or even at compile-time is not really the point...
>
>I don't think it's worth waiting and hoping that somebody shows up with
>a magic algorithm which balances every kind of job optimally.
>
>
I'm with Martin here, we are just about to merge all this
sched-domains stuff. So we should at least wait until after
that. And of course, *nothing* gets changed without at least
one benchmark that shows it improves something. So far
nobody has come up to the plate with that.
>>There was a really good reason why the code is currently set up that
>>way, it's not some random accident ;-)
>>
>
>The current code isn't a result of a big optimization effort, it's the
>result of stripping stuff down to something which was acceptable at
>all in the 2.6 feature freeze phase such that we get at least _some_
>NUMA scheduler infrastructure. It was clear right from the beginning
>that it has to be extended to really become useful.
>
>
>>Clone is a much more interesting case, though at the time, I consciously
>>decided NOT to do that, as we really mostly want threads on the same
>>node.
>>
>
>That is not true in the case of HPC applications. And if someone uses
>OpenMP he is just doing that kind of stuff. I consider STREAM a good
>benchmark because it shows exactly the problem of HPC applications:
>they need a lot of memory bandwidth, they don't run in cache and the
>tasks live really long. Spreading those tasks across the nodes gives
>me more bandwidth per task and I accumulate the positive effect
>because the tasks run for hours or days. It's a simple and clear case
>where the scheduler should be improved.
>
>Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7
>are not relevant for HPC. In a compute center it actually doesn't
>matter much whether some shell command returns 10% faster, it just
>shouldn't disturb my super simulation code for which I bought an
>expensive NUMA box.
>
>
There are other things, like java, servers, etc. that use threads.
The point is that we have never had this before, and nobody
(until now) has been asking for it. And there are as yet no
convincing benchmarks that even show best case improvements. And
it could very easily have some bad cases. And finally, HPC
applications are the very ones that should be using CPU
affinities because they are usually tuned quite tightly to the
specific architecture.
Let's just make sure we don't change defaults without any
reason...
On Tue, 30 Mar 2004 10:18:40 +0200
Ingo Molnar <[email protected]> wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > > ok, could you try min_interval, max_interval and busy_factor all with a
> > > value of 4, in sched.h's SD_NODE_INIT template? (again, only for testing
> > > purposes.)
> >
> > I kept the old patch and made these changes. The results are much more
> > consistent now, 3+x CPU. I still get variations of ~2GB/s, but I had
> > this with older kernels too.
>
> great.
>
> now, could you try the following patch, against vanilla -mm5:
>
> redhat.com/~mingo/scheduler-patches/sched2.patch
>
> this includes 'context balancing' and doesnt touch the NUMA async
> balancing tunables. Do you get better performance than with stock -mm5?
I get better performance (roughly 2.1x CPU), but only about half the optimum.
-Andi
Hi Nick,
On Tuesday 30 March 2004 11:05, Nick Piggin wrote:
> >exec(), and maybe use a syscall for saying: balance all children a
> >particular process is going to fork/clone at creation time. Everybody
> >reached the insight that we can't foresee what's optimal, so there is
> >only one solution: control the behavior. Give the user a tool to
> >improve the performance. Just a small inheritable variable in the task
> >structure is enough. Whether you give the hint at or before run-time
> >or even at compile-time is not really the point...
> >
> >I don't think it's worth waiting and hoping that somebody shows up with
> >a magic algorithm which balances every kind of job optimally.
>
> I'm with Martin here, we are just about to merge all this
> sched-domains stuff. So we should at least wait until after
> that. And of course, *nothing* gets changed without at least
> one benchmark that shows it improves something. So far
> nobody has come up to the plate with that.
I thought you were talking the whole time about STREAM. That is THE
benchmark which shows you an impact of balancing at fork. And it is a
VERY relevant benchmark. Though you shouldn't run it on historical
machines like NUMAQ, no compute center in the western world will buy
NUMAQs for high performance... Andi typically runs STREAM on all CPUs
of a machine. Try on N/2 and N/4 and so on, you'll see the impact.
> >>Clone is a much more interesting case, though at the time, I consciously
> >>decided NOT to do that, as we really mostly want threads on the same
> >>node.
> >
> >That is not true in the case of HPC applications. And if someone uses
> >OpenMP he is just doing that kind of stuff. I consider STREAM a good
> >benchmark because it shows exactly the problem of HPC applications:
> >they need a lot of memory bandwidth, they don't run in cache and the
> >tasks live really long. Spreading those tasks across the nodes gives
> >me more bandwidth per task and I accumulate the positive effect
> >because the tasks run for hours or days. It's a simple and clear case
> >where the scheduler should be improved.
> >
> >Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7
> >are not relevant for HPC. In a compute center it actually doesn't
> >matter much whether some shell command returns 10% faster, it just
> >shouldn't disturb my super simulation code for which I bought an
> >expensive NUMA box.
>
> There are other things, like java, servers, etc. that use threads.
I'm just saying that you should have the choice. The default should be
as before, balance at exec().
> The point is that we have never had this before, and nobody
> (until now) has been asking for it. And there are as yet no
?? Sorry, I've had balance-at-fork since 2001 in the NEC IA64 NUMA
kernels and users use it intensively with OpenMP. Advertised it a lot,
asked for it, talked about it at the last OLS. Only IA64 was
considered rare big iron. I understand that the issue gets hotter if
the problem hurts on AMD64...
> convincing benchmarks that even show best case improvements. And
> it could very easily have some bad cases.
Again: I'm talking about having the choice. The user decides. Nothing
protects you against user stupidity, but if the only choice they have is
poor automatic initial scheduling, it's not enough. And: having the
fork/clone initial balancing policy means: you don't need to make your
code complicated and unportable by playing with setaffinity (which is
just plainly unusable when you share the machine with other users).
> And finally, HPC
> applications are the very ones that should be using CPU
> affinities because they are usually tuned quite tightly to the
> specific architecture.
There are companies mainly selling NUMA machines for HPC (SGI?), so
this is not a niche market. Clusters of big NUMA machines are not
unusual, and they're typically not used for databases but for HPC
apps. Unfortunately proprietary UNIX is still considered to have
better features than Linux for such configurations.
> Let's just make sure we don't change defaults without any
> reason...
No reason? Aaarghh... >;-)
Erich
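
(A purely illustrative aside on the "small inheritable variable in the
task structure" idea above: the toy C model below only sketches how an
inherited per-task policy could decide between today's exec()-time
balancing and balancing at fork/clone. None of the names -
balance_policy, BALANCE_AT_FORK, least_loaded_cpu - exist in any kernel
or patch discussed in this thread.)

/* Editorial sketch, not kernel code: models the idea of an inheritable
 * per-task variable that selects the initial balancing policy.  All
 * names are made up for illustration. */
#include <stdio.h>

enum balance_policy {
	BALANCE_AT_EXEC = 0,   /* today's default: spread tasks at exec() */
	BALANCE_AT_FORK = 1,   /* also spread at fork()/clone() time      */
};

struct task {
	enum balance_policy policy;  /* inherited from the parent on fork */
	int cpu;
};

/* Stand-in for a "least loaded allowed CPU" helper in the scheduler. */
static int least_loaded_cpu(void)
{
	return 3;  /* pretend CPU 3 (on a remote node) is idle */
}

/* What the fork/clone path would do before the child first runs. */
static struct task do_fork(const struct task *parent)
{
	struct task child = *parent;   /* policy is inherited */

	if (child.policy == BALANCE_AT_FORK)
		child.cpu = least_loaded_cpu();   /* place remotely now */
	/* else: keep parent->cpu, i.e. stay local as today */
	return child;
}

int main(void)
{
	struct task parent = { BALANCE_AT_FORK, 0 };
	struct task child = do_fork(&parent);

	printf("child starts on CPU %d\n", child.cpu);
	return 0;
}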
On Tue, 30 Mar 2004 12:04:13 +0200
Erich Focht <[email protected]> wrote:
Hallo Erich,
> On Tuesday 30 March 2004 11:05, Nick Piggin wrote:
> > >exec(), and maybe use a syscall for saying: balance all children a
> > >particular process is going to fork/clone at creation time. Everybody
> > >reached the insight that we can't foresee what's optimal, so there is
> > >only one solution: control the behavior. Give the user a tool to
> > >improve the performance. Just a small inheritable variable in the task
> > >structure is enough. Whether you give the hint at or before run-time
> > >or even at compile-time is not really the point...
> > >
> > >I don't think it's worth waiting and hoping that somebody shows up with
> > >a magic algorithm which balances every kind of job optimally.
> >
> > I'm with Martin here, we are just about to merge all this
> > sched-domains stuff. So we should at least wait until after
> > that. And of course, *nothing* gets changed without at least
> > one benchmark that shows it improves something. So far
> > nobody has come up to the plate with that.
>
> I thought you were talking the whole time about STREAM. That is THE
> benchmark which shows you the impact of balancing at fork. And it is a
> VERY relevant benchmark. Though you shouldn't run it on historical
> machines like NUMAQ; no compute center in the western world will buy
> NUMAQs for high performance... Andi typically runs STREAM on all CPUs
> of a machine. Try on N/2 and N/4 and so on, you'll see the impact.
Actually I run it on 1-4 CPUs (I don't have more to try), but didn't
always bother to report everything. With the default mm5 scheduler the
bandwidth for 1, 2, 3, and 4 CPUs is constantly that of a single CPU.
I agree with you that the "balancing on fork is bad" assumption is dubious
at best. For HPC it is definitely wrong, and for others it is unproven as well.
As I wrote earlier, our own results on HyperThreaded machines running 2.4
were similar. On HT at least, early balancing seems to be a win too -
which is obvious, because there is no cache cost to be paid when you move
between two virtual CPUs on the same core.
> > There are other things, like Java, servers, etc., that use threads.
>
> I'm just saying that you should have the choice. The default should be
> as before, balance at exec().
Choice is probably not bad, but a good default is important too.
I'm not really sure doing it by default would be such a bad idea.
A thread allocating some memory on its own is probably not that unusual,
even outside the HPC space. And on a NUMA system you want that memory
allocated on the thread's final node from the start.
-Andi
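
(Background aside on the "allocated on the final node" point: Linux
places pages on the node of the CPU that first touches them, so a
thread that initialises its own buffers gets node-local memory only if
it has already been moved to its final CPU. The sketch below
illustrates that first-touch pattern; the thread count and buffer size
are arbitrary.)

/* Sketch of the first-touch pattern: each worker allocates and
 * initialises its own buffer, so if the thread has already been
 * balanced to its final node, the pages land on that node too.
 * Compile with: cc -O2 -pthread first_touch.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 4
#define BUFSIZE  (64u * 1024u * 1024u)   /* 64 MB per worker, arbitrary */

static void *worker(void *arg)
{
	long id = (long)arg;

	/* First touch happens here: the memset faults the pages in on
	 * whichever node this thread is running on right now.  If the
	 * scheduler only migrates the thread later, the data stays on
	 * the original node and every access becomes remote. */
	char *buf = malloc(BUFSIZE);
	if (!buf)
		return NULL;
	memset(buf, id, BUFSIZE);

	/* ... STREAM-like work on buf would follow here ... */
	free(buf);
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	long i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);

	puts("done");
	return 0;
}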
Erich Focht <[email protected]> wrote:
>
> > And finally, HPC
> > applications are the very ones that should be using CPU
> > affinities because they are usually tuned quite tightly to the
> > specific architecture.
>
> There are companies mainly selling NUMA machines for HPC (SGI?), so
> this is not a niche market.
It is niche in terms of number of machines and in terms of affected users.
And the people who provide these machines have the resources to patch the
scheduler if needs be.
Correct me if I'm wrong, but what we have here is a situation where if we
design the scheduler around the HPC requirement, it will work poorly in a
significant number of other applications. And we don't see a way of fixing
this without either a /proc/i-am-doing-hpc, or a config option, or
requiring someone to carry an external patch, yes?
If so then all of those seem reasonable options to me. We should optimise
the scheduler for the common case, and that ain't HPC.
If we agree that architecturally sched-domains _can_ satisfy the HPC
requirement then I think that's good enough for now. I'd prefer that Ingo
and Nick not have to bust a gut trying to get optimum HPC performance
before the code is even merged up.
Do you agree that sched-domains is architected appropriately?
Erich Focht <[email protected]> wrote (on Tuesday, March 30, 2004 00:30:25 +0200):
> On Thursday 25 March 2004 23:28, Martin J. Bligh wrote:
>> Can we hold off on changing the fork/exec time balancing until we've
>> come to a plan as to what should actually be done with it? Unless we're
>> giving it some hint from userspace, it's frigging hard to be sure if
>> it's going to exec or not - and the vast majority of things do.
>
> After more than a year (or two?) of discussions there's no better idea
> yet than giving a userspace hint. Default should be to balance at
> exec(), and maybe use a syscall for saying: balance all children a
> particular process is going to fork/clone at creation time. Everybody
> reached the insight that we can't foresee what's optimal, so there is
> only one solution: control the behavior. Give the user a tool to
> improve the performance. Just a small inheritable variable in the task
> structure is enough. Whether you give the hint at or before run-time
> or even at compile-time is not really the point...
Agreed ... absolutely.
> I don't think it's worth waiting and hoping that somebody shows up with
> a magic algorithm which balances every kind of job optimally.
Especially as I don't believe that exists ;-) It's not deterministic.
>> Clone is a much more interesting case, though at the time, I consciously
>> decided NOT to do that, as we really mostly want threads on the same
>> node.
>
> That is not true in the case of HPC applications. And if someone uses
> OpenMP he is just doing that kind of stuff. I consider STREAM a good
> benchmark because it shows exactly the problem of HPC applications:
> they need a lot of memory bandwidth, they don't run in cache and the
> tasks live really long. Spreading those tasks across the nodes gives
> me more bandwidth per task and I accumulate the positive effect
> because the tasks run for hours or days. It's a simple and clear case
> where the scheduler should be improved.
>
> Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7
> are not relevant for HPC. In a compute center it actually doesn't
> matter much whether some shell command returns 10% faster, it just
> shouldn't disturb my super simulation code for which I bought an
> expensive NUMA box.
OK, but the scheduler can't know the difference automatically, I don't
think ... and whether we should tune the scheduler for "user work" or
HPC is going to be a hotly contested point ;-) We need to try to find
something that works for both. And suppose you have a 4 node system,
with 4 HPC apps running? Surely you want each app to have one node to
itself? That's more the case I'm worried about than "user work" vs HPC,
to be honest.
M.
> Well, it will be interesting to see how it goes. Unfortunately
> I don't have a single realistic benchmark.
That's OK, neither does anyone else ;-) OK, for HPC workloads they do,
but not for other stuff.
The closest I can come conceptually is to run multiple instances of a
Java benchmark in parallel. The existing ones all tend to be either 1
process with many threads, or many processes each with one thread. There are
no m x n benchmarks around that I've found, and that seems to be a lot more
like what the customers I've seen are interested in (throwing a DB,
webserver, java, etc all on one machine).
Making balance_on_fork a userspace hintable thing wouldn't hurt us at all
though, and would provide a great escape route for the HPC people.
Some simple pokeable in /proc would probably be sufficient. balance_on_clone
is harder, as whether you want to do it or not depends more on the state
of the rest of the system, which is very hard for userspace to know ...
M.
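
(Illustrative only: if such a "pokeable in /proc" existed, flipping it
from userspace would be as simple as the snippet below. The path
/proc/sys/kernel/sched_balance_fork is made up for the example and is
not provided by any kernel or patch mentioned in this thread.)

/* Sketch only: how a "simple pokeable in /proc" might be flipped from
 * userspace.  The path below is hypothetical. */
#include <stdio.h>

int main(void)
{
	const char *knob = "/proc/sys/kernel/sched_balance_fork"; /* hypothetical */
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror(knob);
		return 1;
	}
	fputs("1\n", f);   /* 1 = balance new tasks at fork/clone time */
	fclose(f);
	return 0;
}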
the latest scheduler patch, against 2.6.5-rc3-mm1, can be found at:
redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc3-mm1-A0
this includes:
- fork/clone-time balancing. It looks quite good here, but needs more
testing for impact.
- a minor fix for passive balancing. (calculating at a -1 load level
was not perfectly precise with a runqueue length of ~4 or longer.)
- use sync wakeups for parent-wakeup. This makes a single-task strace
execute on only one CPU on SMP, which is precisely what we want. It
should also be a speedup for a number of workloads where the parent
is actively wait4()-ing for the child to exit.
Ingo
Erich Focht wrote:
> Hi Nick,
>
Hi Erich,
> On Tuesday 30 March 2004 11:05, Nick Piggin wrote:
>
>>I'm with Martin here, we are just about to merge all this
>>sched-domains stuff. So we should at least wait until after
>>that. And of course, *nothing* gets changed without at least
>>one benchmark that shows it improves something. So far
>>nobody has come up to the plate with that.
>
>
> I thought you were talking the whole time about STREAM. That is THE
> benchmark which shows you the impact of balancing at fork. And it is a
> VERY relevant benchmark. Though you shouldn't run it on historical
> machines like NUMAQ; no compute center in the western world will buy
> NUMAQs for high performance... Andi typically runs STREAM on all CPUs
> of a machine. Try on N/2 and N/4 and so on, you'll see the impact.
>
Well yeah, but the immediate problem was that sched-domains was
*much* worse than 2.6's numasched, neither of which balance on
fork/clone. I didn't want to obscure the issue by implementing
balance on fork/clone until we worked out exactly the problem.
Anyway, once sched-domains goes in, you can basically do whatever
you like without impacting anyone else...
>>
>>There are other things, like Java, servers, etc., that use threads.
>
>
> I'm just saying that you should have the choice. The default should be
> as before, balance at exec().
>
Yeah well that is a very sane thing to do ;)
>
>>The point is that we have never had this before, and nobody
>>(until now) has been asking for it. And there are as yet no
>
>
> ?? Sorry, I've had balance at fork since 2001 in the NEC IA64 NUMA
> kernels, and users use it intensively with OpenMP. I advertised it a lot,
> asked for it, talked about it at the last OLS. But IA64 was
> considered rare big iron. I understand that the issue gets hotter if
> the problem hurts on AMD64...
>
Sorry, I hadn't realised. I guess because you are happy with
your own stuff, you haven't made much noise about it on the
list lately. I apologise.
I wonder though, why don't you just teach OpenMP to use
affinities as well? Surely that is better than relying on the
behaviour of the scheduler, even if it does balance on clone.
>
>>convincing benchmarks that even show best case improvements. And
>>it could very easily have some bad cases.
>
>
> Again: I'm talking about having the choice. The user decides. Nothing
> protects you against user stupidity, but if their only choice is poor
> automatic initial scheduling, that's not enough. And having a
> fork/clone initial balancing policy means you don't need to make your
> code complicated and unportable by playing with setaffinity (which is
> plainly unusable when you share the machine with other users).
>
If you do it by hand, you know exactly what is going to happen,
and you can turn off the balance-on-clone flags and you don't
incur the hit of pulling in remote cachelines from every CPU at
clone time to do balancing. Surely an HPC application wouldn't
mind doing that? (I guess they probably don't call clone a lot
though).
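
(For reference, the manual approach looks roughly like this: pinning a
thread with sched_setaffinity(2), which is a real interface. The CPU
number is arbitrary, and as Erich notes above, hard-coding it is
exactly what makes this awkward on a machine shared with other users.)

/* A minimal example of the manual approach: pin the calling thread to
 * one CPU with sched_setaffinity(2).  Error handling kept short; the
 * CPU number is just an example. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_to_cpu(int cpu)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);

	/* pid 0 means "the calling thread" */
	if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
		perror("sched_setaffinity");
		return -1;
	}
	return 0;
}

int main(void)
{
	if (pin_to_cpu(2) == 0)
		puts("now restricted to CPU 2");
	return 0;
}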
>
>>And finally, HPC
>>applications are the very ones that should be using CPU
>>affinities because they are usually tuned quite tightly to the
>>specific architecture.
>
>
> There are companies mainly selling NUMA machines for HPC (SGI?), so
> this is not a niche market. Clusters of big NUMA machines are not
> unusual, and they're typically not used for databases but for HPC
> apps. Unfortunately proprietary UNIX is still considered to have
> better features than Linux for such configurations.
>
Well, SGI should be doing tests soon and tuning the scheduler
to their liking. Hopefully others will too, so we'll see what
happens.
>
>>Let's just make sure we don't change defaults without any
>>reason...
>
>
> No reason? Aaarghh... >;-)
>
Sorry, I meant evidence. I'm sure with a properly tuned
implementation, you could get really good speedups in lots
of places... I just want to *see* them. All I have seen so
far is Andi getting a bit better performance on something
where he can get *much* better performance by making a
trivial tweak instead.
I really don't have the software or hardware to test this
at all so I just have to sit and watch.
Ingo Molnar wrote:
> - use sync wakeups for parent-wakeup. This makes a single-task strace
> execute on only one CPU on SMP, which is precisely what we want. It
> should also be a speedup for a number of workloads where the parent
> is actively wait4()-ing for the child to exit.
Nice
On Tuesday 30 March 2004 13:02, Andrew Morton wrote:
> Erich Focht <[email protected]> wrote:
> > > And finally, HPC
> > > applications are the very ones that should be using CPU
> > > affinities because they are usually tuned quite tightly to the
> > > specific architecture.
> >
> > There are companies mainly selling NUMA machines for HPC (SGI?), so
> > this is not a niche market.
>
> It is niche in terms of number of machines and in terms of affected users.
> And the people who provide these machines have the resources to patch the
> scheduler if needs be.
Uhm, depends on the CPUs you think of. I bet much more than half of
the Opterons and Itanium2 CPUs sold last year went into HPC. Certainly
not so many IA64s went into NUMA machines. But almost all Opterons ;-)
IBM's NUMA machines with Power CPUs are mainly sold with AIX into the
HPC market; I don't recall having seen big HPC installations with HP
Superdome under Linux, not yet...? IBM sells x86-NUMA more into the
commercial market; the only big visible Linux-NUMA in HPC is SGI's
Altix. Most of the other NUMA machines go into HPC with other OSes and
we don't care about them (yet?). So you're probably right about the
number of Linux-NUMA-HPC users, but this actually shows that
Linux-NUMA is currently not the ideal choice. We're working on it,
right?
> Correct me if I'm wrong, but what we have here is a situation where if we
> design the scheduler around the HPC requirement, it will work poorly in a
> significant number of other applications. And we don't see a way of fixing
> this without either a /proc/i-am-doing-hpc, or a config option, or
> requiring someone to carry an external patch, yes?
>
> If so then all of those seem reasonable options to me. We should optimise
> the scheduler for the common case, and that ain't HPC.
Yes! A per process flag would be enough to have the choice.
> If we agree that architecturally sched-domains _can_ satisfy the HPC
> requirement then I think that's good enough for now. I'd prefer that Ingo
> and Nick not have to bust a gut trying to get optimum HPC performance
> before the code is even merged up.
Sure. On the other hand the benchmark brought into discussion by Andi
is very easy to understand, much easier than any Java monster. If the
scheduler doesn't have a knob for running this optimally, it's
disappointing.
> Do you agree that sched-domains is architected appropriately?
My current impression is: YES. My testing experience with it is
still very limited...
Regards,
Erich
On Tuesday 30 March 2004 17:01, Martin J. Bligh wrote:
> > I don't think it's worth waiting and hoping that somebody shows up with
> > a magic algorithm which balances every kind of job optimally.
>
> Especially as I don't believe that exists ;-) It's not deterministic.
Right, so let's choose the initial balancing policy on a per process
basis.
> > Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7
> > are not relevant for HPC. In a compute center it actually doesn't
> > matter much whether some shell command returns 10% faster, it just
> > shouldn't disturb my super simulation code for which I bought an
> > expensive NUMA box.
>
> OK, but the scheduler can't know the difference automatically, I don't
> think ... and whether we should tune the scheduler for "user work" or
> HPC is going to be a hotly contested point ;-) We need to try to find
> something that works for both. And suppose you have a 4 node system,
> with 4 HPC apps running? Surely you want each app to have one node to
> itself?
If the machine is 100% full all the time and all apps demand the same
amount of bandwidth, yes, I want 1 job per node. If the average load is
less than 100% (sometimes only 2-3 jobs are running) then I'd prefer to
spread the processes of a job across the machine. The average bandwidth
per process will be higher. Modern NUMA machines have big bandwidth to
neighboring nodes and not too bad latency penalties for remote accesses.
Regards,
Erich
> On Tuesday 30 March 2004 17:01, Martin J. Bligh wrote:
>> > I don't think it's worth waiting and hoping that somebody shows up with
>> > a magic algorithm which balances every kind of job optimally.
>>
>> Especially as I don't believe that exists ;-) It's not deterministic.
>
> Right, so let's choose the initial balancing policy on a per process
> basis.
Yup, that seems like a reasonable thing to do. That way you can override
it for things that fork and never exec, if they're performance critical
(like HPC maybe).
>> > Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7
>> > are not relevant for HPC. In a compute center it actually doesn't
>> > matter much whether some shell command returns 10% faster, it just
>> > shouldn't disturb my super simulation code for which I bought an
>> > expensive NUMA box.
>>
>> OK, but the scheduler can't know the difference automatically, I don't
>> think ... and whether we should tune the scheduler for "user work" or
>> HPC is going to be a hotly contested point ;-) We need to try to find
>> something that works for both. And suppose you have a 4 node system,
>> with 4 HPC apps running? Surely you want each app to have one node to
>> itself?
>
> If the machine is 100% full all the time and all apps demand the same
> amount of bandwidth, yes, I want 1 job per node. If the average load is
> less than 100% (sometimes only 2-3 jobs are running) then I'd prefer to
> spread the processes of a job across the machine. The average bandwidth
> per process will be higher. Modern NUMA machines have big bandwidth to
> neighboring nodes and not too bad latency penalties for remote accesses.
In theory at least, doing the rebalance_on_clone if and only if there are
idle CPUs on another node sounds reasonable. In practice, I'm not sure
how well that'll work, since one app may well start wholly before another,
but maybe we can figure out something smart to do.
M.
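
(A toy model of the heuristic Martin sketches above - spread at clone
time only if a remote node has an idle CPU, otherwise stay local. The
node/CPU layout and load numbers are invented purely for illustration;
this is not code from any posted patch.)

/* Editorial sketch of the "balance on clone only if a remote node has
 * an idle CPU" heuristic.  Not kernel code. */
#include <stdio.h>

#define NODES		2
#define CPUS_PER_NODE	2

/* nr_running per CPU, indexed [node][cpu]; 0 means idle. */
static int load[NODES][CPUS_PER_NODE] = {
	{ 3, 2 },	/* node 0: busy (the parent runs here) */
	{ 0, 1 },	/* node 1: CPU 0 is idle               */
};

/* Return the CPU the freshly cloned thread should start on. */
static int clone_balance(int parent_node, int parent_cpu)
{
	int node, cpu;

	for (node = 0; node < NODES; node++) {
		if (node == parent_node)
			continue;
		for (cpu = 0; cpu < CPUS_PER_NODE; cpu++) {
			if (load[node][cpu] == 0)
				/* idle CPU on a remote node: use it */
				return node * CPUS_PER_NODE + cpu;
		}
	}
	/* nothing idle elsewhere: stay local, as today */
	return parent_node * CPUS_PER_NODE + parent_cpu;
}

int main(void)
{
	printf("cloned thread starts on CPU %d\n", clone_balance(0, 0));
	return 0;
}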
On Wednesday 31 March 2004 04:08, Nick Piggin wrote:
> >>I'm with Martin here, we are just about to merge all this
> >>sched-domains stuff. So we should at least wait until after
> >>that. And of course, *nothing* gets changed without at least
> >>one benchmark that shows it improves something. So far
> >>nobody has come up to the plate with that.
> >
> > I thought you were talking the whole time about STREAM. That is THE
> > benchmark which shows you the impact of balancing at fork. And it is a
> > VERY relevant benchmark. Though you shouldn't run it on historical
> > machines like NUMAQ; no compute center in the western world will buy
> > NUMAQs for high performance... Andi typically runs STREAM on all CPUs
> > of a machine. Try on N/2 and N/4 and so on, you'll see the impact.
>
> Well yeah, but the immediate problem was that sched-domains was
> *much* worse than 2.6's numasched, neither of which balance on
> fork/clone. I didn't want to obscure the issue by implementing
> balance on fork/clone until we worked out exactly the problem.
I had the feeling that solving the performance issue reported by Andi
would ease the integration into the baseline...
> >>The point is that we have never had this before, and nobody
> >>(until now) has been asking for it. And there are as yet no
> >
> > ?? Sorry, I've had balance at fork since 2001 in the NEC IA64 NUMA
> > kernels, and users use it intensively with OpenMP. I advertised it a lot,
> > asked for it, talked about it at the last OLS. But IA64 was
> > considered rare big iron. I understand that the issue gets hotter if
> > the problem hurts on AMD64...
>
> Sorry I hadn't realised. I guess because you are happy with
> your own stuff you don't make too much noise about it on the
> list lately. I apologise.
The usual excuse: busy with other stuff...
> I wonder though, why don't you just teach OpenMP to use
> affinities as well? Surely that is better than relying on the
> behaviour of the scheduler, even if it does balance on clone.
You mean in the compiler? I don't think this is a good idea; that way you
lose flexibility in resource overcommitment, and performance when
oversubscribing the machine's CPUs.
> > Again: I'm talking about having the choice. The user decides. Nothing
> > protects you against user stupidity, but if their only choice is poor
> > automatic initial scheduling, that's not enough. And having a
> > fork/clone initial balancing policy means you don't need to make your
> > code complicated and unportable by playing with setaffinity (which is
> > plainly unusable when you share the machine with other users).
>
> If you do it by hand, you know exactly what is going to happen,
> and you can turn off the balance-on-clone flags and you don't
> incur the hit of pulling in remote cachelines from every CPU at
> clone time to do balancing. Surely an HPC application wouldn't
> mind doing that? (I guess they probably don't call clone a lot
> though).
OpenMP is implemented with clone. MPI parallel applications just exec,
so they're fine. IMO the static affinity/cpumask handling should be done
externally by some resource manager which has a good overview of the
long-term load of the machine. It's a different issue, nothing for the
scheduler. I wouldn't leave it to the program; that is too inflexible and
unportable across machines and OSes.
> > There are companies mainly selling NUMA machines for HPC (SGI?), so
> > this is not a niche market. Clusters of big NUMA machines are not
> > unusual, and they're typically not used for databases but for HPC
> > apps. Unfortunately proprietary UNIX is still considered to have
> > better features than Linux for such configurations.
>
> Well, SGI should be doing tests soon and tuning the scheduler
> to their liking. Hopefully others will too, so we'll see what
> happens.
Maybe they are happy with their stuff, too. They have the cpumemsets
and some external affinity control, AFAIK.
> >>Let's just make sure we don't change defaults without any
> >>reason...
> >
> > No reason? Aaarghh... >;-)
>
> Sorry I mean evidence. I'm sure with a properly tuned
> implementation, you could get really good speedups in lots
> of places... I just want to *see* them. All I have seen so
> far is Andi getting a bit better performance on something
> where he can get *much* better performance by making a
> trivial tweak instead.
I get the feeling that Andi's simple OpenMP job is already complex
enough to lead to wrong initial scheduling with the current approach.
I suppose the reason is the 1-2 helper threads which are started
together with the worker threads (depending on the compiler used).
On small machines (and 4 CPUs is small) they significantly disturb
the initial task distribution. For example, with the Intel compiler
and 4 worker threads you get 6 tasks. The helper tasks are typically
runnable when the code starts, so you get (in order of creation):
CPU Task Role
1 1 worker
2 2 helper
3 3 helper
4 4 worker
1-4 5 worker
1-4 6 worker
So the difficulty is to find out which task will do real work and
which task is just spoiling the statistics. I think...
Regards,
Erich