Hi,
I hope that you can include the following set of CPU scheduler
patches in -mm soon, if you have no other significant performance
work going on.
There are some fairly significant changes, with a few basic aims:
* Improve SMT behaviour
* Improve CMP behaviour, CMP/NUMA scheduling (ie. Opteron)
* Reduce task movement, esp over NUMA nodes.
They are not going to be very well tuned for most workloads at the
moment (unfortunately dbt2/3-pgsql on OSDL isn't working right now,
and it is a good one), so hopefully I can address regressions as
they come up.
There are a few problems with the scheduler currently:
Problem #1:
It has _very_ aggressive idle CPU pulling. Not only does it not
really obey the imbalance, it is also wrong for eg. an SMT CPU
whose sibling is not idle. The real reason this was done is to
bring down idle time on some workloads (dbt2-pgsql, other
database stuff).
So I address this in the following ways: reduce the special casing
for idle balancing, and revert some of the recent moves toward even
more aggressive balancing.
Then provide a range of averaging levels for CPU "load averages",
and we choose which to use in which situation on a sched-domain
basis. This allows idle balancing to use a more instantaneous value
for calculating load, so idle CPUs need not wait many timer ticks
for the load averages to catch up. This can hopefully solve our
idle time problems.
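[For illustration, a minimal sketch of the "range of averaging levels"
idea: each CPU keeps several decayed load figures, and each sched-domain
chooses how instantaneous a view it wants, so an idle CPU can read a low,
barely-damped index instead of waiting for the slow average to catch up.
All names below are assumptions for the sketch, not the patch's code.]

/* Sketch only: track runqueue load at several averaging levels.
 * Index 0 is instantaneous; higher indices are progressively more damped. */
#define NR_LOAD_IDX	3

struct rq_load_sketch {
	unsigned long nr_running;
	unsigned long cpu_load[NR_LOAD_IDX];
};

/* Called from the timer tick: fold the current load into each level. */
void update_cpu_load_sketch(struct rq_load_sketch *rq)
{
	unsigned long cur = rq->nr_running * 128;	/* fixed-point load units */
	int i;

	rq->cpu_load[0] = cur;				/* instantaneous */
	for (i = 1; i < NR_LOAD_IDX; i++) {
		unsigned long scale = 1UL << i;
		/* exponential decay; heavier damping at higher i */
		rq->cpu_load[i] = (rq->cpu_load[i] * (scale - 1) + cur) / scale;
	}
}

[A sched-domain would then carry something like an idle_idx/busy_idx pair
saying which level to consult when its CPUs are idle or busy.]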
Also, further moderate "affine wakeups", which can tend to move
most tasks to one CPU on some workloads and cause idle problems.
Problem #2:
The second problem is that balance-on-exec is not sched-domains
aware. This means it will tend to (for example) fill up two cores
of a CPU on one socket, then fill up two cores on the next socket,
etc. What we want is to try to spread load evenly across memory
controllers.
So make that sched-domains aware following the same pattern as
find_busiest_group / find_busiest_queue.
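[In outline, the sched-domains aware exec balance looks something like
the sketch below. It assumes the 2.6.11-era sched.c context
(for_each_domain(), struct sched_domain, the SD_BALANCE_EXEC flag);
find_idlest_group()/find_idlest_cpu() stand in for helpers that mirror
find_busiest_group()/find_busiest_queue() but search for the least
loaded group and CPU.]

/* Sketch: pick a CPU for an exec'ing task by walking up the domain tree. */
static int sched_balance_exec_sketch(int this_cpu)
{
	struct sched_domain *sd, *top = NULL;
	struct sched_group *group;

	/* Use the widest domain that wants exec balancing (typically the
	 * node level), so new images spread across memory controllers
	 * instead of packing one socket at a time. */
	for_each_domain(this_cpu, sd)
		if (sd->flags & SD_BALANCE_EXEC)
			top = sd;

	if (!top)
		return this_cpu;

	group = find_idlest_group(top, this_cpu);	/* cf. find_busiest_group */
	if (!group)
		return this_cpu;
	return find_idlest_cpu(group, this_cpu);	/* cf. find_busiest_queue */
}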
Problem #3:
Lastly, implement balance-on-fork/clone again. I have come to the
realisation that for NUMA, this is probably the best solution.
Run-cloned-child-last has run out of steam on CMP systems. What
it was supposed to do was provide a period where the child could
be pulled to another CPU before it starts running and allocating
memory. Unfortunately, on CMP systems the child tends to just get
pulled to the other sibling core, which shares the same memory
controller anyway.
Also, having such a difference between thread and process creation
was not really ideal, so we balance on all types of fork/clone.
This really helps some things (like STREAM) on CMP Opterons, but
also hurts others, so naturally it is settable per-domain.
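[The "settable per-domain" part would look roughly like the sketch below,
assuming the patchset adds an SD_BALANCE_FORK domain flag next to the
existing ones (SD_LOAD_BALANCE, SD_BALANCE_EXEC, SD_WAKE_AFFINE are real
2.6.11 flags). The *_SKETCH initializers only mimic the shape of the real
SD_CPU_INIT/SD_NODE_INIT templates and omit all the other fields.]

/* Sketch: a CMP Opteron might enable fork/clone balancing only at the
 * node level, so children spread across memory controllers but are not
 * needlessly bounced between sibling cores on one package. */

/* CPU (core) level: leave fork balancing off */
#define SD_CPU_INIT_SKETCH (struct sched_domain) {			\
	.flags = SD_LOAD_BALANCE | SD_BALANCE_EXEC | SD_WAKE_AFFINE,	\
}

/* Node level: spread fork/clone across memory controllers */
#define SD_NODE_INIT_SKETCH (struct sched_domain) {			\
	.flags = SD_LOAD_BALANCE | SD_BALANCE_EXEC | SD_BALANCE_FORK,	\
}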
Problem #4:
Sched domains isn't very useful to me in its current form. Bring
it up to date with what I've been using. I don't think anyone other
than myself uses it so that should be OK.
Nick
2/13
4/13
3/13
I have an updated userspace parser for this thing, if you
are still keeping it on your website.
5/13
7/13
8/13
12/13
11/13
10/13
* Nick Piggin <[email protected]> wrote:
> 1/13
>
ugh, has this been tested? It needs the patch below.
Ingo
Signed-off-by: Ingo Molnar <[email protected]>
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -2704,11 +2704,11 @@ need_resched_nonpreemptible:
 
 	schedstat_inc(rq, sched_cnt);
 	now = sched_clock();
-	if (likely((long long)now - prev->timestamp < NS_MAX_SLEEP_AVG))
+	if (likely((long long)now - prev->timestamp < NS_MAX_SLEEP_AVG)) {
 		run_time = now - prev->timestamp;
 		if (unlikely((long long)now - prev->timestamp < 0))
 			run_time = 0;
-	else
+	} else
 		run_time = NS_MAX_SLEEP_AVG;
 
 	/*
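[The bug the fix addresses is a dangling else: as merged in 1/13, the
compiler binds the else to the inner unlikely() check, so run_time gets
clobbered with NS_MAX_SLEEP_AVG whenever the timestamp delta is
non-negative, i.e. in the common case. Re-indented to show how it parses:]

	if (likely((long long)now - prev->timestamp < NS_MAX_SLEEP_AVG))
		run_time = now - prev->timestamp;

	if (unlikely((long long)now - prev->timestamp < 0))
		run_time = 0;
	else					/* taken whenever delta >= 0 ... */
		run_time = NS_MAX_SLEEP_AVG;	/* ... clobbering the normal case */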
6/13
On Thu, 2005-02-24 at 08:46 +0100, Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
>
> > 1/13
> >
>
> ugh, has this been tested? It needs the patch below.
>
Yes. Which might also explain why I didn't see -ve intervals :(
Thanks Ingo.
In the context of the whole patchset, testing has mainly been
based around multiprocessor behaviour so this doesn't invalidate
that.
* Nick Piggin <[email protected]> wrote:
> 2/13
yeah, we need this. (Eventually someone should explore a different way
to handle affine tasks as this is getting quirky, although it looks
algorithmically impossible in an O(1) way.)
Signed-off-by: Ingo Molnar <[email protected]>
Ingo
* Nick Piggin <[email protected]> wrote:
> 3/13
>
> I have an updated userspace parser for this thing, if you
> are still keeping it on your website.
Signed-off-by: Ingo Molnar <[email protected]>
Ingo
* Nick Piggin <[email protected]> wrote:
> On Thu, 2005-02-24 at 08:46 +0100, Ingo Molnar wrote:
> > * Nick Piggin <[email protected]> wrote:
> >
> > > 1/13
> > >
> >
> > ugh, has this been tested? It needs the patch below.
> >
>
> Yes. Which might also explain why I didn't see -ve intervals :( Thanks
> Ingo.
>
> In the context of the whole patchset, testing has mainly been based
> around multiprocessor behaviour so this doesn't invalidate that.
nono, by 'this' i only meant that patch. The other ones look mainly OK,
but obviously they need a _ton_ of testing.
these:
[PATCH 1/13] timestamp fixes
(+fix)
[PATCH 2/13] improve pinned task handling
[PATCH 3/13] rework schedstats
can go into BK right after 2.6.11 is released as they are fixes or
no-risk improvements. [lets call them 'group A'] These three:
[PATCH 4/13] find_busiest_group fixlets
[PATCH 5/13] find_busiest_group cleanup
[PATCH 7/13] better active balancing heuristic
look pretty fine too and i'd suggest early BK integration too - but in
theory they could impact things negatively so that's where immediate BK
integration has to stop in the first phase, to get some feedback. [lets
call them 'group B']
these:
[PATCH 6/13] no aggressive idle balancing
[PATCH 8/13] generalised CPU load averaging
[PATCH 9/13] less affine wakups
[PATCH 10/13] remove aggressive idle balancing
[PATCH 11/13] sched-domains aware balance-on-fork
[PATCH 12/13] schedstats additions for sched-balance-fork
[PATCH 13/13] basic tuning
change things radically, and i'm uneasy about them even in the 2.6.12
timeframe. [lets call them 'group C'] I'd suggest we give them a go in
-mm and see how things go, so all of them get:
Acked-by: Ingo Molnar <[email protected]>
If things dont stabilize quickly then we need to do it piecemeal-wise.
The only possible natural split seems to be to go for the running-task
balancing changes first:
[PATCH 6/13] no aggressive idle balancing
[PATCH 8/13] generalised CPU load averaging
[PATCH 9/13] less affine wakups
[PATCH 10/13] remove aggressive idle balancing
[PATCH 13/13] basic tuning
perhaps #8 and relevant portions of #13 could be moved from group C into
group B and thus hit BK early, but that would need remerging.
and then for the fork/clone-balancing changes:
[PATCH 11/13] sched-domains aware balance-on-fork
[PATCH 12/13] schedstats additions for sched-balance-fork
a more fine-grained split-up doesnt make much sense, as these groups are
pretty compact conceptually.
But i expect fork/clone balancing to be almost certainly a problem. (We
didnt get it right for all workloads in 2.6.7, and i think it cannot be
gotten right currently either, without userspace API help - but i'd be
happy to be proven wrong.)
(if you agree with my generic analysis then when you regenerate your
patches next time please reorder them according to the flow above, and
please try to insert future fixlets not end-of-stream but according to
the conceptual grouping.)
Ingo
* Nick Piggin <[email protected]> wrote:
> 4/13
> 5/13
#insert <previous mail>
Acked-by: Ingo Molnar <[email protected]>
Ingo
* Nick Piggin <[email protected]> wrote:
> 7/13
yeah, we need this one too.
Acked-by: Ingo Molnar <[email protected]>
Ingo
* Nick Piggin <[email protected]> wrote:
> [PATCH 6/13] no aggressive idle balancing
>
> [PATCH 8/13] generalised CPU load averaging
> [PATCH 9/13] less affine wakups
> [PATCH 10/13] remove aggressive idle balancing
they look fine, but these are the really scary ones :-) Maybe we could
do #8 and #9 first, then #6+#10. But it's probably pointless to look at
these in isolation.
Acked-by: Ingo Molnar <[email protected]>
Ingo
* Nick Piggin <[email protected]> wrote:
> [PATCH 11/13] sched-domains aware balance-on-fork
> [PATCH 12/13] schedstats additions for sched-balance-fork
> [PATCH 13/13] basic tuning
STREAM numbers are tricky. It's pretty much the only benchmark that 1)
relies on being able to allocate a lot of RAM in a NUMA-friendly way and 2)
does all of its memory allocation in the first timeslice of cloned
worker threads.
There is little help we get from userspace, and i'm not sure we want to
add scheduler overhead for this single benchmark - when something like a
_tiny_ bit of NUMAlib use within the OpenMP library would probably solve
things equally well!
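[For illustration, the kind of "tiny bit of NUMAlib use" meant here: the
OpenMP runtime (or the benchmark itself) can spread its worker threads
over nodes with libnuma, so first-touch allocations land on the right
memory controller without any scheduler involvement. numa_available(),
numa_run_on_node() and numa_max_node() are real libnuma calls; the
wrapper below is a hypothetical sketch.]

#include <numa.h>

/* Sketch: round-robin worker threads across NUMA nodes at start-up. */
static void bind_worker_to_node(int worker_id)
{
	if (numa_available() < 0)
		return;			/* no NUMA support: do nothing */
	numa_run_on_node(worker_id % (numa_max_node() + 1));
}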
Anyway, the code itself looks fine and it would be good if it improved
things, so:
Acked-by: Ingo Molnar <[email protected]>
but this too needs a lot of testing, and it is the one that has the highest
likelihood of actually not making it upstream.
Ingo
Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
>
>
>> [PATCH 6/13] no aggressive idle balancing
>>
>> [PATCH 8/13] generalised CPU load averaging
>> [PATCH 9/13] less affine wakups
>> [PATCH 10/13] remove aggressive idle balancing
>
>
> they look fine, but these are the really scary ones :-) Maybe we could
> do #8 and #9 first, then #6+#10. But it's probably pointless to look at
> these in isolation.
>
Oh yes, they are very scary and I guarantee they'll cause
problems :P
I didn't have any plans to get these in for 2.6.12 (2.6.13 at the
very earliest). But it will be nice if Andrew can pick these up
early so we try to get as much regression testing as possible.
I pretty much agree with your earlier breakdown of the patches (ie.
some are fixes, others fairly straightforward improvements that may
get into 2.6.12, of course). Thanks very much for the review.
I expect to rework the patches, and things will get tuned and
changed around a bit... Any problem with you taking these now
though Andrew?
Nick
* Nick Piggin <[email protected]> wrote:
> Ingo Molnar wrote:
> >* Nick Piggin <[email protected]> wrote:
> >
> >
> >>[PATCH 6/13] no aggressive idle balancing
> >>
> >>[PATCH 8/13] generalised CPU load averaging
> >>[PATCH 9/13] less affine wakups
> >>[PATCH 10/13] remove aggressive idle balancing
> >
> >
> >they look fine, but these are the really scary ones :-) Maybe we could
> >do #8 and #9 first, then #6+#10. But it's probably pointless to look at
> >these in isolation.
> >
>
> Oh yes, they are very scary and I guarantee they'll cause
> problems :P
:-|
> I didn't have any plans to get these in for 2.6.12 (2.6.13 at the very
> earliest). But it will be nice if Andrew can pick these up early so we
> try to get as much regression testing as possible.
>
> I pretty much agree with your earlier breakdown of the patches (ie.
> some are fixes, others fairly straightforward improvements that may get
> into 2.6.12, of course). Thanks very much for the review.
>
> I expect to rework the patches, and things will get tuned and changed
> around a bit... Any problem with you taking these now though Andrew?
sure, fine with me.
Ingo
Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
>
>
>> [PATCH 11/13] sched-domains aware balance-on-fork
>> [PATCH 12/13] schedstats additions for sched-balance-fork
>> [PATCH 13/13] basic tuning
>
>
> STREAM numbers are tricky. It's pretty much the only benchmark that 1)
> relies on being able to allocate a lot of RAM in a NUMA-friendly way and 2)
> does all of its memory allocation in the first timeslice of cloned
> worker threads.
>
I know what you mean... but this is not _just_ for STREAM. Firstly,
if we start 4 tasks on one core (of a 4 socket / 8 core system), and
just let them be moved around by the periodic balancer, they will
tend to cluster on 2 or 3 CPUs, and that will be the steady state.
> There is little help we get from userspace, and i'm not sure we want to
> add scheduler overhead for this single benchmark - when something like a
> _tiny_ bit of NUMAlib use within the OpenMP library would probably solve
> things equally well!
>
True, for OpenMP apps (and this work shouldn't stop that from happening).
But other threaded apps are also important, and fork()ed apps can be
important too.
What I hear from the NUMA guys (POWER5, AMD) is that they really want to
keep memory controllers busy. This seems to be the best way to do it.
There are a few differences between this and when we last tried it. The
main thing is that the balancer is now sched-domains aware. I hope we
can get it to do the right thing more often (at least it is a per domain
flag, so those who don't want it don't turn it on).
> Anyway, the code itself looks fine and it would be good if it improved
> things, so:
>
> Acked-by: Ingo Molnar <[email protected]>
>
> but this too needs a lot of testing, and it is the one that has the highest
> likelihood of actually not making it upstream.
>
Thanks for reviewing.
    I have an updated userspace parser for this thing, if you are still
    keeping it on your website.
Sure, be happy to include it, thanks! Send it along. Is it for version
11 or version 12?
    Move balancing fields into struct sched_domain, so we can get more
    useful results on systems with multiple domains (eg SMT+SMP, CMP+NUMA,
    SMP+NUMA, etc).
It looks like you've also removed the stats for pull_task() and
wake_up_new_task(). Was this intentional, or accidental?
I can't quite get the patch to apply cleanly against several variants
of 2.6.10 or 2.6.11-rc*. Which version is the patch for?
Rick
    There is little help we get from userspace, and i'm not sure we want to
    add scheduler overhead for this single benchmark - when something like a
    _tiny_ bit of NUMAlib use within the OpenMP library would probably solve
    things equally well!
There has been a general problem with sched domains trying to
meet two goals: "1) spread things around evenly within a domain and
balance across domains infrequently", and "2) load up cores before
loading up siblings, even at the expense of violating 1)".
We've had trouble getting both 1) and 2) implemented correctly in
the past. If this patch gets us closer to that nirvana, it will be
valuable regardless of the benchmark it also happens to be improving.
Regardless, I agree it will need good testing, and we may need to
pick the wheat from the chaff.
Rick
Rick Lindsley wrote:
> I have an updated userspace parser for this thing, if you are still
> keeping it on your website.
>
> Sure, be happy to include it, thanks! Send it along. Is it for version
> 11 or version 12?
>
Version 12. I can send it to you next week. This was actually
directed at Andrew, who was hosting my parser somewhere for a
while. But it would probably be better if it just goes on your
site.
Sorry, no update to your script because I can't write perl to
save my life, let alone read it :|
> Move balancing fields into struct sched_domain, so we can get more
> useful results on systems with multiple domains (eg SMT+SMP, CMP+NUMA,
> SMP+NUMA, etc).
>
> It looks like you've also removed the stats for pull_task() and
> wake_up_new_task(). Was this intentional, or accidental?
>
I didn't really think wunt_cnt or wunt_moved were very interesting
with respect to the scheduler. wunt_cnt is more or less just the
total number of forks, wunt_moved is a very rare situation where
the new task is moved before being woken up.
The balance-on-fork schedstats code does add some more interesting
stuff here.
pull_task() lost the "lost" counter because that is tricky to put
into the context of sched-domains. It also isn't important in my
current line of analysis because I'm just doing summations over
all CPUs, so in that case your "gained" was always the same as
"lost".
If you have uses for other counters, by all means send patches and
we can discuss (not that I can imagine I'd have any objections).
> I can't quite get the patch to apply cleanly against several variants
> of 2.6.10 or 2.6.11-rc*. Which version is the patch for?
>
It was 11-rc4, but it should fit on -rc5 OK too.
Rick Lindsley wrote:
> There is little help we get from userspace, and i'm not sure we want to
> add scheduler overhead for this single benchmark - when something like a
> _tiny_ bit of NUMAlib use within the OpenMP library would probably solve
> things equally well!
>
> There has been a general problem with sched domains trying to
> meet two goals: "1) spread things around evenly within a domain and
> balance across domains infrequently", and "2) load up cores before
> loading up siblings, even at the expense of violating 1)".
>
Yes, you hit the nail on the head. Well, the other (potentially
more problematic) part of the equation is "3) keep me close to
my parent and siblings, because we'll be likely to share memory
and/or communicate".
However: I'm hoping that on unloaded or lightly loaded NUMA
systems, it is usually the right choice to spread tasks across
nodes. Especially on the newer breed of low remote latency, high
bandwidth systems like Opterons and POWER5s.
When load ramps up a bit and we start saturating CPUs, the amount
of balance-on-forking should slow down, so we start to fulfil
requirement 3 for workloads that perhaps resemble more general
purpose server stuff.
That's the plan anyway. We'll see...
Nick, can you describe the system you run the DB tests on? Do you have
any cpu idle time stats and hopefully some context switch rate stats?
I think I understand the concern [patch 6] of stealing a task from one
node to an idle cpu in another node, but I wonder if we can have some
sort of check for idle balance: if the domain/node to steal from has
some idle cpu somewhere, we do not steal, period. To do this we have a
cpu_idle bitmask, we update as cpus go idle/busy, and we reference this
cpu_idle & sd->cpu_mask to see if there's at least one cpu that's idle.
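[A sketch of the check being proposed here. cpu_idle_map is a
hypothetical global cpumask maintained as CPUs enter and leave the idle
loop; the domain's CPU mask is the span field in 2.6.11 (called
sd->cpu_mask above).]

/* Sketch: "don't steal across nodes if the source node still has an
 * idle CPU of its own". */
static int domain_has_idle_cpu(struct sched_domain *sd)
{
	cpumask_t tmp;

	cpus_and(tmp, cpu_idle_map, sd->span);
	return !cpus_empty(tmp);
}

/* The idle balancer would then skip the cross-node pull whenever the
 * busiest node can service its own imbalance. */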
> Ingo wrote:
>
> But i expect fork/clone balancing to be almost certainly a problem. (We
> didnt get it right for all workloads in 2.6.7, and i think it cannot be
> gotten right currently either, without userspace API help - but i'd be
> happy to be proven wrong.)
Perhaps initially one could balance on fork up to the domain level which
has task_hot_time=0, up to a shared cache by default. Anything above
that could require a numactl like preference from userspace.
-Andrew Theurer
Andrew Theurer wrote:
> Nick, can you describe the system you run the DB tests on? Do you have
> any cpu idle time stats and hopefully some context switch rate stats?
Yeah, it is dbt3-pgsql on OSDL's 8-way STP machines. I think they're
PIII Xeons with 2MB L2 cache.
I had been having some difficulty running them recently, but I might
have just worked out what the problem was, so hopefully I can benchmark
the patchset I just sent to Andrew (plus fixes from Ingo, etc).
Basically what would happen is that with seemingly small changes, the
"throughput" figure would go from about 260 tps to under 100. I don't
know exactly why this benchmark is particularly sensitive to the
problem, but it has been a good canary for too-much-idle-time
regressions in the past.
> I think I understand the concern [patch 6] of stealing a task from one
> node to an idle cpu in another node, but I wonder if we can have some
> sort of check for idle balance: if the domain/node to steal from has
> some idle cpu somewhere, we do not steal, period. To do this we have a
> cpu_idle bitmask, we update as cpus go idle/busy, and we reference this
> cpu_idle & sd->cpu_mask to see if there's at least one cpu that's idle.
>
I think this could become a scalability problem... and I'd prefer to
initially try to address problems of too much idle time by tuning them
out rather than use things like wake-to-idle-cpu.
One of my main problems with wake to idle is that it is explicitly
introducing a bias so that wakers repel their wakees. With all else
being equal, we want exactly the opposite effect (hence all the affine
wakeups and wake balancing stuff).
The regular periodic balancing may be dumb, but at least it doesn't
have that kind of bias.
The other thing of course is that we want to reduce internode task
movement at almost all costs, and the periodic balancer can be very
careful about this - something more difficult for wake_idle unless
we introduce more complexity there (a very time critical path).
Anyway, I'm not saying no to anything at this stage... let's just see
how we go.
>> Ingo wrote:
>>
>> But i expect fork/clone balancing to be almost certainly a problem. (We
>> didnt get it right for all workloads in 2.6.7, and i think it cannot be
>> gotten right currently either, without userspace API help - but i'd be
>> happy to be proven wrong.)
>
>
> Perhaps initially one could balance on fork up to the domain level which
> has task_hot_time=0, up to a shared cache by default. Anything above
> that could require a numactl like preference from userspace.
>
That may end up being what we have to do. Probably doesn't make
much sense to do it at all if we aren't doing it for NUMA.
Nick
Nick Piggin wrote:
> Andrew Theurer wrote:
>
>> Nick, can you describe the system you run the DB tests on? Do you
>> have any cpu idle time stats and hopefully some context switch rate
>> stats?
>
>
> Yeah, it is dbt3-pgsql on OSDL's 8-way STP machines. I think they're
> PIII Xeons with 2MB L2 cache.
>
> I had been having some difficulty running them recently, but I might
> have just worked out what the problem was, so hopefully I can benchmark
> the patchset I just sent to Andrew (plus fixes from Ingo, etc).
>
Nope. Still not working :P
On Thu, Feb 24, 2005 at 11:13:14PM +1100, Nick Piggin wrote:
> Ingo Molnar wrote:
> > * Nick Piggin <[email protected]> wrote:
> >
> >
> >> [PATCH 6/13] no aggressive idle balancing
> >>
> >> [PATCH 8/13] generalised CPU load averaging
> >> [PATCH 9/13] less affine wakups
> >> [PATCH 10/13] remove aggressive idle balancing
> >
> >
> > they look fine, but these are the really scary ones :-) Maybe we could
> > do #8 and #9 first, then #6+#10. But it's probably pointless to look at
> > these in isolation.
> >
>
> Oh yes, they are very scary and I guarantee they'll cause
> problems :P
By code inspection, I see an issue with this patch
[PATCH 10/13] remove aggressive idle balancing
Why are we removing cpu_and_siblings_are_idle check from active_load_balance?
In case of SMT, we want to give prioritization to an idle package while
doing active_load_balance (in fact, active_load_balance will be kicked
mainly because there is an idle package).
Just the re-addition of cpu_and_siblings_are_idle check to
active_load_balance might not be enough. We somehow need to communicate
this to move_tasks, otherwise can_migrate_task will fail and we will
never be able to do active_load_balance.
thanks,
suresh
Siddha, Suresh B wrote:
>
> By code inspection, I see an issue with this patch
> [PATCH 10/13] remove aggressive idle balancing
>
> Why are we removing cpu_and_siblings_are_idle check from active_load_balance?
> In case of SMT, we want to give prioritization to an idle package while
> doing active_load_balance (in fact, active_load_balance will be kicked
> mainly because there is an idle package).
>
> Just the re-addition of cpu_and_siblings_are_idle check to
> active_load_balance might not be enough. We somehow need to communicate
> this to move_tasks, otherwise can_migrate_task will fail and we will
> never be able to do active_load_balance.
>
Active balancing should only kick in after the prescribed number
of rebalancing failures - can_migrate_task will see this, and
will allow the balancing to take place.
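[The check referred to, paraphrased from memory from the 2.6.11-era
can_migrate_task(), with comments added - not a verbatim quote:]

	if (task_running(rq, p))		/* never move the running task */
		return 0;
	if (!cpu_isset(this_cpu, p->cpus_allowed))
		return 0;

	/* aggressive migration only after repeated balance failures */
	if (sd->nr_balance_failed > sd->cache_nice_tries)
		return 1;

	if (task_hot(p, rq->timestamp_last_tick, sd))
		return 0;			/* cache-hot: leave it alone */
	return 1;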
That said, we currently aren't doing _really_ well for SMT on
some workloads, however with this patch we are heading in the
right direction I think.
I have been mainly looking at tuning CMP Opterons recently (they
are closer to a "traditional" SMP+NUMA than SMT, when it comes
to the scheduler's point of view). However, in earlier revisions
of the patch I had been looking at SMT performance and was able
to get it much closer to perfect:
I was working on a 4 socket x440 with HT. The problem area is
usually when the load is lower than the number of logical CPUs.
So on tbench, we do say 450MB/s with 4 or more threads without
HT, and 550MB/s with 8 or more threads with HT, however we only
do 300MB/s with 4 threads.
Those aren't the exact numbers, but that's basically what they
look like. Now I was able to bring the 4 thread + HT case much
closer to the 4 thread - HT numbers, but with earlier patchsets.
When I get a chance I will do more tests on the HT system, but
the x440 is infuriating for fine tuning performance, because it
is a NUMA system, but it doesn't tell the kernel about it, so
it will randomly schedule things on "far away" CPUs, and results
vary.
PS. Another thing I would like to see tested is a 3 level domain
setup (SMT + SMP + NUMA). I don't have access to one though.
Nick,
On Mon, Mar 07, 2005 at 04:34:18PM +1100, Nick Piggin wrote:
> Siddha, Suresh B wrote:
>
> >
> > By code inspection, I see an issue with this patch
> > [PATCH 10/13] remove aggressive idle balancing
> >
> > Why are we removing cpu_and_siblings_are_idle check from active_load_balance?
> > In case of SMT, we want to give prioritization to an idle package while
> > doing active_load_balance (in fact, active_load_balance will be kicked
> > mainly because there is an idle package).
> >
> > Just the re-addition of cpu_and_siblings_are_idle check to
> > active_load_balance might not be enough. We somehow need to communicate
> > this to move_tasks, otherwise can_migrate_task will fail and we will
> > never be able to do active_load_balance.
> >
>
> Active balancing should only kick in after the prescribed number
> of rebalancing failures - can_migrate_task will see this, and
> will allow the balancing to take place.
We are resetting the nr_balance_failed to cache_nice_tries after kicking
active balancing. But can_migrate_task will succeed only if
nr_balance_failed > cache_nice_tries.
>
> That said, we currently aren't doing _really_ well for SMT on
> some workloads, however with this patch we are heading in the
> right direction I think.
Lets take an example of three packages with two logical threads each.
Assume P0 is loaded with two processes(one in each logical thread),
P1 contains only one process and P2 is idle.
In this example, active balance will be kicked on one of the threads(assume
thread 0) in P0, which then should find an idle package and move it to
one of the idle threads in P2.
With your current patch, idle package check in active_load_balance has
disappeared, and we may end up moving the process from thread 0 to thread 1
in P0. I can't really make sense of the active_load_balance code
after your patch 10/13.
>
> I have been mainly looking at tuning CMP Opterons recently (they
> are closer to a "traditional" SMP+NUMA than SMT, when it comes
> to the scheduler's point of view). However, in earlier revisions
> of the patch I had been looking at SMT performance and was able
> to get it much closer to perfect:
>
I am reasonably sure that the removal of cpu_and_siblings_are_idle check
from active_load_balance will cause HT performance regressions.
> I was working on a 4 socket x440 with HT. The problem area is
> usually when the load is lower than the number of logical CPUs.
> So on tbench, we do say 450MB/s with 4 or more threads without
> HT, and 550MB/s with 8 or more threads with HT, however we only
> do 300MB/s with 4 threads.
Are you saying 2.6.11 has this problem?
>
> Those aren't the exact numbers, but that's basically what they
> look like. Now I was able to bring the 4 thread + HT case much
> closer to the 4 thread - HT numbers, but with earlier patchsets.
> When I get a chance I will do more tests on the HT system, but
> the x440 is infuriating for fine tuning performance, because it
> is a NUMA system, but it doesn't tell the kernel about it, so
> it will randomly schedule things on "far away" CPUs, and results
> vary.
Why don't you use any other simple HT+SMP system?
I will also do some performance analysis with your other patches
on some of the systems that I have access to.
thanks,
suresh
Siddha, Suresh B wrote:
> Nick,
>
> On Mon, Mar 07, 2005 at 04:34:18PM +1100, Nick Piggin wrote:
>
>>
>>Active balancing should only kick in after the prescribed number
>>of rebalancing failures - can_migrate_task will see this, and
>>will allow the balancing to take place.
>
>
> We are resetting the nr_balance_failed to cache_nice_tries after kicking
> active balancing. But can_migrate_task will succeed only if
> nr_balance_failed > cache_nice_tries.
>
It is indeed, thanks for catching that. We should probably make it
reset the count to the point where it will start moving cache hot
tasks (ie. cache_nice_tries+1).
I'll look at that and send Andrew a patch.
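[The change being described would be roughly the one-liner below, at the
point where load_balance() kicks the migration thread - a sketch of the
intended fix, not the eventual patch:]

-		sd->nr_balance_failed = sd->cache_nice_tries;
+		/*
+		 * Leave the count just past the threshold so that
+		 * can_migrate_task() (which needs nr_balance_failed >
+		 * cache_nice_tries) still lets the cache-hot task move.
+		 */
+		sd->nr_balance_failed = sd->cache_nice_tries+1;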
>
>>That said, we currently aren't doing _really_ well for SMT on
>>some workloads, however with this patch we are heading in the
>>right direction I think.
>
>
> Lets take an example of three packages with two logical threads each.
> Assume P0 is loaded with two processes(one in each logical thread),
> P1 contains only one process and P2 is idle.
>
> In this example, active balance will be kicked on one of the threads(assume
> thread 0) in P0, which then should find an idle package and move it to
> one of the idle threads in P2.
>
> With your current patch, idle package check in active_load_balance has
> disappeared, and we may end up moving the process from thread 0 to thread 1
> in P0. I can't really make sense of the active_load_balance code
> after your patch 10/13.
>
Ah yep, right you are there, too. I obviously hadn't looked closely
enough at the recent active_load_balance patches that had gone in :(
What we should probably do is heed the "push_cpu" prescription (push_cpu
is now unused).
I think active_load_balance is too complex at the moment, but still
too dumb to really make the right choice here over the full range of
domains. What we need to do is pass in some more info from load_balance,
so active_load_balance doesn't need any "smarts".
Thanks for pointing this out too. I'll make a patch.
>
>>I have been mainly looking at tuning CMP Opterons recently (they
>>are closer to a "traditional" SMP+NUMA than SMT, when it comes
>>to the scheduler's point of view). However, in earlier revisions
>>of the patch I had been looking at SMT performance and was able
>>to get it much closer to perfect:
>>
>
>
> I am reasonably sure that the removal of cpu_and_siblings_are_idle check
> from active_load_balance will cause HT performance regressions.
>
Yep.
>
>>I was working on a 4 socket x440 with HT. The problem area is
>>usually when the load is lower than the number of logical CPUs.
>>So on tbench, we do say 450MB/s with 4 or more threads without
>>HT, and 550MB/s with 8 or more threads with HT, however we only
>>do 300MB/s with 4 threads.
>
>
> Are you saying 2.6.11 has this problem?
>
I think so. I'll have a look at it again.
>
>>Those aren't the exact numbers, but that's basically what they
>>look like. Now I was able to bring the 4 thread + HT case much
>>closer to the 4 thread - HT numbers, but with earlier patchsets.
>>When I get a chance I will do more tests on the HT system, but
>>the x440 is infuriating for fine tuning performance, because it
>>is a NUMA system, but it doesn't tell the kernel about it, so
>>it will randomly schedule things on "far away" CPUs, and results
>>vary.
>
>
> Why don't you use any other simple HT+SMP system?
>
Yes I will, of course. Some issues can become more pronounced
with more physical CPUs, but the main reason is that the x440
is the only one with HT at work where I was doing testing.
To be honest I hadn't looked hard enough at the HT issues yet
as you've noticed. So thanks for the review and I'll fix things
up.
> I will also do some performance analysis with your other patches
> on some of the systems that I have access to.
>
Thanks.
Nick
Nick,
On Mon, Mar 07, 2005 at 07:28:23PM +1100, Nick Piggin wrote:
> Siddha, Suresh B wrote:
> > We are resetting the nr_balance_failed to cache_nice_tries after kicking
> > active balancing. But can_migrate_task will succeed only if
> > nr_balance_failed > cache_nice_tries.
> >
>
> It is indeed, thanks for catching that. We should probably make it
> reset the count to the point where it will start moving cache hot
> tasks (ie. cache_nice_tries+1).
That still might not be enough. We probably need to pass push_cpu's
sd to move_tasks call in active_load_balance, instead of current busiest_cpu's
sd. Just like push_cpu, we need to add one more field to the runqueue which
will specify the domain level of the push_cpu at which we have an imbalance.
>
> I'll look at that and send Andrew a patch.
>
> >
> >>That said, we currently aren't doing _really_ well for SMT on
> >>some workloads, however with this patch we are heading in the
> >>right direction I think.
> >
> >
> > Lets take an example of three packages with two logical threads each.
> > Assume P0 is loaded with two processes(one in each logical thread),
> > P1 contains only one process and P2 is idle.
> >
> > In this example, active balance will be kicked on one of the threads(assume
> > thread 0) in P0, which then should find an idle package and move it to
> > one of the idle threads in P2.
> >
> > With your current patch, idle package check in active_load_balance has
> > disappeared, and we may end up moving the process from thread 0 to thread 1
> > in P0. I can't really make sense of the active_load_balance code
> > after your patch 10/13.
> >
>
> Ah yep, right you are there, too. I obviously hadn't looked closely
> enough at the recent active_load_balance patches that had gone in :(
> What we should probably do is heed the "push_cpu" prescription (push_cpu
> is now unused).
push_cpu might not be the ideal destination in all cases. Take a NUMA domain
above SMT+SMP domains in my above example. Assume P0, P1 is in node-0 and
P2, P3 in node-1. Assume Loads of P0,P1,P2 are same as the above example,with P3
containing one process load. Now any idle thread in P2 or P3 can trigger
active load balance on P0. We should be selecting thread in P2 ideally
(currently this is what we get with idle package check). But with push_cpu,
we might move to the idle thread in P3 and then finally move to P2(it will be a
two step process)
thanks,
suresh
Siddha, Suresh B wrote:
> Nick,
>
> On Mon, Mar 07, 2005 at 07:28:23PM +1100, Nick Piggin wrote:
>
>>Siddha, Suresh B wrote:
>>
>>>We are resetting the nr_balance_failed to cache_nice_tries after kicking
>>>active balancing. But can_migrate_task will succeed only if
>>>nr_balance_failed > cache_nice_tries.
>>>
>>
>>It is indeed, thanks for catching that. We should probably make it
>>reset the count to the point where it will start moving cache hot
>>tasks (ie. cache_nice_tries+1).
>
>
> That still might not be enough. We probably need to pass push_cpu's
> sd to move_tasks call in active_load_balance, instead of current busiest_cpu's
> sd. Just like push_cpu, we need to add one more field to the runqueue which
> will specify the domain level of the push_cpu at which we have an imbalance.
>
It should be the lowest domain level that spans both this_cpu and
push_cpu, and has the SD_BALANCE flag set. We could possibly be a bit
more general here, but so long as nobody is coming up with weird and
wonderful sched_domains schemes, push_cpu should give you all the info
needed.
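[A sketch of that selection, against the 2.6.11 domain fields;
SD_LOAD_BALANCE stands in for the "SD_BALANCE flag" mentioned above.]

/* Sketch: lowest domain that spans both the busiest CPU and push_cpu,
 * i.e. the level at which load_balance() saw the imbalance. */
static struct sched_domain *lowest_spanning_domain(int this_cpu, int push_cpu)
{
	struct sched_domain *sd;

	for_each_domain(this_cpu, sd)	/* walks from the lowest level upward */
		if (cpu_isset(push_cpu, sd->span) && (sd->flags & SD_LOAD_BALANCE))
			return sd;
	return NULL;
}

/* active_load_balance() would then call move_tasks() with this sd and
 * push_cpu as the destination, instead of choosing a target itself. */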
>>Ah yep, right you are there, too. I obviously hadn't looked closely
>>enough at the recent active_load_balance patches that had gone in :(
>>What we should probably do is heed the "push_cpu" prescription (push_cpu
>>is now unused).
>
>
> push_cpu might not be the ideal destination in all cases. Take a NUMA domain
> above SMT+SMP domains in my above example. Assume P0, P1 is in node-0 and
> P2, P3 in node-1. Assume Loads of P0,P1,P2 are same as the above example,with P3
> containing one process load. Now any idle thread in P2 or P3 can trigger
> active load balance on P0. We should be selecting thread in P2 ideally
> (currently this is what we get with idle package check). But with push_cpu,
> we might move to the idle thread in P3 and then finally move to P2(it will be a
> two step process)
>
Hmm yeah. It is a bit tricky. We don't currently do exceptionally well
at this sort of "balancing over multiple domains" in the
periodic balancer either.
But at this stage I prefer to not get overly complex, and allow some
imperfect task movement, because it should rarely be a problem, and is
much better than it was before. The main place where it can go wrong
is multi-level NUMA balancing, where moving a task twice (between
different nodes) can cause more problems.
Nick
On Tue, Mar 08, 2005 at 07:17:31PM +1100, Nick Piggin wrote:
> Siddha, Suresh B wrote:
> > That still might not be enough. We probably need to pass push_cpu's
> > sd to move_tasks call in active_load_balance, instead of current busiest_cpu's
> > sd. Just like push_cpu, we need to add one more field to the runqueue which
> > will specify the domain level of the push_cpu at which we have an imbalance.
> >
>
> It should be the lowest domain level that spans both this_cpu and
> push_cpu, and has the SD_BALANCE flag set. We could possibly be a bit
I agree.
> more general here, but so long as nobody is coming up with weird and
> wonderful sched_domains schemes, push_cpu should give you all the info
> needed.
>
> > push_cpu might not be the ideal destination in all cases. Take a NUMA domain
> > above SMT+SMP domains in my above example. Assume P0, P1 is in node-0 and
> > P2, P3 in node-1. Assume Loads of P0,P1,P2 are same as the above example,with P3
> > containing one process load. Now any idle thread in P2 or P3 can trigger
> > active load balance on P0. We should be selecting thread in P2 ideally
> > (currently this is what we get with idle package check). But with push_cpu,
> > we might move to the idle thread in P3 and then finally move to P2(it will be a
> > two step process)
> >
>
> Hmm yeah. It is a bit tricky. We don't currently do exceptionally well
> at this sort of "balancing over multiple domains" in the
> periodic balancer either.
With the periodic balancer, a moved task will not be actively running,
and by the time it gets a cpu slot it will most probably be on the
correct cpu (though "most probably" is the key word here ;-)
> But at this stage I prefer to not get overly complex, and allow some
> imperfect task movement, because it should rarely be a problem, and is
> much better than it was before. The main place where it can go wrong
> is multi-level NUMA balancing, where moving a task twice (between
> different nodes) can cause more problems.
With active_load_balance, we will be moving the currently running
process. So obviously if we can move in one single step, that will be nice.
I agree with the complexity part though. And it will become more complex
with upcoming dual-core requirements.
thanks,
suresh