Hi,
I wonder why it was decided to do newidle balancing in the NUMA
domain? And with newidle_idx == 0 at that.
This means that every time the CPU goes idle, every CPU in the
system gets a remote cacheline or two hit. Not very nice O(n^2)
behaviour on the interconnect. Not to mention trashing our
NUMA locality.
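To make the cost concrete: a newidle pass at the NUMA level has to
compute group loads across the whole machine, which means reading
every other CPU's runqueue. Schematically (a rough sketch, not the
literal kernel code):

	/*
	 * One newidle balance pass once SD_BALANCE_NEWIDLE is set on
	 * the NUMA-level domain.  Each weighted_cpuload() read pulls a
	 * remote runqueue cacheline across the interconnect; n CPUs
	 * going idle, each scanning n runqueues, is the O(n^2).
	 */
	unsigned long load = 0;
	struct sched_domain *sd;

	for_each_domain(this_cpu, sd) {		/* SMT -> MC -> NODE -> NUMA */
		struct sched_group *group = sd->groups;
		do {
			int i;

			for_each_cpu(i, sched_group_cpus(group))
				load += weighted_cpuload(i);
			group = group->next;
		} while (group != sd->groups);
	}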
And then I see some proposal to do ratelimiting of newidle
balancing :( Seems like hack upon hack making behaviour much more
complex.
One "symptom" of bad mutex contention can be that increasing the
balancing rate can help a bit to reduce idle time (because it
can get the woken thread which is holding a semaphore to run ASAP
after we run out of runnable tasks in the system due to them
hitting contention on that semaphore).
I really hope this change wasn't done in order to help -rt or
something sad like sysbench on MySQL.
And btw, I'll stay out of mentioning anything about CFS development,
but it really sucks to be continually making significant changes to
domains balancing *and* per-runqueue scheduling at the same time :(
It makes it even more difficult to bisect things.
Thanks,
Nick
On Mon, 2009-11-23 at 12:22 +0100, Nick Piggin wrote:
> Hi,
>
> I wonder why it was decided to do newidle balancing in the NUMA
> domain? And with newidle_idx == 0 at that.
>
> This means that every time the CPU goes idle, every CPU in the
> system gets a remote cacheline or two hit. Not very nice O(n^2)
> behaviour on the interconnect. Not to mention trashing our
> NUMA locality.
>
> And then I see some proposal to do ratelimiting of newidle
> balancing :( Seems like hack upon hack making behaviour much more
> complex.
>
> One "symptom" of bad mutex contention can be that increasing the
> balancing rate can help a bit to reduce idle time (because it
> can get the woken thread which is holding a semaphore to run ASAP
> after we run out of runnable tasks in the system due to them
> hitting contention on that semaphore).
>
> I really hope this change wasn't done in order to help -rt or
> something sad like sysbench on MySQL.
IIRC this was kbuild and other spreading workloads that want this.
the newidle_idx=0 thing is because I frequently saw it make funny
balance decisions based on old load numbers, like f_b_g() selecting a
group that didn't even have tasks in anymore.
We went without newidle for a while, but then people started complaining
about that kbuild time, and there is an x264 encoder thing that loses
tons of throughput.
On Mon, Nov 23, 2009 at 12:36:15PM +0100, Peter Zijlstra wrote:
> On Mon, 2009-11-23 at 12:22 +0100, Nick Piggin wrote:
> > Hi,
> >
> > I wonder why it was decided to do newidle balancing in the NUMA
> > domain? And with newidle_idx == 0 at that.
> >
> > This means that every time the CPU goes idle, every CPU in the
> > system gets a remote cacheline or two hit. Not very nice O(n^2)
> > behaviour on the interconnect. Not to mention trashing our
> > NUMA locality.
> >
> > And then I see some proposal to do ratelimiting of newidle
> > balancing :( Seems like hack upon hack making behaviour much more
> > complex.
> >
> > One "symptom" of bad mutex contention can be that increasing the
> > balancing rate can help a bit to reduce idle time (because it
> > can get the woken thread which is holding a semaphore to run ASAP
> > after we run out of runnable tasks in the system due to them
> > hitting contention on that semaphore).
> >
> > I really hope this change wasn't done in order to help -rt or
> > something sad like sysbench on MySQL.
>
> IIRC this was kbuild and other spreading workloads that want this.
>
> the newidle_idx=0 thing is because I frequently saw it make funny
> balance decisions based on old load numbers, like f_b_g() selecting a
> group that didn't even have tasks in anymore.
Well it is just a damping factor on runqueue fluctuations. If the
group recently had load then the point of the idx is to account
for this. On the other hand, if we have other groups that are also
above the idx damped average, it would make sense to use them
instead. (ie. cull source groups with no pullable tasks).
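For reference, the damping I mean is the per-tick cpu_load[] update
in kernel/sched.c, which (quoting from memory, so treat as a sketch)
looks roughly like:

	static void update_cpu_load(struct rq *this_rq)
	{
		unsigned long this_load = this_rq->load.weight;
		int i, scale;

		for (i = 0, scale = 1; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
			unsigned long old_load = this_rq->cpu_load[i];
			unsigned long new_load = this_load;

			/* round up when rising so the average converges */
			if (new_load > old_load)
				new_load += scale - 1;
			this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
		}
	}

cpu_load[0] tracks the instantaneous weight and each higher idx is a
progressively slower moving average, so a group that recently had load
decays towards zero instead of vanishing the moment it goes empty.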
> We went without newidle for a while, but then people started complaining
> about that kbuild time, and there is an x264 encoder thing that loses
> tons of throughput.
So... these were due to what? Other changes in domains balancing?
Changes in CFS? Something else? Or were they comparisons versus
other operating systems?
* Peter Zijlstra <[email protected]> wrote:
> On Mon, 2009-11-23 at 12:22 +0100, Nick Piggin wrote:
> > Hi,
> >
> > I wonder why it was decided to do newidle balancing in the NUMA
> > domain? And with newidle_idx == 0 at that.
> >
> > This means that every time the CPU goes idle, every CPU in the
> > system gets a remote cacheline or two hit. Not very nice O(n^2)
> > behaviour on the interconnect. Not to mention trashing our
> > NUMA locality.
> >
> > And then I see some proposal to do ratelimiting of newidle
> > balancing :( Seems like hack upon hack making behaviour much more
> > complex.
> >
> > One "symptom" of bad mutex contention can be that increasing the
> > balancing rate can help a bit to reduce idle time (because it
> > can get the woken thread which is holding a semaphore to run ASAP
> > after we run out of runnable tasks in the system due to them
> > hitting contention on that semaphore).
> >
> > I really hope this change wasn't done in order to help -rt or
> > something sad like sysbench on MySQL.
>
> IIRC this was kbuild and other spreading workloads that want this.
>
> the newidle_idx=0 thing is because I frequently saw it make funny
> balance decisions based on old load numbers, like f_b_g() selecting a
> group that didn't even have tasks in anymore.
>
> We went without newidle for a while, but then people started
> complaining about that kbuild time, and there is an x264 encoder thing
> that loses tons of throughput.
Yep, I too reacted in a similar way to Nick initially - but I think you
are right: we really want good, precise metrics, and want to be
optional/fuzzy in our balancing _decisions_, not in our metrics.
Ingo
On Mon, 2009-11-23 at 12:43 +0100, Nick Piggin wrote:
> On Mon, Nov 23, 2009 at 12:36:15PM +0100, Peter Zijlstra wrote:
> > IIRC this was kbuild and other spreading workloads that want this.
> >
> > the newidle_idx=0 thing is because I frequently saw it make funny
> > balance decisions based on old load numbers, like f_b_g() selecting a
> > group that didn't even have tasks in anymore.
>
> Well it is just a damping factor on runqueue fluctuations. If the
> group recently had load then the point of the idx is to account
> for this. On the other hand, if we have other groups that are also
> above the idx damped average, it would make sense to use them
> instead. (ie. cull source groups with no pullable tasks).
Right, thing is, I'm still catching up from being gone, and haven't
actually read and thought through the whole rate-limiting thing :-(
If you see a better way to accomplish things, please holler.
> > We went without newidle for a while, but then people started complaining
> > about that kbuild time, and there is an x264 encoder thing that loses
> > tons of throughput.
>
> So... these were due to what? Other changes in domains balancing?
> Changes in CFS? Something else? Or were they comparisons versus
> other operating systems?
Comparison to Con's latest single-rq, spread-like-there's-no-cache-affinity
BFS thing.
On Mon, Nov 23, 2009 at 12:45:50PM +0100, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Mon, 2009-11-23 at 12:22 +0100, Nick Piggin wrote:
> > > Hi,
> > >
> > > I wonder why it was decided to do newidle balancing in the NUMA
> > > domain? And with newidle_idx == 0 at that.
> > >
> > > This means that every time the CPU goes idle, every CPU in the
> > > system gets a remote cacheline or two hit. Not very nice O(n^2)
> > > behaviour on the interconnect. Not to mention trashing our
> > > NUMA locality.
> > >
> > > And then I see some proposal to do ratelimiting of newidle
> > > balancing :( Seems like hack upon hack making behaviour much more
> > > complex.
> > >
> > > One "symptom" of bad mutex contention can be that increasing the
> > > balancing rate can help a bit to reduce idle time (because it
> > > can get the woken thread which is holding a semaphore to run ASAP
> > > after we run out of runnable tasks in the system due to them
> > > hitting contention on that semaphore).
> > >
> > > I really hope this change wasn't done in order to help -rt or
> > > something sad like sysbench on MySQL.
> >
> > IIRC this was kbuild and other spreading workloads that want this.
> >
> > the newidle_idx=0 thing is because I frequently saw it make funny
> > balance decisions based on old load numbers, like f_b_g() selecting a
> > group that didn't even have tasks in anymore.
> >
> > We went without newidle for a while, but then people started
> > complaining about that kbuild time, and there is an x264 encoder thing
> > that loses tons of throughput.
>
> Yep, I too reacted in a similar way to Nick initially - but I think you
> are right: we really want good, precise metrics, and want to be
> optional/fuzzy in our balancing _decisions_, not in our metrics.
Well to be fair, the *decision* is to use a longer-term weight for
the runqueue to reduce balancing (seeing as we naturally do far more
balancing on newidle conditions, which means that we tend to look at
our instant runqueue weight when it is 0).
* Nick Piggin <[email protected]> wrote:
> On Mon, Nov 23, 2009 at 12:45:50PM +0100, Ingo Molnar wrote:
> >
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > On Mon, 2009-11-23 at 12:22 +0100, Nick Piggin wrote:
> > > > Hi,
> > > >
> > > > I wonder why it was decided to do newidle balancing in the NUMA
> > > > domain? And with newidle_idx == 0 at that.
> > > >
> > > > This means that every time the CPU goes idle, every CPU in the
> > > > system gets a remote cacheline or two hit. Not very nice O(n^2)
> > > > behaviour on the interconnect. Not to mention trashing our
> > > > NUMA locality.
> > > >
> > > > And then I see some proposal to do ratelimiting of newidle
> > > > balancing :( Seems like hack upon hack making behaviour much more
> > > > complex.
> > > >
> > > > One "symptom" of bad mutex contention can be that increasing the
> > > > balancing rate can help a bit to reduce idle time (because it
> > > > can get the woken thread which is holding a semaphore to run ASAP
> > > > after we run out of runnable tasks in the system due to them
> > > > hitting contention on that semaphore).
> > > >
> > > > I really hope this change wasn't done in order to help -rt or
> > > > something sad like sysbench on MySQL.
> > >
> > > IIRC this was kbuild and other spreading workloads that want this.
> > >
> > > the newidle_idx=0 thing is because I frequently saw it make funny
> > > balance decisions based on old load numbers, like f_b_g() selecting a
> > > group that didn't even have tasks in anymore.
> > >
> > > We went without newidle for a while, but then people started
> > > complaining about that kbuild time, and there is an x264 encoder thing
> > > that loses tons of throughput.
> >
> > Yep, I too reacted in a similar way to Nick initially - but I think you
> > are right: we really want good, precise metrics, and want to be
> > optional/fuzzy in our balancing _decisions_, not in our metrics.
>
> Well to be fair, the *decision* is to use a longer-term weight for the
> runqueue to reduce balancing (seeing as we naturally do far more
> balancing on newidle conditions, which means that we tend to look at
> our instant runqueue weight when it is 0).
Well, the problem with that is that it uses a potentially outdated piece
of metric - and that can become visible if balancing events are rare
enough.
I.e. we do need a time scale (rate of balancing) to be able to do this
correctly on a statistical level - which pretty much brings in 'rate
limit' kind of logic.
We are better off observing reality precisely and then saying "don't do
this action" instead of fuzzing our metrics [or using fuzzy metrics
conditionally - which is really the same] and hoping that in the end it
will be as if we hadn't made certain decisions.
(I hope I explained my point clearly enough.)
No argument that it could be done cleaner - the duality right now of
both having the fuzzy stats and the rate limiting should be decided
one way or another.
Also, no argument that if you can measure bad effects from this change
on any workload we need to look at that and fix it.
Thanks,
Ingo
On Mon, Nov 23, 2009 at 12:50:45PM +0100, Peter Zijlstra wrote:
> On Mon, 2009-11-23 at 12:43 +0100, Nick Piggin wrote:
> > On Mon, Nov 23, 2009 at 12:36:15PM +0100, Peter Zijlstra wrote:
>
> > > IIRC this was kbuild and other spreading workloads that want this.
> > >
> > > the newidle_idx=0 thing is because I frequently saw it make funny
> > > balance decisions based on old load numbers, like f_b_g() selecting a
> > > group that didn't even have tasks in anymore.
> >
> > Well it is just a damping factor on runqueue fluctuations. If the
> > group recently had load then the point of the idx is to account
> > for this. On the other hand, if we have other groups that are also
> > above the idx damped average, it would make sense to use them
> > instead. (ie. cull source groups with no pullable tasks).
>
> Right, thing is, I'm still catching up from being gone, and haven't
> actually read and thought through the whole rate-limiting thing :-(
>
> If you see a better way to accomplish things, please holler.
Well, not by adding another hack on top of newidle because it
is too aggressive after being turned on where it shouldn't
be :)
Within an LLC we really want to be quite aggressive at moving
tasks around in order to minimise idle time.
*Perhaps* another type of balancing mode which is rate limited.
However...
> > > We went without newidle for a while, but then people started complaining
> > > about that kbuild time, and there is an x264 encoder thing that loses
> > > tons of throughput.
> >
> > So... these were due to what? Other changes in domains balancing?
> > Changes in CFS? Something else? Or were they comparisons versus
> > other operating systems?
>
> Comparison to Con's latest single-rq, spread-like-there's-no-cache-affinity
> BFS thing.
Seems like a bit of a knee-jerk reaction here. sched domains balancing
seems to have been _relatively_ good for quite a long time without lots
of changes. And lots of users (distros, apparently google) still haven't
got a handle on CFS performance problems, so I would really have preferred
to introduce such extra changes far, far more slowly.
(ie. wait until CFS is more sorted out and has at least gone through some
maturation and deployment in distros).
I can easily make test cases where performance goes up by some
orders of magnitude by making the balancing totally aggressive over all
domains. And we actually even have a report of a (closed source)
java app having slowdowns due to this. Thing is, workloads that
require such frequent cross-numa-node balancing are maybe not really
scalable anyway.
kbuild -- this is frequent communications, short lived tasks, so it
is a good case for more aggressive balancing at the expense of NUMA
affinity. But this is probably further over to that side than a lot
of other workloads, so if our defaults aren't the *best* for kbuild
(ie. not quite aggressive enough) then I think that is probably a
good sign.
How large was the kbuild improvement? Was O(1) tried as well?
On Mon, Nov 23, 2009 at 01:08:49PM +0100, Ingo Molnar wrote:
>
> * Nick Piggin <[email protected]> wrote:
>
> > On Mon, Nov 23, 2009 at 12:45:50PM +0100, Ingo Molnar wrote:
> > Well to be fair, the *decision* is to use a longer-term weight for the
> > runqueue to reduce balancing (seeing as we naturally do far more
> > balancing on newidle conditions, which means that we tend to look at
> > our instant runqueue weight when it is 0).
>
> Well, the problem with that is that it uses a potentially outdated piece
> of metric - and that can become visible if balancing events are rare
> enough.
It shouldn't; it happens on the local CPU, at each scheduler tick.
> I.e. we do need a time scale (rate of balancing) to be able to do this
> correctly on a statistical level - which pretty much brings in 'rate
> limit' kind of logic.
>
> We are better off observing reality precisely and then saying "don't do
> this action" instead of fuzzing our metrics [or using fuzzy metrics
> conditionally - which is really the same] and hoping that in the end it
> will be as if we hadn't made certain decisions.
We do though. We take the instant value and the weighted value and
take the min or max (depending on source or destination).
And we decide to do that because the instantaneous fluctuations
in runqueue load can be far, far outside even a normal short-term
operating condition.
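Concretely, that min/max lives in source_load()/target_load(); from
memory (again, a sketch rather than the exact code) it is roughly:

	static unsigned long source_load(int cpu, int type)
	{
		struct rq *rq = cpu_rq(cpu);
		unsigned long total = weighted_cpuload(cpu);

		if (type == 0 || !sched_feat(LB_BIAS))
			return total;

		/* pulling from @cpu: don't overestimate its load */
		return min(rq->cpu_load[type-1], total);
	}

	static unsigned long target_load(int cpu, int type)
	{
		struct rq *rq = cpu_rq(cpu);
		unsigned long total = weighted_cpuload(cpu);

		if (type == 0 || !sched_feat(LB_BIAS))
			return total;

		/* pushing to @cpu: don't underestimate its load */
		return max(rq->cpu_load[type-1], total);
	}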
I don't know how you would otherwise propose to damp those fluctuations
that don't actually require any balancing. Rate limiting doesn't get
rid of those at all, it just balances a bit less frequently. But
on the NUMA domain, memory affinity has been destroyed whether a task is
moved once or 100 times.
> (I hope I explained my point clearly enough.)
>
> No argument that it could be done cleaner - the duality right now of
> both having the fuzzy stats and the rate limiting should be decided
> one way or another.
Well, I would say please keep domain balancing behaviour at least
somewhat close to how it was with the O(1) scheduler until CFS
is more sorted out. There is no need to knee-jerk because BFS is
better at something.
We can certainly look at making improvements and take cues from
BFS to use in sched domains, but it is so easy to introduce regressions,
and regressions versus previous kernels are much more important
than slowdowns versus an out-of-tree patch. So while CFS is still
going through troubles I think it is much better to slow down on
sched-domains changes.
Thanks,
* Nick Piggin <[email protected]> wrote:
> > (I hope I explained my point clearly enough.)
> >
> > No argument that it could be done cleaner - the duality right now of
> > both having the fuzzy stats and the rate limiting should be decided
> > one way or another.
>
> Well, I would say please keep domain balancing behaviour at least
> somewhat close to how it was with the O(1) scheduler until CFS is
> more sorted out. There is no need to knee-jerk because BFS is better
> at something.
Well, see the change (commit 0ec9fab3d) attached below - even in
hindsight it was well argued by Mike with quite a few numbers: +68% in
x264 performance.
If you think it's a bad change (combined with newidle-rate-limit) then
please show the numbers that counter it and preferably extend 'perf
bench sched' with the testcases that you care about most.
IIRC (and Peter mentioned this too) Mike got into the whole kbuild angle
due to comparing the mainline scheduler to BFS - but that's really a
positive thing IMO, not some "knee-jerk reaction".
Ingo
----------------------->
From 0ec9fab3d186d9cbb00c0f694d4a260d07c198d9 Mon Sep 17 00:00:00 2001
From: Mike Galbraith <[email protected]>
Date: Tue, 15 Sep 2009 15:07:03 +0200
Subject: [PATCH] sched: Improve latencies and throughput
Make the idle balancer more aggressive, to improve an
x264 encoding workload provided by Jason Garrett-Glaser:
NEXT_BUDDY NO_LB_BIAS
encoded 600 frames, 252.82 fps, 22096.60 kb/s
encoded 600 frames, 250.69 fps, 22096.60 kb/s
encoded 600 frames, 245.76 fps, 22096.60 kb/s
NO_NEXT_BUDDY LB_BIAS
encoded 600 frames, 344.44 fps, 22096.60 kb/s
encoded 600 frames, 346.66 fps, 22096.60 kb/s
encoded 600 frames, 352.59 fps, 22096.60 kb/s
NO_NEXT_BUDDY NO_LB_BIAS
encoded 600 frames, 425.75 fps, 22096.60 kb/s
encoded 600 frames, 425.45 fps, 22096.60 kb/s
encoded 600 frames, 422.49 fps, 22096.60 kb/s
Peter pointed out that this is better done via newidle_idx,
not via LB_BIAS, newidle balancing should look for where
there is load _now_, not where there was load 2 ticks ago.
Worst-case latencies are improved as well as no buddies
means less vruntime spread. (as per prior lkml discussions)
This change improves kbuild-peak parallelism as well.
Reported-by: Jason Garrett-Glaser <[email protected]>
Signed-off-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/ia64/include/asm/topology.h | 5 +++--
arch/powerpc/include/asm/topology.h | 2 +-
arch/sh/include/asm/topology.h | 3 ++-
arch/x86/include/asm/topology.h | 4 +---
include/linux/topology.h | 2 +-
kernel/sched_features.h | 2 +-
6 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index 47f3c51..42f1673 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -61,7 +61,7 @@ void build_cpu_to_node_map(void);
.cache_nice_tries = 2, \
.busy_idx = 2, \
.idle_idx = 1, \
- .newidle_idx = 2, \
+ .newidle_idx = 0, \
.wake_idx = 0, \
.forkexec_idx = 1, \
.flags = SD_LOAD_BALANCE \
@@ -87,10 +87,11 @@ void build_cpu_to_node_map(void);
.cache_nice_tries = 2, \
.busy_idx = 3, \
.idle_idx = 2, \
- .newidle_idx = 2, \
+ .newidle_idx = 0, \
.wake_idx = 0, \
.forkexec_idx = 1, \
.flags = SD_LOAD_BALANCE \
+ | SD_BALANCE_NEWIDLE \
| SD_BALANCE_EXEC \
| SD_BALANCE_FORK \
| SD_BALANCE_WAKE \
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index a6b220a..1a2c9eb 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -57,7 +57,7 @@ static inline int pcibus_to_node(struct pci_bus *bus)
.cache_nice_tries = 1, \
.busy_idx = 3, \
.idle_idx = 1, \
- .newidle_idx = 2, \
+ .newidle_idx = 0, \
.wake_idx = 0, \
.flags = SD_LOAD_BALANCE \
| SD_BALANCE_EXEC \
diff --git a/arch/sh/include/asm/topology.h b/arch/sh/include/asm/topology.h
index 9054e5c..c843677 100644
--- a/arch/sh/include/asm/topology.h
+++ b/arch/sh/include/asm/topology.h
@@ -15,13 +15,14 @@
.cache_nice_tries = 2, \
.busy_idx = 3, \
.idle_idx = 2, \
- .newidle_idx = 2, \
+ .newidle_idx = 0, \
.wake_idx = 0, \
.forkexec_idx = 1, \
.flags = SD_LOAD_BALANCE \
| SD_BALANCE_FORK \
| SD_BALANCE_EXEC \
| SD_BALANCE_WAKE \
+ | SD_BALANCE_NEWIDLE \
| SD_SERIALIZE, \
.last_balance = jiffies, \
.balance_interval = 1, \
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 4b1b335..7fafd1b 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -116,14 +116,12 @@ extern unsigned long node_remap_size[];
# define SD_CACHE_NICE_TRIES 1
# define SD_IDLE_IDX 1
-# define SD_NEWIDLE_IDX 2
# define SD_FORKEXEC_IDX 0
#else
# define SD_CACHE_NICE_TRIES 2
# define SD_IDLE_IDX 2
-# define SD_NEWIDLE_IDX 2
# define SD_FORKEXEC_IDX 1
#endif
@@ -137,7 +135,7 @@ extern unsigned long node_remap_size[];
.cache_nice_tries = SD_CACHE_NICE_TRIES, \
.busy_idx = 3, \
.idle_idx = SD_IDLE_IDX, \
- .newidle_idx = SD_NEWIDLE_IDX, \
+ .newidle_idx = 0, \
.wake_idx = 0, \
.forkexec_idx = SD_FORKEXEC_IDX, \
\
diff --git a/include/linux/topology.h b/include/linux/topology.h
index c87edcd..4298745 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -151,7 +151,7 @@ int arch_update_cpu_topology(void);
.cache_nice_tries = 1, \
.busy_idx = 2, \
.idle_idx = 1, \
- .newidle_idx = 2, \
+ .newidle_idx = 0, \
.wake_idx = 0, \
.forkexec_idx = 1, \
\
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 891ea0f..e98c2e8 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -67,7 +67,7 @@ SCHED_FEAT(AFFINE_WAKEUPS, 1)
* wakeup-preemption), since its likely going to consume data we
* touched, increases cache locality.
*/
-SCHED_FEAT(NEXT_BUDDY, 1)
+SCHED_FEAT(NEXT_BUDDY, 0)
/*
* Prefer to schedule the task that ran last (when we did
On Mon, 2009-11-23 at 12:22 +0100, Nick Piggin wrote:
> Hi,
>
> I wonder why it was decided to do newidle balancing in the NUMA
> domain? And with newidle_idx == 0 at that.
>
> This means that every time the CPU goes idle, every CPU in the
> system gets a remote cacheline or two hit. Not very nice O(n^2)
> behaviour on the interconnect. Not to mention trashing our
> NUMA locality.
Painful on little boxen too if left unchained.
> And then I see some proposal to do ratelimiting of newidle
> balancing :( Seems like hack upon hack making behaviour much more
> complex.
That's mine, and yeah, it is hackish. It just keeps newidle at bay for
high speed switchers while keeping it available to kick start CPUs for
fork/exec loads. Suggestions welcome. I have a threaded testcase
(x264) where turning the thing off costs ~40% throughput. Take that
same testcase (or ilk) to a big NUMA beast, and performance will very
likely suck just as bad as it does on my little Q6600 box.
Other than that, I'd be most happy to see the thing crawl back into its
cave and _die_ despite the little gain it provides for a kbuild. It has
been (is) very annoying.
> One "symptom" of bad mutex contention can be that increasing the
> balancing rate can help a bit to reduce idle time (because it
> can get the woken thread which is holding a semaphore to run ASAP
> after we run out of runnable tasks in the system due to them
> hitting contention on that semaphore).
Yes, when mysql+oltp starts jamming up, load balancing helps bust up the
logjam somewhat, but that's not at all why newidle was activated..
> I really hope this change wasn't done in order to help -rt or
> something sad like sysbench on MySQL.
Newidle was activated to improve fork/exec CPU utilization. A nasty
side effect is that it tries to rip other loads to tatters.
> And btw, I'll stay out of mentioning anything about CFS development,
> but it really sucks to be continually making significant changes to
> domains balancing *and* per-runqueue scheduling at the same time :(
> It makes it even more difficult to bisect things.
Yeah, balancing got jumbled up with desktop tweakage. Much fallout this
round, and some things still to be fixed back up.
-Mike
On Mon, Nov 23, 2009 at 03:37:39PM +0100, Mike Galbraith wrote:
> On Mon, 2009-11-23 at 12:22 +0100, Nick Piggin wrote:
> > Hi,
> >
> > I wonder why it was decided to do newidle balancing in the NUMA
> > domain? And with newidle_idx == 0 at that.
> >
> > This means that every time the CPU goes idle, every CPU in the
> > system gets a remote cacheline or two hit. Not very nice O(n^2)
> > behaviour on the interconnect. Not to mention trashing our
> > NUMA locality.
>
> Painful on little boxen too if left unchained.
Yep. It's an order of magnitude more expensive to go out on the
interconnect than to stay in the LLC. So even on little systems,
newidle balancing can become an order of magnitude more
expensive.
On slightly larger systems, where you have an order of magnitude
more cores on remote nodes than local ones, newidle balancing can now
be two orders of magnitude more expensive.
> > And then I see some proposal to do ratelimiting of newidle
> > balancing :( Seems like hack upon hack making behaviour much more
> > complex.
>
> That's mine, and yeah, it is hackish. It just keeps newidle at bay for
> high speed switchers while keeping it available to kick start CPUs for
> fork/exec loads. Suggestions welcome. I have a threaded testcase
> (x264) where turning the thing off costs ~40% throughput. Take that
> same testcase (or ilk) to a big NUMA beast, and performance will very
> likely suck just as bad as it does on my little Q6600 box.
>
> Other than that, I'd be most happy to see the thing crawl back into its
> cave and _die_ despite the little gain it provides for a kbuild. It has
> been (is) very annoying.
Wait, you say it was activated to improve fork/exec CPU utilization?
For the x264 load? What do you mean by this? Do you mean it is doing
a lot of fork/exec/exits and load is not being spread quickly enough?
Or that NUMA allocations get screwed up because tasks don't get spread
out quickly enough before running?
In either case, I think newidle balancing is maybe not the right solution.
newidle balancing only checks the system state when the destination
CPU goes idle. fork events increase load at the source CPU. So for
example if you find newidle helps to pick up forks, then if the
newidle event happens to come in before the fork, we'll have to wait
for the next rebalance event.
So possibly making fork/exec balancing more aggressive might be a
better approach. This can be done by reducing the damping idx, or
perhaps by relaxing other conditions, eg. lowering imbalance_pct,
for forkexec balancing. Probably needs some studying of the workload
to work out why forkexec is failing.
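In sched_domain initializer terms, that would be something like the
following hypothetical tweak to the node-level template (illustrative,
untested values only):

	/* hypothetical: more eager fork/exec placement at the node level */
	.forkexec_idx	= 0,	/* place on instantaneous load */
	.imbalance_pct	= 110,	/* act on smaller imbalances than stock */
	.flags		= SD_LOAD_BALANCE
			| SD_BALANCE_FORK
			| SD_BALANCE_EXEC,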
> > One "symptom" of bad mutex contention can be that increasing the
> > balancing rate can help a bit to reduce idle time (because it
> > can get the woken thread which is holding a semaphore to run ASAP
> > after we run out of runnable tasks in the system due to them
> > hitting contention on that semaphore).
>
> Yes, when mysql+oltp starts jamming up, load balancing helps bust up the
> logjam somewhat, but that's not at all why newidle was activated..
OK good to know.
> > I really hope this change wasn't done in order to help -rt or
> > something sad like sysbench on MySQL.
>
> Newidle was activated to improve fork/exec CPU utilization. A nasty
> side effect is that it tries to rip other loads to tatters.
>
> > And btw, I'll stay out of mentioning anything about CFS development,
> > but it really sucks to be continually making significant changes to
> > domains balancing *and* per-runqueue scheduling at the same time :(
> > It makes it even more difficult to bisect things.
>
> Yeah, balancing got jumbled up with desktop tweakage. Much fallout this
> round, and some things still to be fixed back up.
OK. This would be great if fixing up involves making things closer
to what they were rather than adding more complex behaviour on top
of other changes that broke stuff. And doing it in 2.6.32 would be
kind of nice...
On Mon, 2009-11-23 at 16:11 +0100, Nick Piggin wrote:
> Wait, you say it was activated to improve fork/exec CPU utilization?
> For the x264 load? What do you mean by this? Do you mean it is doing
> a lot of fork/exec/exits and load is not being spread quickly enough?
> Or that NUMA allocations get screwed up because tasks don't get spread
> out quickly enough before running?
>
> In either case, I think newidle balancing is maybe not the right solution.
> newidle balancing only checks the system state when the destination
> CPU goes idle. fork events increase load at the source CPU. So for
> example if you find newidle helps to pick up forks, then if the
> newidle event happens to come in before the fork, we'll have to wait
> for the next rebalance event.
>
> So possibly making fork/exec balancing more aggressive might be a
> better approach. This can be done by reducing the damping idx, or
> perhaps by relaxing other conditions, eg. lowering imbalance_pct,
> for forkexec balancing. Probably needs some studying of the workload
> to work out why forkexec is failing.
From what I can remember of that workload, it basically spawns tons of
very short lived threads, waits for a bunch to complete, goto 1.
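Roughly this pattern, as a userspace sketch (encode_frame() and the
batch size are made up for illustration):

	#include <pthread.h>

	#define BATCH	8			/* threads in flight per pass */

	extern void *encode_frame(void *arg);	/* hypothetical worker */

	static void encode_loop(void)
	{
		pthread_t th[BATCH];
		int i;

		for (;;) {			/* "goto 1" */
			for (i = 0; i < BATCH; i++)
				pthread_create(&th[i], NULL, encode_frame, NULL);
			for (i = 0; i < BATCH; i++)
				pthread_join(th[i], NULL);
			/*
			 * CPUs drain idle one by one as the batch
			 * completes; nothing refills them until some
			 * balance event fires.
			 */
		}
	}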
Fork balancing only works until all cpus are active. But once a core
goes idle it's left idle until we hit a general load-balance cycle.
Newidle helps because it picks up these threads from other cpus,
completing the current batch sooner, allowing the program to continue
with the next.
There's just not much you can do from the fork() side of things once
you've got them all running.
> OK. This would be great if fixing up involves making things closer
> to what they were rather than adding more complex behaviour on top
> of other changes that broke stuff. And doing it in 2.6.32 would be
> kind of nice...
.32 is kind of closed, with us being at -rc8.
On Mon, Nov 23, 2009 at 04:21:44PM +0100, Peter Zijlstra wrote:
> On Mon, 2009-11-23 at 16:11 +0100, Nick Piggin wrote:
>
> > Wait, you say it was activated to improve fork/exec CPU utilization?
> > For the x264 load? What do you mean by this? Do you mean it is doing
> > a lot of fork/exec/exits and load is not being spread quickly enough?
> > Or that NUMA allocations get screwed up because tasks don't get spread
> > out quickly enough before running?
> >
> > In either case, I think newidle balancing is maybe not the right solution.
> > newidle balancing only checks the system state when the destination
> > CPU goes idle. fork events increase load at the source CPU. So for
> > example if you find newidle helps to pick up forks, then if the
> > newidle event happens to come in before the fork, we'll have to wait
> > for the next rebalance event.
> >
> > So possibly making fork/exec balancing more aggressive might be a
> > better approach. This can be done by reducing the damping idx, or
> > perhaps by relaxing other conditions, eg. lowering imbalance_pct,
> > for forkexec balancing. Probably needs some studying of the workload
> > to work out why forkexec is failing.
>
> From what I can remember of that workload, it basically spawns tons of
> very short lived threads, waits for a bunch to complete, goto 1.
So basically about the least well performing or scalable possible
software architecture. This is exactly the wrong thing to optimise
for, guys.
The fact that you have to coax the scheduler into touching heaps
more remote cachelines and vastly increasing the amount of inter
node task migration should have been kind of a hint.
> Fork balancing only works until all cpus are active. But once a core
> goes idle it's left idle until we hit a general load-balance cycle.
> Newidle helps because it picks up these threads from other cpus,
> completing the current batch sooner, allowing the program to continue
> with the next.
>
> There's just not much you can do from the fork() side of things once
> you've got them all running.
It sounds like allowing fork balancing to be more aggressive could
definitely help.
> > OK. This would be great if fixing up involves making things closer
> > to what they were rather than adding more complex behaviour on top
> > of other changes that broke stuff. And doing it in 2.6.32 would be
> > kind of nice...
>
> .32 is kind of closed, with us being at -rc8.
It's a bad regression though.
On Mon, 2009-11-23 at 16:29 +0100, Nick Piggin wrote:
>
> > .32 is kind of closed, with us being at -rc8.
>
> It's a bad regression though.
Still haven't heard what workload, what machine, by how much, nor any
reproducer.
And again, -rc8 is really late.
On Mon, 2009-11-23 at 16:29 +0100, Nick Piggin wrote:
> So basically about the least well performing or scalable possible
> software architecture. This is exactly the wrong thing to optimise
> for, guys.
Hm. Isn't fork/exec our daily bread?
> The fact that you have to coax the scheduler into touching heaps
> more remote cachelines and vastly increasing the amount of inter
> node task migration should have been kind of a hint.
>
>
> > Fork balancing only works until all cpus are active. But once a core
> > goes idle it's left idle until we hit a general load-balance cycle.
> > Newidle helps because it picks up these threads from other cpus,
> > completing the current batch sooner, allowing the program to continue
> > with the next.
> >
> > There's just not much you can do from the fork() side of things once
> > you've got them all running.
>
> It sounds like allowing fork balancing to be more aggressive could
> definitely help.
It doesn't. Task which is _already_ forked, placed and waiting over
yonder can't do spit for getting this cpu active again without running
so he can phone home. This isn't only observable with x264, it just
rubs our noses in it. It is also quite observable in a kbuild. What if
the waiter is your next fork?
-Mike
* Nick Piggin <[email protected]> wrote:
> > .32 is kind of closed, with us being at -rc8.
>
> It's a bad regression though.
It's about 3 months too late for that. Ideally we want performance
regressions to be looked for and reported when the patches go into the
devel tree. Failing that, -rc1 would be a good time to re-test
whatever workload you care about.
If you cannot test it in a regular fashion you can offload the testing
to us, by adding a similar/equivalent workload to 'perf bench sched'.
We'll make sure it stays sane.
Thanks,
Ingo
On Mon, Nov 23, 2009 at 01:46:15PM +0100, Ingo Molnar wrote:
>
> * Nick Piggin <[email protected]> wrote:
>
> > > (I hope I explained my point clearly enough.)
> > >
> > > No argument that it could be done cleaner - the duality right now of
> > > both having the fuzzy stats and the rate limiting should be decided
> > > one way or another.
> >
> > Well, I would say please keep domain balancing behaviour at least
> > somewhat close to how it was with the O(1) scheduler until CFS is
> > more sorted out. There is no need to knee-jerk because BFS is better
> > at something.
>
> Well, see the change (commit 0ec9fab3d) attached below - even in
> hindsight it was well argued by Mike with quite a few numbers: +68% in
> x264 performance.
Quite a few being one test case, and on a program with a horrible
parallelism design (rapid heavy weight forks to distribute small
units of work).
> If you think it's a bad change (combined with newidle-rate-limit) then
> please show the numbers that counter it and preferably extend 'perf
> bench sched' with the testcases that you care about most.
>
> IIRC (and Peter mentioned this too) Mike got into the whole kbuild angle
> due to comparing the mainline scheduler to BFS - but that's really a
> positive thing IMO, not some "knee-jerk reaction".
It is a knee-jerk "we have to be better than BFS" reaction. It is
quite obvious that vastly increasing inter-node balancing and remote
cacheline touches is not a good idea (unless it helps _reasonably_
designed programs without unduly hurting others).
And when you are changing all this desktop interactivity stuff, I
would ask you please not to make these kinds of "improvements" to domains
balancing. I don't think that is too much to ask given the track record
of scheduler problems.
Finally, I didn't realise I would have to be responsible for shooting
down every "ooh shiny" change that gets pushed into the scheduler. I
would have hoped they would be rather well thought out, reviewed,
and tested first.
I do think it is a bad change, for extensive reasons I explained. It
will obviously cause regressions on workloads with lots of newidle
events that are un-balanceable (which would include _well tuned_
network and database servers). And it does. I thought you'd have been
aware of that.
http://patchwork.kernel.org/patch/58931/
> [ full quote of Ingo's sig and the 0ec9fab3d patch snipped ]
On Mon, Nov 23, 2009 at 04:53:37PM +0100, Mike Galbraith wrote:
> On Mon, 2009-11-23 at 16:29 +0100, Nick Piggin wrote:
>
> > So basically about the least well performing or scalable possible
> > software architecture. This is exactly the wrong thing to optimise
> > for, guys.
>
> Hm. Isn't fork/exec our daily bread?
No. Not for handing out tiny chunks of work and attempting to do
them in parallel. There is this thing called Amdahl's law, and if
you write a parallel program that wantonly uses the heaviest
possible primitives in its serial sections, then it doesn't deserve
to go fast.
That is what IPC or shared memory is for. Vastly faster, vastly more
scalable, vastly easier for scheduler balancing (both via manual or
automatic placement).
> > The fact that you have to coax the scheduler into touching heaps
> > more remote cachelines and vastly increasing the amount of inter
> > node task migration should have been kind of a hint.
> >
> >
> > > Fork balancing only works until all cpus are active. But once a core
> > > goes idle it's left idle until we hit a general load-balance cycle.
> > > Newidle helps because it picks up these threads from other cpus,
> > > completing the current batch sooner, allowing the program to continue
> > > with the next.
> > >
> > > There's just not much you can do from the fork() side of things once
> > > you've got them all running.
> >
> > It sounds like allowing fork balancing to be more aggressive could
> > definitely help.
>
> It doesn't. Task which is _already_ forked, placed and waiting over
> yonder can't do spit for getting this cpu active again without running
> so he can phone home. This isn't only observable with x264, it just
> rubs our noses in it. It is also quite observable in a kbuild. What if
> the waiter is your next fork?
I'm not saying that vastly increasing task movement between NUMA
nodes won't *help* some workloads. Indeed they tend to be ones that
aren't very well parallelised (then it becomes critical to wake up
any waiter if a CPU becomes free because it might be holding a
heavily contended resource).
But can you appreciate that these are at one side of the spectrum of
workloads, and that others will much prefer to keep good affinity?
No matter how "nice" your workload is, you can't keep traffic off
the interconnect if the kernel screws up your numa placement.
And also, I'm not saying that we were at _exactly_ the right place
before and there was no room for improvement, but considering that
we didn't have a lot of active _regressions_ in the balancer, we
can really use that to our favour and concentrate changes in code
that does have regressions. And be really conservative and careful
with changes to the balancer.
On Mon, Nov 23, 2009 at 04:37:27PM +0100, Peter Zijlstra wrote:
> On Mon, 2009-11-23 at 16:29 +0100, Nick Piggin wrote:
> >
> > > .32 is kind of closed, with us being at -rc8.
> >
> > It's a bad regression though.
>
> Still haven't heard what workload, what machine, by how much, nor any
> reproducer.
http://patchwork.kernel.org/patch/58931/
> And again, -rc8 is really late.
Exactly the right time to sort out regressions, however.
On Mon, Nov 23, 2009 at 06:04:45PM +0100, Ingo Molnar wrote:
>
> * Nick Piggin <[email protected]> wrote:
>
> > > .32 is kind of closed, with us being at -rc8.
> >
> > It's a bad regression though.
>
> It's about 3 months too late for that. Ideally we want performance
Too late for what? Reporting and reverting a regression? I don't
think so. It is not my problem if patches aren't tested well
enough before being merged.
If we release a kernel with this known problematic scheduler behaviour
then it gives userspace application writers far harder targets, and
also it will give *some* 2.6.32 users regressions if we find it has
to be fixed in 2.6.33.
> regressions to be looked for and reported when the patches go into the
> devel tree. Failing that, -rc1 would be a good time to re-test
> whatever workload you care about.
>
> If you cannot test it in a regular fashion you can offload the testing
> to us, by adding a similar/equivalent workload to 'perf bench sched'.
> We'll make sure it stays sane.
What tests exactly *were* done on it? Anything that might possibly
trigger its obvious potential for detrimental behaviour? Something
like netperf?
On Tue, 2009-11-24 at 07:53 +0100, Nick Piggin wrote:
> On Mon, Nov 23, 2009 at 04:53:37PM +0100, Mike Galbraith wrote:
> > On Mon, 2009-11-23 at 16:29 +0100, Nick Piggin wrote:
> >
> > > So basically about the least well performing or scalable possible
> > > software architecture. This is exactly the wrong thing to optimise
> > > for, guys.
> >
> > Hm. Isn't fork/exec our daily bread?
>
> No. Not for handing out tiny chunks of work and attempting to do
> them in parallel. There is this thing called Amdahl's law, and if
> you write a parallel program that wantonly uses the heaviest
> possible primitives in its serial sections, then it doesn't deserve
> to go fast.
OK by me. A bit of idle time for kbuild is easily cured with telling
make to emit more jobs, so there's enough little jobs to go around.
If x264 is declared dainbramaged, that's fine with me too.
> That is what IPC or shared memory is for. Vastly faster, vastly more
> scalable, vastly easier for scheduler balancing (both via manual or
> automatic placement).
All well and good for apps that use them. Again, _I_ don't care either
way. As stated, I wouldn't cry if newidle died a gruesome death, it has
irritated me more than you would ever like to hear about ;-)
> > > The fact that you have to coax the scheduler into touching heaps
> > > more remote cachelines and vastly increasing the amount of inter
> > > node task migration should have been kind of a hint.
> > >
> > >
> > > > Fork balancing only works until all cpus are active. But once a core
> > > > goes idle it's left idle until we hit a general load-balance cycle.
> > > > Newidle helps because it picks up these threads from other cpus,
> > > > completing the current batch sooner, allowing the program to continue
> > > > with the next.
> > > >
> > > > There's just not much you can do from the fork() side of things once
> > > > you've got them all running.
> > >
> > > It sounds like allowing fork balancing to be more aggressive could
> > > definitely help.
> >
> > It doesn't. Task which is _already_ forked, placed and waiting over
> > yonder can't do spit for getting this cpu active again without running
> > so he can phone home. This isn't only observable with x264, it just
> > rubs our noses in it. It is also quite observable in a kbuild. What if
> > the waiter is your next fork?
>
> I'm not saying that vastly increasing task movement between NUMA
> nodes won't *help* some workloads. Indeed they tend to be ones that
> aren't very well parallelised (then it becomes critical to wake up
> any waiter if a CPU becomes free because it might be holding a
> heavily contended resource).
It absolutely will. Your counter arguments are also fully valid.
> But can you appreciate that these are at one side of the spectrum of
> workloads, and that others will much prefer to keep good affinity?
Yup. Anything with cache footprint. That's why I thumped newidle on
its pointy head. Trouble is, there's no way to make it perfect for
both. As is, there's some pain for both. Maybe TOO much for big iron.
> No matter how "nice" your workload is, you can't keep traffic off
> the interconnect if the kernel screws up your numa placement.
Don't matter if it's NUMA pain. Pain is pain.
> And also, I'm not saying that we were at _exactly_ the right place
> before and there was no room for improvement, but considering that
> we didn't have a lot of active _regressions_ in the balancer, we
> can really use that to our favour and concentrate changes in code
> that does have regressions. And be really conservative and careful
> with changes to the balancer.
No argument against caution.
-Mike
On Tue, 2009-11-24 at 09:40 +0100, Mike Galbraith wrote:
> On Tue, 2009-11-24 at 07:53 +0100, Nick Piggin wrote:
> > On Mon, Nov 23, 2009 at 04:53:37PM +0100, Mike Galbraith wrote:
> > > On Mon, 2009-11-23 at 16:29 +0100, Nick Piggin wrote:
> > >
> > > > So basically about the least well performing or scalable possible
> > > > software architecture. This is exactly the wrong thing to optimise
> > > > for, guys.
> > >
> > > Hm. Isn't fork/exec our daily bread?
> >
> > No. Not for handing out tiny chunks of work and attempting to do
> > them in parallel. There is this thing called Amdahl's law, and if
> > you write a parallel program that wantonly uses the heaviest
> > possible primitives in its serial sections, then it doesn't deserve
> > to go fast.
>
> OK by me. A bit of idle time for kbuild is easily cured with telling
> make to emit more jobs, so there's enough little jobs to go around.
>
> If x264 is declared dainbramaged, that's fine with me too.
(P.S. I don't want to have to explain to users of any such thread-happy
applications why they suck rocks under Linux though)
* Mike Galbraith <[email protected]> wrote:
> On Tue, 2009-11-24 at 09:40 +0100, Mike Galbraith wrote:
> > On Tue, 2009-11-24 at 07:53 +0100, Nick Piggin wrote:
> > > On Mon, Nov 23, 2009 at 04:53:37PM +0100, Mike Galbraith wrote:
> > > > On Mon, 2009-11-23 at 16:29 +0100, Nick Piggin wrote:
> > > >
> > > > > So basically about the least well performing or scalable possible
> > > > > software architecture. This is exactly the wrong thing to optimise
> > > > > for, guys.
> > > >
> > > > Hm. Isn't fork/exec our daily bread?
> > >
> > > No. Not for handing out tiny chunks of work and attempting to do
> > > them in parallel. There is this thing called Amdahl's law, and if
> > > you write a parallel program that wantonly uses the heaviest
> > > possible primitives in its serial sections, then it doesn't deserve
> > > to go fast.
> >
> > OK by me. A bit of idle time for kbuild is easily cured with telling
> > make to emit more jobs, so there's enough little jobs to go around.
> >
> > If x264 is declared dainbramaged, that's fine with me too.
>
> (P.S. I don't want to have to explain to users of any such
> thread-happy applications why they suck rocks under Linux though)
The 68% speedup is pretty valid for your change, and the workload isn't
particularly odd beyond the fast creation rate of threads - but I'd not
blame the app for that; Linux creates/destroys threads very fast. Your
followup load-balancing fix got queued up in the scheduler tree for
v2.6.33 ~two weeks ago:
1b9508f: sched: Rate-limit newidle
certainly handles some of the NUMA pain. If not, numbers telling us the
other story (and patches to extend 'perf bench sched' with matching
testcases) would be helpful; I'm not seeing problems on my NUMA testbox.
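For reference, the gist of 1b9508f (quoting from memory, details may
differ slightly) is to skip newidle balancing when a CPU's recent
average idle period is shorter than the migration cost:

	/* idle_balance(): bail if we expect to be idle only briefly */
	this_rq->idle_stamp = this_rq->clock;

	if (this_rq->avg_idle < sysctl_sched_migration_cost)
		return;

	/* try_to_wake_up(): track how long the CPU actually sat idle */
	if (rq->idle_stamp) {
		u64 delta = rq->clock - rq->idle_stamp;
		u64 max = 2 * sysctl_sched_migration_cost;

		if (delta > max)
			rq->avg_idle = max;
		else
			update_avg(&rq->avg_idle, delta);
		rq->idle_stamp = 0;
	}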
1b9508f was certainly too late for 2.6.32, but since it appears to be
fine it can be forwarded to [email protected] so it makes it into
2.6.32.1.
Thanks,
Ingo
* Nick Piggin <[email protected]> wrote:
> On Mon, Nov 23, 2009 at 06:04:45PM +0100, Ingo Molnar wrote:
> >
> > * Nick Piggin <[email protected]> wrote:
> >
> > > > .32 is kind of closed, with us being at -rc8.
> > >
> > > It's a bad regression though.
> >
> > It's about 3 months too late for that. Ideally we want performance
>
> Too late for what? Reporting and reverting a regression?
Yes, the revert would be too intrusive so late in the release cycle,
sorry. For such types of scheduler reverts we rarely go beyond -rc5 -
the risks of breaking something are too big.
Note that we have queued up a fix for that about two weeks ago for
v2.6.33 - that fix could be forwarded to [email protected] for 2.6.32.1
merging.
> [...] I don't think so. It is not my problem if patches aren't tested
> well enough before being merged.
There's certainly wide performance testing being done for
scheduler-relevant workloads all the time. If you don't like the end
result you are welcome to help out with testing and fixing things
(and ideally both).
I'd also welcome help from you to extend 'perf bench sched' with new
testcases. (It has only two so far - the tool just got started, so
there's lots of low-hanging fruit.) 'perf bench sched numa-cost' would
certainly be useful to have.
Thanks,
Ingo
> Quite a few being one test case, and on a program with a horrible
> parallelism design (rapid heavy weight forks to distribute small
> units of work).
> If x264 is declared dainbramaged, that's fine with me too.
We did multiple benchmarks using a thread pool and it did not help.
If you want to declare our app "braindamaged", feel free, but pooling
threads to avoid re-creation gave no benefit whatsoever. If you think
the parallelism methodology is wrong as a whole, you're basically
saying that Linux shouldn't be used for video compression, because
this is the exact same threading model used by almost every single
video encoder ever made. There are actually a few that use
slice-based threading, but those are actually even worse from your
perspective, because slice-based threading spawns multiple threads PER
FRAME instead of one per frame.
Because of the inter-frame dependencies in video coding it is
impossible to efficiently get a granularity of more than one thread
per frame. Pooling threads doesn't change the fact that you are
conceptually creating a thread for each frame--it just eliminates the
pthread_create call. In theory you could do one thread per group of
frames, but that is completely unrealistic for real-time encoding
(e.g. streaming), requires a catastrophically large amount of memory,
makes it impossible to track the bit buffer, and all other sorts of
bad stuff.
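To illustrate the point (a toy sketch, not x264 code): from the
scheduler's perspective the two variants below are the same workload -
one runnable worker per in-flight frame - and the pool merely takes
pthread_create() off the per-frame path. Builds with gcc -pthread:

        #include <pthread.h>
        #include <stdio.h>
        #include <unistd.h>

        static void *encode_frame(void *arg) /* stand-in for the work */
        {
                printf("encoding frame %ld\n", (long)arg);
                return NULL;
        }

        /* Variant A: a fresh thread per frame. */
        static void submit_create(long frame)
        {
                pthread_t t;

                pthread_create(&t, NULL, encode_frame, (void *)frame);
                pthread_detach(t);
        }

        /* Variant B: one pooled worker fed via a 1-deep mailbox
         * (simplified: single worker, no backpressure). */
        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
        static long pending = -1;

        static void *pool_worker(void *unused)
        {
                (void)unused;
                for (;;) {
                        long frame;

                        pthread_mutex_lock(&lock);
                        while (pending < 0)
                                pthread_cond_wait(&cond, &lock);
                        frame = pending;
                        pending = -1;
                        pthread_mutex_unlock(&lock);
                        encode_frame((void *)frame);
                }
                return NULL;
        }

        static void submit_pool(long frame)
        {
                pthread_mutex_lock(&lock);
                pending = frame;
                pthread_cond_signal(&cond);
                pthread_mutex_unlock(&lock);
        }

        int main(void)
        {
                pthread_t w;

                pthread_create(&w, NULL, pool_worker, NULL);
                submit_create(0); /* pays pthread_create per frame */
                submit_pool(1);   /* wakes an existing thread instead */
                sleep(1);         /* crude: let both workers run */
                return 0;
        }

Either way, once the frame is handed off the scheduler sees an identical
wakeup and an identical runnable thread, which is why pooling cannot fix
an idle-balancing problem.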
Jason Garrett-Glaser
On Tue, 2009-11-24 at 09:24 -0800, Jason Garrett-Glaser wrote:
> > Quite a few being one test case, and on a program with a horrible
> > parallelism design (rapid heavy weight forks to distribute small
> > units of work).
>
> > If x264 is declared dainbramaged, that's fine with me too.
>
> We did multiple benchmarks using a thread pool and it did not help.
Yes, I see no way it could possibly make any difference.
Well, there is one thing: we have this START_DEBIT feature, and using a
thread pool avoids that very significant penalty. WRT the
idle->busy-again time though, it can't make any difference.
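(The debit is applied at fork time in place_entity() - sketched below
from memory, so treat names and details as approximate; see
kernel/sched_fair.c for the real thing. A newly forked task starts with
its vruntime pushed one slice into the future, so it can't instantly
preempt; a pooled worker is woken, not forked, and never pays this.)

        static void place_entity(struct cfs_rq *cfs_rq,
                                 struct sched_entity *se, int initial)
        {
                u64 vruntime = cfs_rq->min_vruntime;

                if (initial && sched_feat(START_DEBIT))
                        vruntime += sched_vslice(cfs_rq, se); /* debit */

                /* ... wakeup-placement handling elided ... */

                /* never gain time by being placed backwards: */
                se->vruntime = max_vruntime(se->vruntime, vruntime);
        }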
> If you want to declare our app "braindamaged", feel free, but pooling
> threads to avoid re-creation gave no benefit whatsoever. If you think
> the parallelism methodology is wrong as a whole, you're basically
> saying that Linux shouldn't be used for video compression, because
> this is the exact same threading model used by almost every single
> video encoder ever made. There are actually a few that use
> slice-based threading, but those are actually even worse from your
> perspective, because slice-based threading spawns multiple threads PER
> FRAME instead of one per frame.
>
> Because of the inter-frame dependencies in video coding it is
> impossible to efficiently get a granularity of more than one thread
> per frame. Pooling threads doesn't change the fact that you are
> conceptually creating a thread for each frame--it just eliminates the
> pthread_create call. In theory you could do one thread per group of
> frames, but that is completely unrealistic for real-time encoding
> (e.g. streaming), requires a catastrophically large amount of memory,
> makes it impossible to track the bit buffer, and all other sorts of
> bad stuff.
I don't consider x264 to be braindamaged btw, I consider it to be a very
nice testcase for the scheduler. As soon as I saw the problem it
highlighted so well, it became a permanent member of my collection.
-Mike
On Tue, Nov 24, 2009 at 09:24:26AM -0800, Jason Garrett-Glaser wrote:
> > Quite a few being one test case, and on a program with a horrible
> > parallelism design (rapid heavy weight forks to distribute small
> > units of work).
>
> > If x264 is declared dainbramaged, that's fine with me too.
>
> We did multiple benchmarks using a thread pool and it did not help.
> If you want to declare our app "braindamaged", feel free, but pooling
> threads to avoid re-creation gave no benefit whatsoever. If you think
> the parallelism methodology is wrong as a whole, you're basically
> saying that Linux shouldn't be used for video compression, because
> this is the exact same threading model used by almost every single
> video encoder ever made. There are actually a few that use
> slice-based threading, but those are actually even worse from your
> perspective, because slice-based threading spawns multiple threads PER
> FRAME instead of one per frame.
>
> Because of the inter-frame dependencies in video coding it is
> impossible to efficiently get a granularity of more than one thread
> per frame. Pooling threads doesn't change the fact that you are
> conceptually creating a thread for each frame--it just eliminates the
> pthread_create call. In theory you could do one thread per group of
> frames, but that is completely unrealistic for real-time encoding
> (e.g. streaming), requires a catastrophically large amount of memory,
> makes it impossible to track the bit buffer, and all other sorts of
> bad stuff.
If you can scale to N threads by having 1 frame per thread, then
you can scale to N/2 threads and have 2 frames per thread. Can't
you?
Is your problem in scaling to a large N?
On Tue, Nov 24, 2009 at 10:11:24AM +0100, Ingo Molnar wrote:
>
> * Mike Galbraith <[email protected]> wrote:
>
> > On Tue, 2009-11-24 at 09:40 +0100, Mike Galbraith wrote:
> > > On Tue, 2009-11-24 at 07:53 +0100, Nick Piggin wrote:
> > > > On Mon, Nov 23, 2009 at 04:53:37PM +0100, Mike Galbraith wrote:
> > > > > On Mon, 2009-11-23 at 16:29 +0100, Nick Piggin wrote:
> > > > >
> > > > > > So basically about the least well performing or scalable possible
> > > > > > software architecture. This is exactly the wrong thing to optimise
> > > > > > for, guys.
> > > > >
> > > > > Hm. Isn't fork/exec our daily bread?
> > > >
> > > > No. Not for handing out tiny chunks of work and attempting to do
> > > > them in parallel. There is this thing called Amdahl's law, and if
> > > > you write a parallel program that wantonly uses the heaviest
> > > > possible primitives in its serial sections, then it doesn't deserve
> > > > to go fast.
> > >
> > > OK by me. A bit of idle time for kbuild is easily cured with telling
> > > make to emit more jobs, so there's enough little jobs to go around.
> > >
> > > If x264 is declared dainbramaged, that's fine with me too.
> >
> > (P.S. I don't want to have to explain to users of any such thread
> > happy applications why they suck rocks under Linux though)
>
> The 68% speedup pretty much validates your change, and the workload
> isn't particularly odd beyond its fast rate of thread creation - but
> i'd not blame the app for that; Linux creates/destroys threads very fast. Your
> followup load-balancing fix got queued up in the scheduler tree for
> v2.6.33 ~two weeks ago:
>
> 1b9508f: sched: Rate-limit newidle
>
> certainly handles some of the NUMA pain. If not, numbers telling us the
> other story (and patches to extend 'perf bench sched' with matching
> testcases) would be helpful - i'm not seeing problems on my NUMA testbox.
Yay for more complexity. And it doesn't fix all the problems introduced,
because it does nothing to restore the now inherently far weaker NUMA
affinity.
> 1b9508f was certainly too late for 2.6.32, but since it appears to be
> fine it can be forwarded to [email protected] so it makes it into
> 2.6.32.1.
I totally disagree with this development / release process of yours.
In my opinion it is *not* ok to leave a wake of known regressions behind
a feature that helps some other (non-regression) case.
If it was agreed that rate limiting newidle is the right fix for the
regression, and it was agreed that it is too late for the next release,
then IMO the correct way to approach the release is to revert the
offending patch until the new, combined, approach is tested and
validated.
Doing it your way, you release a 2.6.32 kernel with a known regression
and also a case that gets a big speedup. This makes it impossible now to
revert that patch without causing a regression for someone. And then you
think you have a way to fix it, but it hasn't had enough testing, so it
may yet cause other regressions or fail to fix the original problem well
enough.
On Mon, Nov 30, 2009 at 12:19 AM, Nick Piggin <[email protected]> wrote:
> On Tue, Nov 24, 2009 at 09:24:26AM -0800, Jason Garrett-Glaser wrote:
>> > Quite a few being one test case, and on a program with a horrible
>> > parallelism design (rapid heavy weight forks to distribute small
>> > units of work).
>>
>> > If x264 is declared dainbramaged, that's fine with me too.
>>
>> We did multiple benchmarks using a thread pool and it did not help.
>> If you want to declare our app "braindamaged", feel free, but pooling
>> threads to avoid re-creation gave no benefit whatsoever. If you think
>> the parallelism methodology is wrong as a whole, you're basically
>> saying that Linux shouldn't be used for video compression, because
>> this is the exact same threading model used by almost every single
>> video encoder ever made. There are actually a few that use
>> slice-based threading, but those are actually even worse from your
>> perspective, because slice-based threading spawns multiple threads PER
>> FRAME instead of one per frame.
>>
>> Because of the inter-frame dependencies in video coding it is
>> impossible to efficiently get a granularity of more than one thread
>> per frame. Pooling threads doesn't change the fact that you are
>> conceptually creating a thread for each frame--it just eliminates the
>> pthread_create call. In theory you could do one thread per group of
>> frames, but that is completely unrealistic for real-time encoding
>> (e.g. streaming), requires a catastrophically large amount of memory,
>> makes it impossible to track the bit buffer, and all other sorts of
>> bad stuff.
>
> If you can scale to N threads by having 1 frame per thread, then
> you can scale to N/2 threads and have 2 frames per thread. Can't
> you?
>
> Is your problem in scaling to a large N?
>
>
x264's threading is described here:
http://akuvian.org/src/x264/sliceless_threads.txt
By example (3 threads), simplified:
Step 0:
Frame 0: 0% done
Step 1:
Frame 0: 33% done
Frame 1: 0% done
Step 2:
Frame 0: 66% done
Frame 1: 33% done
Frame 2: 0% done
Step 3:
Frame 0: 100% done
Frame 1: 66% done
Frame 2: 33% done
Frame 3: 0% done
Step 4:
Frame 1: 100% done
Frame 2: 66% done
Frame 3: 33% done
Frame 4: 0% done
(etc)
The motion search is restricted so that, for example, in Step 3, frame
2 doesn't look beyond the completed 66% of frame 1.
Some vertical slack is reserved for sync so that each thread doesn't
have to be exactly in lock-step with the others. This avoids most
unnecessary waiting.
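In code, the per-row handshake is conceptually something like this (an
illustrative sketch, not x264 source - the real mechanism is described
in the document linked above). Each frame publishes how many macroblock
rows it has finished; a dependent frame blocks until its reference is
far enough along:

        #include <pthread.h>

        struct frame {
                pthread_mutex_t lock;
                pthread_cond_t progress;
                int rows_done;
        };

        /* Called by the thread encoding 'f' after each mb row. */
        static void frame_row_done(struct frame *f)
        {
                pthread_mutex_lock(&f->lock);
                f->rows_done++;
                pthread_cond_broadcast(&f->progress);
                pthread_mutex_unlock(&f->lock);
        }

        /* Called by the thread encoding the next frame before motion
         * search: wait until the reference has completed the rows the
         * search may touch (current row plus the vertical search
         * range, plus the slack rows mentioned above). */
        static void frame_wait(struct frame *ref, int row_needed)
        {
                pthread_mutex_lock(&ref->lock);
                while (ref->rows_done < row_needed)
                        pthread_cond_wait(&ref->progress, &ref->lock);
                pthread_mutex_unlock(&ref->lock);
        }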
The problem is that each frame is inherently one "work unit". Its
dependencies are all on the previous frame (Frame 1 depends on Frame 0).
It doesn't make any sense to try to lump multiple frames together into a
work unit when the dependencies don't work that way.
Just dumping two frames arbitrarily in one thread turns this into a
thread pool, which as mentioned previously probably wouldn't help
significantly. If you meant working on two frames simultaneously in
the same thread, that's even worse--it's going to be a cache thrashing
disaster, since the scheduler can no longer move two threads to
separate cores, and you now have two totally separate sets of
processing trying to dump themselves into the same cache.
Furthermore, that doesn't reduce the main limitation on threading: the
vertical height of the frame.
Another thing to note is that "fast thread creation" isn't the only
problem here: the scheduler changes gave x264 enormous speed boosts even
at *slower* encoding settings. One user reported a gain from 25fps ->
39fps, for example; that's dozens of milliseconds per thread, far longer
than anything I'd expect to cause problems from threads being too
short-lived. You should probably consider doing some testing with slower
encoding as well, both fast settings with high-resolution inputs and
slow settings with low-resolution inputs, where the bottleneck is purely
computational.
Some resources for such testing:
1. http://media.xiph.org/video/derf/ has a lot of free test clips (HD
ones at the bottom).
2. x264 --help lists a set of presets from "ultrafast" to "placebo"
which can be used for testing purposes. "veryslow" and "placebo" are
probably not very suitable as they often tend to be horrifically
lookahead-bottlenecked.
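Putting that together, something like the following (illustrative
command lines; the clip name is just an example from the derf
collection, and -o /dev/null discards the encoded output):

  x264 --preset ultrafast --threads 1 -o /dev/null soccer_4cif.y4m
  x264 --preset ultrafast --threads 8 -o /dev/null soccer_4cif.y4m
  x264 --preset slow      --threads 8 -o /dev/null soccer_4cif.y4m

Comparing the reported fps between the single- and multi-threaded runs
at both ends of the preset range should show whether the scheduler is
leaving cores idle.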
Jason