OK, I fixed the graphs so you can actually read them now ;-)
http://test.kernel.org/perf/kernbench.elm3b6.png (x86_64 4x)
http://test.kernel.org/perf/kernbench.moe.png (NUMA-Q)
http://test.kernel.org/perf/kernbench.elm3b132.png (4x SMP ia32)
Both seem significantly slower on -mm (-mm is the green line)
If I look at diffprofile between 2.6.15 and 2.6.15-mm1, it just looks
like we have lots more idle time. You've got strange scheduler changes in
there that you've been carrying for a long time (2.6.14-mm1 at least)?
Or HZ piddling? Seems to be mainly getting much more idle time.
Diffprofile:
1278 39.6% default_idle
1261 10.1% total
243 1518.8% find_get_page
220 0.0% copy_user_generic
100 357.1% free_hot_cold_page
...
-106 -93.0% __pagevec_free
-125 -100.0% __free_pages_ok
-239 -100.0% copy_user_generic_c
-242 -100.0% find_trylock_page
Original profiles:
http://test.kernel.org/19657/002.kernbench.test/profiling/profile.text
(2.6.15)
http://test.kernel.org/19794/002.kernbench.test/profiling/profile.text
(2.6.15-mm1)
Martin Bligh <[email protected]> wrote:
>
> OK, I fixed the graphs so you can actually read them now ;-)
They're cute.
> http://test.kernel.org/perf/kernbench.elm3b6.png (x86_64 4x)
> http://test.kernel.org/perf/kernbench.moe.png (NUMA-Q)
> http://test.kernel.org/perf/kernbench.elm3b132.png (4x SMP ia32)
>
> Both seem significantly slower on -mm (-mm is the green line)
Well, 1% slower. -mm has permanent not-for-linus debug things, some of
which are expected to have a performance impact. I don't know whether
they'd have a 1% impact though.
> If I look at diffprofile between 2.6.15 and 2.6.15-mm1, it just looks
> like we have lots more idle time.
Yes, we do. It'd be useful to test -git7..
> You've got strange scheduler changes in
> there that you've been carrying for a long time (2.6.14-mm1 at least)?
> Or HZ piddling? Seems to be mainly getting much more idle time.
Yes, there are CPU scheduler changes, although much fewer than usual.
Ingo, any suggestions as to a culprit?
Andrew Morton wrote:
> Martin Bligh <[email protected]> wrote:
>
>>OK, I fixed the graphs so you can actually read them now ;-)
>
> They're cute.
Thanks. Hopefully we can also make them useful ;-)
Have finally got some time to play, and am working on a "push"
comparison model that'll send you email pro-actively.
>>http://test.kernel.org/perf/kernbench.elm3b6.png (x86_64 4x)
>>http://test.kernel.org/perf/kernbench.moe.png (NUMA-Q)
>>http://test.kernel.org/perf/kernbench.elm3b132.png (4x SMP ia32)
>>
>>Both seem significantly slower on -mm (-mm is the green line)
>
>
> Well, 1% slower. -mm has permanent not-for-linus debug things, some of
> which are expected to have a performance impact. I don't know whether
> they'd have a 1% impact though.
OK, fair enough. Can I turn them off somehow to check? It's 10% on
NUMA-Q. The good news is that it's stayed in -mm for a long time, so ...
am praying.
It's cool to have extra debug stuff. It does make it harder to check perf
though ... if we can do runs both with and without, it'd be ideal, I
guess. I'd like to be able to spot perf degradations before they hit
mainline.
>>If I look at diffprofile between 2.6.15 and 2.6.15-mm1, it just looks
>>like we have lots more idle time.
>
>
> Yes, we do. It'd be useful to test -git7..
Will do. It does all of them.
>>You've got strange scheduler changes in
>>there that you've been carrying for a long time (2.6.14-mm1 at least)?
>>Or HZ piddling? Seems to be mainly getting much more idle time.
>
> Yes, there are CPU scheduler changes, although much fewer than usual.
> Ingo, any suggestions as to a culprit?
I'd truncated all -mm info in the filtering before 2.6.14 .. am putting
it back so we can see clearly ... done. Look again.
Seems to have gone wrong between 2.6.14-rc1-mm1 and 2.6.14-rc2-mm1 ?
See http://test.kernel.org/perf/kernbench.moe.png for clearest effect.
M.
Martin Bligh <[email protected]> wrote:
>
> Seems to have gone wrong between 2.6.14-rc1-mm1 and 2.6.14-rc2-mm1 ?
> See http://test.kernel.org/perf/kernbench.moe.png for clearest effect.
akpm:/usr/src/25> diff -u pc/2.6.14-rc1-mm1-series pc/2.6.14-rc2-mm1-series| grep sched
+move-tasklist-walk-from-cfq-iosched-to-elevatorc.patch
sched-consider-migration-thread-with-smp-nice.patch
sched-better-wake-balancing-3.patch
+sched-modified-nice-support-for-smp-load-balancing.patch
#sched-rationalise-resched-and-cpu_idle.patch
#fixup-resched-and-arch-cpu_idle.patch
Try reverting that sucker?
On Wed, 11 Jan 2006 12:41 pm, Martin Bligh wrote:
> Seems to have gone wrong between 2.6.14-rc1-mm1 and 2.6.14-rc2-mm1 ?
> See http://test.kernel.org/perf/kernbench.moe.png for clearest effect.
The only new scheduler patch at that time was this:
+sched-modified-nice-support-for-smp-load-balancing.patch
which was Peter's modifications to my smp nice support. cc'ed Peter
I guess we need to check whether reversing this patch helps.
Con
Martin Bligh <[email protected]> wrote:
>
> > Well, 1% slower. -mm has permanent not-for-linus debug things, some of
> > which are expected to have a performance impact. I don't know whether
> > they'd have a 1% impact though.
>
> OK, fair enough. Can I turn them off somehow to check? It's 10% on
> NUMA-Q. The good news is that it's stayed in -mm for a long time, so ...
> am praying.
There are quite a lot of them.
list_del-debug.patch
page-owner-tracking-leak-detector.patch
page-owner-tracking-leak-detector-fix.patch
unplug-can-sleep.patch
firestream-warnings.patch
#periodically-scan-redzone-entries-and-slab-control-structures.patch
slab-leak-detector.patch
releasing-resources-with-children.patch
nr_blockdev_pages-in_interrupt-warning.patch
nmi-lockup-and-altsysrq-p-dumping-calltraces-on-_all_-cpus.patch
detect-atomic-counter-underflows.patch
sysfs-crash-debugging.patch
device-suspend-debug.patch
slab-cache-shrinker-statistics.patch
slab-cache-shrinker-statistics-fix.patch
mm-debug-dump-pageframes-on-bad_page.patch
debug-warn-if-we-sleep-in-an-irq-for-a-long-time.patch
debug-shared-irqs.patch
debug-shared-irqs-fix.patch
make-frame_pointer-default=y.patch
list_del-debug.patch, detect-atomic-counter-underflows.patch,
slab-cache-shrinker-statistics.patch will all have a small impact.
But not much.
Con Kolivas wrote:
> On Wed, 11 Jan 2006 12:41 pm, Martin Bligh wrote:
>
>>Seems to have gone wrong between 2.6.14-rc1-mm1 and 2.6.14-rc2-mm1 ?
>>See http://test.kernel.org/perf/kernbench.moe.png for clearest effect.
>
>
> The only new scheduler patch at that time was this:
> +sched-modified-nice-support-for-smp-load-balancing.patch
>
> which was Peter's modifications to my smp nice support. cc'ed Peter
>
This patch will probably have overhead implications as it will skip some
tasks during load balancing, looking for ones whose bias_prio is small
enough to fit within the amount of bias to be moved. Because the
candidate tasks are ordered in the array by dynamic priority and not by
nice (or bias_prio), the search is exhaustive. I need to think about
whether this can be made a little smarter, e.g. skip to the next idx as
soon as a task whose bias_prio is too large is encountered. This would
have the effect of missing some tasks that could have been moved, but
these will generally be tasks with interactive bonuses causing them to
be in a lower slot in the queue than would otherwise be the case, and as
they generally do small CPU runs, not moving them probably won't make
much difference.
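Roughly what I have in mind, as a sketch only (this isn't against the
actual move_tasks() loop, and rem_load here is just shorthand for the
bias load still to be moved):

	/*
	 * Sketch: while scanning the list for the current idx, stop as
	 * soon as one task is too "heavy" to fit within the remaining
	 * load to be moved, instead of scanning the list exhaustively.
	 * This can miss movable tasks further down the list, but those
	 * are usually ones with interactive bonuses doing short CPU runs.
	 */
	list_for_each_entry(p, head, run_list) {
		if (p->bias_prio > rem_load)
			break;	/* give up on this idx, try the next one */
		/* ... existing can_migrate_task()/pull_task() path ... */
	}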
> I guess we need to check whether reversing this patch helps.
It would be interesting to see if it does.
If it does we probably have to wear the cost (and try to reduce it) as
without this change smp nice support is fairly ineffective due to the
fact that it moves exactly the same tasks as would be moved without it.
At the most it changes the frequency at which load balancing occurs.
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
On Wed, 11 Jan 2006 01:38 pm, Peter Williams wrote:
> Con Kolivas wrote:
> > I guess we need to check whether reversing this patch helps.
>
> It would be interesting to see if it does.
>
> If it does we probably have to wear the cost (and try to reduce it) as
> without this change smp nice support is fairly ineffective due to the
> fact that it moves exactly the same tasks as would be moved without it.
> At the most it changes the frequency at which load balancing occurs.
I disagree. I think the current implementation changes the balancing according
to nice much more effectively than previously, where, by their very nature, low
priority tasks were balanced more frequently and ended up getting their own
cpu. No, it does not provide the firm 'nice' handling that we can achieve on UP
configurations, but it is also free in throughput terms and miles better than
without it. I would like to see your more robust (and nicer code) solution
incorporated, but I also want to see it cost us as little as possible. We
haven't confirmed anything just yet...
Cheers,
Con
Con Kolivas wrote:
> On Wed, 11 Jan 2006 01:38 pm, Peter Williams wrote:
>
>>Con Kolivas wrote:
>> > I guess we need to check whether reversing this patch helps.
>>
>>It would be interesting to see if it does.
>>
>>If it does we probably have to wear the cost (and try to reduce it) as
>>without this change smp nice support is fairly ineffective due to the
>>fact that it moves exactly the same tasks as would be moved without it.
>> At the most it changes the frequency at which load balancing occurs.
>
>
> I disagree. I think the current implementation changes the balancing according
> to nice much more effectively than previously where by their very nature, low
> priority tasks were balanced more frequently and ended up getting their own
> cpu. No it does not provide firm 'nice' handling that we can achieve on UP
> configurations but it is also free in throughput terms and miles better than
> without it. I would like to see your more robust (and nicer code) solution
> incorporated but I also want to see it cost us as little as possible. We
> haven't confirmed anything just yet...
Whether it turns out to be that or not ... 10% is a BIG frigging hit to
take doing something basic like kernel compilation. Hell, 1% is a big
hit to take, given the lengths we go to to buy that sort of benefit.
M.
Con Kolivas wrote:
> On Wed, 11 Jan 2006 01:38 pm, Peter Williams wrote:
>
>>Con Kolivas wrote:
>> > I guess we need to check whether reversing this patch helps.
>>
>>It would be interesting to see if it does.
>>
>>If it does we probably have to wear the cost (and try to reduce it) as
>>without this change smp nice support is fairly ineffective due to the
>>fact that it moves exactly the same tasks as would be moved without it.
>> At the most it changes the frequency at which load balancing occurs.
>
>
> I disagree. I think the current implementation changes the balancing according
> to nice much more effectively than previously where by their very nature, low
> priority tasks were balanced more frequently and ended up getting their own
> cpu.
I can't follow the logic here and I certainly don't see much difference
in practice.
> No it does not provide firm 'nice' handling that we can achieve on UP
> configurations but it is also free in throughput terms and miles better than
> without it. I would like to see your more robust (and nicer code) solution
> incorporated but I also want to see it cost us as little as possible. We
> haven't confirmed anything just yet...
Yes, that's true. I must admit that I wouldn't have expected the
increased overhead to be very big. In general, the "system" CPU time in
a kernbench is only about 1% of the total CPU time and the extra
overhead should be just a fraction of that.
It's possible that better distribution of niceness across CPUs leads to
more preemption within a run queue (i.e. if they're all the same priority
they won't preempt each other much), leading to more context switches.
But you wouldn't expect that to show up in kernbench as the tasks are
all the same niceness and usually end up with the same dynamic priority.
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
On Wed, 11 Jan 2006 02:40 pm, Peter Williams wrote:
> Con Kolivas wrote:
> > I disagree. I think the current implementation changes the balancing
> > according to nice much more effectively than previously where by their
> > very nature, low priority tasks were balanced more frequently and ended
> > up getting their own cpu.
>
> I can't follow the logic here
CPU bound non-interactive tasks have long timeslices. Tasks like
interactive ones, or CPU bound ones at nice 19, have short timeslices.
If a nice 0 and a nice 19 task are running on the same cpu, the nice 19
one is going to be spending most of its time waiting in the runqueue.
As soon as an idle cpu appears it will only pull a task that is waiting in a
runqueue... and that is going to be the low priority task.
> Yes, that's true. I must admit that I wouldn't have expected the
> increased overhead to be very big. In general, the "system" CPU time in
> a kernbench is only about 1% of the total CPU time and the extra
> overhead should be just a fraction of that.
Doesn't appear to be system time. Extra idle time does suggest poor balancing
though. Remember the effort I went to to avoid altering the delicate idle
balancing...
> It's possible that better distribution of niceness across CPU leads to
> more preemption within a run queue (i.e. if there all the same priority
> they won't preempt each other much) leading to more context switches.
Can't see how that would be, given what you say below.
> But you wouldn't expect that to show up in kernbench as the tasks are
> all the same niceness and usually end up with the same dynamic priority.
The whole of userspace on a kernbench run is going to be nice 0.
Let's wait till we confirm or deny this patch is responsible before
postulating further.
Cheers,
Con
Con Kolivas wrote:
> On Wed, 11 Jan 2006 02:40 pm, Peter Williams wrote:
>
>>Con Kolivas wrote:
>>
>>>I disagree. I think the current implementation changes the balancing
>>>according to nice much more effectively than previously where by their
>>>very nature, low priority tasks were balanced more frequently and ended
>>>up getting their own cpu.
>>
>>I can't follow the logic here
>
>
> cpu bound non interactive tasks have long timeslices. Tasks that have short
> timeslices like interactive ones or cpu bound ones at nice 19 have short
> timeslices.
Time slice size is dependent on nice (via static_prio) only, gets bigger
as static_prio gets smaller, and only really affects the switching of
tasks from the active array to the expired array. This means that
programs with high nice values will tend to spend more time on the
expired array than the active array. Since the expired queue is always
examined first, this makes them the most likely to be moved regardless of
the smp nice patch. This is independent of the amount of CPU a task
uses each time it gets onto the CPU, which is what I think you were
alluding to.
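Just to illustrate the shape of that dependence (illustrative numbers
only, not the real task_timeslice() constants in kernel/sched.c):

	/*
	 * Illustrative only: the timeslice is a function of static_prio
	 * alone and shrinks as static_prio grows (i.e. as nice increases).
	 */
	static unsigned int example_timeslice(int static_prio)
	{
		return 5 + 5 * (MAX_PRIO - static_prio);	/* roughly, in ms */
	}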
> If a nice 0 and nice 19 task are running on the same cpu, the
> nice 19 one is going to be spending most of its time waiting in the runqueue.
> As soon as an idle cpu appears it will only pull a task that is waiting in a
> runqueue... and that is going to be the low priority tasks.
Because they're more likely to be on the expired array.
So the patch works by reducing the chance of any tasks being moved
during an idle rebalance. Surely this is likely to increase the amount
of idle time.
Maybe the idle balance should check the active arrays before it checks
the expired arrays? This will increase the chances of getting a high
priority task. The downside is that tasks on the active array are more
likely to be "cache warm", which is why the expired array is checked
first (hence the suggestion that this only apply to idle balancing).
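Something like the following, as a rough sketch only (written from memory
of the array selection at the top of move_tasks(), so the details may not
match the actual code):

	/*
	 * Sketch: when balancing from an idle CPU, prefer the active array
	 * so we're more likely to grab a high priority task; keep the
	 * current expired-first order (cache-cold tasks) for the normal
	 * busy rebalance.
	 */
	if (idle != NOT_IDLE && busiest->active->nr_active) {
		array = busiest->active;
		dst_array = this_rq->active;
	} else if (busiest->expired->nr_active) {
		array = busiest->expired;
		dst_array = this_rq->expired;
	} else {
		array = busiest->active;
		dst_array = this_rq->active;
	}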
But, as you say, let's wait and see what happens with the patch backed
out before we get too carried away.
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
Peter Williams wrote:
> Con Kolivas wrote:
>
>> On Wed, 11 Jan 2006 01:38 pm, Peter Williams wrote:
>>
>>> Con Kolivas wrote:
>>> > I guess we need to check whether reversing this patch helps.
>>>
>>> It would be interesting to see if it does.
>>>
>>> If it does we probably have to wear the cost (and try to reduce it) as
>>> without this change smp nice support is fairly ineffective due to the
>>> fact that it moves exactly the same tasks as would be moved without it.
>>> At the most it changes the frequency at which load balancing occurs.
>>
>>
>>
>> I disagree. I think the current implementation changes the balancing
>> according to nice much more effectively than previously where by their
>> very nature, low priority tasks were balanced more frequently and
>> ended up getting their own cpu.
>
>
> I can't follow the logic here and I certainly don't see much difference
> in practice.
I think I've figured out why I'm not seeing much difference in practice.
I'm only testing on 2 CPU systems and it seems to me that the main
difference that the SMP nice patch will have is in selecting which CPU
to steal tasks from (grabbing the one with the highest priority tasks)
and this is a non-issue on a 2 CPU system. :-(
So I should revise my statement to say that it doesn't make much
difference if there's only 2 CPUs.
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
Peter Williams wrote:
> Peter Williams wrote:
>
>> Con Kolivas wrote:
>>
>>> On Wed, 11 Jan 2006 01:38 pm, Peter Williams wrote:
>>>
>>>> Con Kolivas wrote:
>>>> > I guess we need to check whether reversing this patch helps.
>>>>
>>>> It would be interesting to see if it does.
>>>>
>>>> If it does we probably have to wear the cost (and try to reduce it) as
>>>> without this change smp nice support is fairly ineffective due to the
>>>> fact that it moves exactly the same tasks as would be moved without
>>>> it.
>>>> At the most it changes the frequency at which load balancing occurs.
>>>
>>>
>>>
>>>
>>> I disagree. I think the current implementation changes the balancing
>>> according to nice much more effectively than previously where by
>>> their very nature, low priority tasks were balanced more frequently
>>> and ended up getting their own cpu.
>>
>>
>>
>> I can't follow the logic here and I certainly don't see much
>> difference in practice.
>
>
> I think I've figured out why I'm not seeing much difference in
> practice. I'm only testing on 2 CPU systems and it seems to me that
> the main difference that the SMP nice patch will have is in selecting
> which CPU to steal tasks from (grabbing the one with the highest
> priority tasks) and this is a non issue on a 2 CPU system. :-(
>
> So I should revise my statement to say that it doesn't make much
> difference if there's only 2 CPUs.
>
If nothing's niced, why would it be affecting scheduling decisions at all?
That seems broken to me?
M.
Martin J. Bligh wrote:
> Peter Williams wrote:
>
>> Peter Williams wrote:
>>
>>> Con Kolivas wrote:
>>>
>>>> On Wed, 11 Jan 2006 01:38 pm, Peter Williams wrote:
>>>>
>>>>> Con Kolivas wrote:
>>>>> > I guess we need to check whether reversing this patch helps.
>>>>>
>>>>> It would be interesting to see if it does.
>>>>>
>>>>> If it does we probably have to wear the cost (and try to reduce it) as
>>>>> without this change smp nice support is fairly ineffective due to the
>>>>> fact that it moves exactly the same tasks as would be moved without
>>>>> it.
>>>>> At the most it changes the frequency at which load balancing occurs.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> I disagree. I think the current implementation changes the balancing
>>>> according to nice much more effectively than previously where by
>>>> their very nature, low priority tasks were balanced more frequently
>>>> and ended up getting their own cpu.
>>>
>>>
>>>
>>>
>>> I can't follow the logic here and I certainly don't see much
>>> difference in practice.
>>
>>
>>
>> I think I've figured out why I'm not seeing much difference in
>> practice. I'm only testing on 2 CPU systems and it seems to me that
>> the main difference that the SMP nice patch will have is in selecting
>> which CPU to steal tasks from (grabbing the one with the highest
>> priority tasks) and this is a non issue on a 2 CPU system. :-(
>>
>> So I should revise my statement to say that it doesn't make much
>> difference if there's only 2 CPUs.
>>
>
> If nothing's niced, why would it be affecting scheduling decisions at all?
Load balancing decisions.
> That seems broken to me ?
But, yes, given that the problem goes away when the patch is removed
(which we're still waiting to see), it's broken. I think the problem is
probably due to the changed metric (i.e. biased load instead of simple
load) causing idle_balance() to fail more often (i.e. it decides not to
bother moving any tasks more often than it otherwise would), which would
explain the increased idle time being seen. This means that the fix
would be to review the criteria for deciding whether to move tasks in
idle_balance().
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
On Wednesday 11 January 2006 23:24, Peter Williams wrote:
> Martin J. Bligh wrote:
> > That seems broken to me ?
>
> But, yes, given that the problem goes away when the patch is removed
> (which we're still waiting to see) it's broken. I think the problem is
> probably due to the changed metric (i.e. biased load instead of simple
> load) causing idle_balance() to fail more often (i.e. it decides to not
> bother moving any tasks more often than it otherwise would) which would
> explain the increased idle time being seen. This means that the fix
> would be to review the criteria for deciding whether to move tasks in
> idle_balance().
Look back on my implementation. The problem as I saw it was that one task
alone with a biased load would suddenly make a runqueue look much busier than
it was supposed to, so I special-cased the runqueue that had precisely one
task.
Cheers,
Con
Con Kolivas wrote:
> On Wednesday 11 January 2006 23:24, Peter Williams wrote:
>
>>Martin J. Bligh wrote:
>>
>>>That seems broken to me ?
>>
>>But, yes, given that the problem goes away when the patch is removed
>>(which we're still waiting to see) it's broken. I think the problem is
>>probably due to the changed metric (i.e. biased load instead of simple
>>load) causing idle_balance() to fail more often (i.e. it decides to not
>>bother moving any tasks more often than it otherwise would) which would
>>explain the increased idle time being seen. This means that the fix
>>would be to review the criteria for deciding whether to move tasks in
>>idle_balance().
>
>
> Look back on my implementation. The problem as I saw it was that one task
> alone with a biased load would suddenly make a runqueue look much busier than
> it was supposed to so I special cased the runqueue that had precisely one
> task.
OK. I'll look at that.
But I was thinking more about the code that (in the original) handled
the case where the number of tasks to be moved was less than 1 but more
than 0 (i.e. the cases where "imbalance" would have been reduced to zero
when divided by SCHED_LOAD_SCALE). I think that I got that part wrong
and you can end up with a bias load to be moved which is less than any
of the bias_prio values for any queued tasks (in circumstances where the
original code would have rounded up to 1 and caused a move). I think
that the way to handle this problem is to replace 1 with "average bias
prio" within that logic. This would guarantee at least one task with a
bias_prio small enough to be moved.
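To put rough numbers on it (all tasks at nice 0, so NICE_TO_BIAS_PRIO(0)
== 20): say the imbalance works out at about half of one task's load. The
original code sees a value less than SCHED_LOAD_SCALE, takes the special
case at the bottom of find_busiest_group() and rounds up to one task,
which move_tasks() duly moves. With biased load the same imbalance is
roughly 10 * SCHED_LOAD_SCALE, so that special case is never taken, and
after the divide by SCHED_LOAD_SCALE we ask move_tasks() to move 10 units
of bias when every queued task has bias_prio == 20, so nothing moves and
the CPU stays idle.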
I think that this analysis is a strong argument for my original patch
being the cause of the problem so I'll go ahead and generate a fix.
I'll try to have a patch available later this morning.
Peter
PS
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
Index: MM-2.6.X/kernel/sched.c
===================================================================
--- MM-2.6.X.orig/kernel/sched.c 2006-01-12 09:23:38.000000000 +1100
+++ MM-2.6.X/kernel/sched.c 2006-01-12 10:44:50.000000000 +1100
@@ -2116,11 +2116,11 @@ find_busiest_group(struct sched_domain *
(avg_load - this_load) * this->cpu_power)
/ SCHED_LOAD_SCALE;
- if (*imbalance < SCHED_LOAD_SCALE) {
+ if (*imbalance < NICE_TO_BIAS_PRIO(0) * SCHED_LOAD_SCALE) {
unsigned long pwr_now = 0, pwr_move = 0;
unsigned long tmp;
- if (max_load - this_load >= SCHED_LOAD_SCALE*2) {
+ if (max_load - this_load >= NICE_TO_BIAS_PRIO(0) * SCHED_LOAD_SCALE*2) {
*imbalance = NICE_TO_BIAS_PRIO(0);
return busiest;
}
On Thu, 12 Jan 2006 11:54 am, Peter Williams wrote:
> Peter Williams wrote:
> > Con Kolivas wrote:
> >> On Wednesday 11 January 2006 23:24, Peter Williams wrote:
> >>> Martin J. Bligh wrote:
> >>>> That seems broken to me ?
> >>>
> >>> But, yes, given that the problem goes away when the patch is removed
> >>> (which we're still waiting to see) it's broken. I think the problem is
> >>> probably due to the changed metric (i.e. biased load instead of simple
> >>> load) causing idle_balance() to fail more often (i.e. it decides to not
> >>> bother moving any tasks more often than it otherwise would) which would
> >>> explain the increased idle time being seen. This means that the fix
> >>> would be to review the criteria for deciding whether to move tasks in
> >>> idle_balance().
> >>
> >> Look back on my implementation. The problem as I saw it was that one
> >> task alone with a biased load would suddenly make a runqueue look much
> >> busier than it was supposed to so I special cased the runqueue that
> >> had precisely one task.
> >
> > OK. I'll look at that.
>
> Addressed in a separate e-mail.
>
> > But I was thinking more about the code that (in the original) handled
> > the case where the number of tasks to be moved was less than 1 but more
> > than 0 (i.e. the cases where "imbalance" would have been reduced to zero
> > when divided by SCHED_LOAD_SCALE). I think that I got that part wrong
> > and you can end up with a bias load to be moved which is less than any
> > of the bias_prio values for any queued tasks (in circumstances where the
> > original code would have rounded up to 1 and caused a move). I think
> > that the way to handle this problem is to replace 1 with "average bias
> > prio" within that logic. This would guarantee at least one task with a
> > bias_prio small enough to be moved.
> >
> > I think that this analysis is a strong argument for my original patch
> > being the cause of the problem so I'll go ahead and generate a fix. I'll
> > try to have a patch available later this morning.
>
> Attached is a patch that addresses this problem. Unlike the description
> above it does not use "average bias prio" as that solution would be very
> complicated. Instead it makes the assumption that NICE_TO_BIAS_PRIO(0)
> is a "good enough" for this purpose as this is highly likely to be the
> median bias prio and the median is probably better for this purpose than
> the average.
This is a shot in the dark. We haven't confirmed 1. that there is a problem,
2. that this is the problem, nor 3. that this patch will fix the problem. I
say we wait for the results of 1. If the improved smp nice handling patch
ends up being responsible then it should not be merged upstream, and then
this patch can be tested on top.
Martin, I know your work move has made it not your responsibility to test
backing out this change, but are you aware of anything being done to test
this hypothesis?
Con
Index: MM-2.6.X/kernel/sched.c
===================================================================
--- MM-2.6.X.orig/kernel/sched.c 2006-01-12 10:44:50.000000000 +1100
+++ MM-2.6.X/kernel/sched.c 2006-01-12 10:47:01.000000000 +1100
@@ -2052,6 +2052,7 @@ find_busiest_group(struct sched_domain *
unsigned long load;
int local_group;
int i;
+ unsigned int eligible_qs = 0;
local_group = cpu_isset(this_cpu, group->cpumask);
@@ -2065,8 +2066,11 @@ find_busiest_group(struct sched_domain *
/* Bias balancing toward cpus of our domain */
if (local_group)
load = target_load(i, load_idx);
- else
+ else {
load = source_load(i, load_idx);
+ if (cpu_rq(i)->nr_running > 1)
+ ++eligible_qs;
+ }
avg_load += load;
}
@@ -2080,7 +2084,7 @@ find_busiest_group(struct sched_domain *
if (local_group) {
this_load = avg_load;
this = group;
- } else if (avg_load > max_load) {
+ } else if (avg_load > max_load && eligible_qs) {
max_load = avg_load;
busiest = group;
}
@@ -2181,8 +2185,12 @@ static runqueue_t *find_busiest_queue(st
load = source_load(i, 0);
if (load > max_load) {
- max_load = load;
- busiest = cpu_rq(i);
+ runqueue_t *tmprq = cpu_rq(i);
+
+ if (tmprq->nr_running > 1) {
+ max_load = load;
+ busiest = tmprq;
+ }
}
}
Con Kolivas wrote:
> On Thu, 12 Jan 2006 11:54 am, Peter Williams wrote:
>
>>Peter Williams wrote:
>>
>>>Con Kolivas wrote:
>>>
>>>>On Wednesday 11 January 2006 23:24, Peter Williams wrote:
>>>>
>>>>>Martin J. Bligh wrote:
>>>>>
>>>>>>That seems broken to me ?
>>>>>
>>>>>But, yes, given that the problem goes away when the patch is removed
>>>>>(which we're still waiting to see) it's broken. I think the problem is
>>>>>probably due to the changed metric (i.e. biased load instead of simple
>>>>>load) causing idle_balance() to fail more often (i.e. it decides to not
>>>>>bother moving any tasks more often than it otherwise would) which would
>>>>>explain the increased idle time being seen. This means that the fix
>>>>>would be to review the criteria for deciding whether to move tasks in
>>>>>idle_balance().
>>>>
>>>>Look back on my implementation. The problem as I saw it was that one
>>>>task alone with a biased load would suddenly make a runqueue look much
>>>>busier than it was supposed to so I special cased the runqueue that
>>>>had precisely one task.
>>>
>>>OK. I'll look at that.
>>
>>Addressed in a separate e-mail.
>>
>>
>>>But I was thinking more about the code that (in the original) handled
>>>the case where the number of tasks to be moved was less than 1 but more
>>>than 0 (i.e. the cases where "imbalance" would have been reduced to zero
>>>when divided by SCHED_LOAD_SCALE). I think that I got that part wrong
>>>and you can end up with a bias load to be moved which is less than any
>>>of the bias_prio values for any queued tasks (in circumstances where the
>>>original code would have rounded up to 1 and caused a move). I think
>>>that the way to handle this problem is to replace 1 with "average bias
>>>prio" within that logic. This would guarantee at least one task with a
>>>bias_prio small enough to be moved.
>>>
>>>I think that this analysis is a strong argument for my original patch
>>>being the cause of the problem so I'll go ahead and generate a fix. I'll
>>>try to have a patch available later this morning.
>>
>>Attached is a patch that addresses this problem. Unlike the description
>>above it does not use "average bias prio" as that solution would be very
>>complicated. Instead it makes the assumption that NICE_TO_BIAS_PRIO(0)
>>is a "good enough" for this purpose as this is highly likely to be the
>>median bias prio and the median is probably better for this purpose than
>>the average.
>
>
> This is a shot in the dark. We haven't confirmed 1. there is a problem 2. that
> this is the problem nor 3. that this patch will fix the problem.
I disagree. I think that there is a clear mistake in my original patch
that this patch fixes.
> I say we
> wait for the results of 1. If the improved smp nice handling patch ends up
> being responsible then it should not be merged upstream, and then this patch
> can be tested on top.
>
> Martin I know your work move has made it not your responsibility to test
> backing out this change, but are you aware of anything being done to test
> this hypothesis?
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
On Thu, 12 Jan 2006 12:29 pm, Peter Williams wrote:
> Con Kolivas wrote:
> > This is a shot in the dark. We haven't confirmed 1. there is a problem 2.
> > that this is the problem nor 3. that this patch will fix the problem.
>
> I disagree. I think that there is a clear mistake in my original patch
> that this patch fixes.
I agree with you on that. The real concern is that we were just about to push
it upstream. So where does this leave us? I propose we delay merging the
"improved smp nice handling" patch into mainline pending your further
changes.
Cheers,
Con
> This is a shot in the dark. We haven't confirmed 1. there is a problem 2. that
> this is the problem nor 3. that this patch will fix the problem. I say we
> wait for the results of 1. If the improved smp nice handling patch ends up
> being responsible then it should not be merged upstream, and then this patch
> can be tested on top.
>
> Martin I know your work move has made it not your responsibility to test
> backing out this change, but are you aware of anything being done to test
> this hypothesis?
Yup, Andy queued the job ... just waiting for it to pop out of the system.
M.
Con Kolivas wrote:
> On Thu, 12 Jan 2006 12:29 pm, Peter Williams wrote:
>
>>Con Kolivas wrote:
>>
>>>This is a shot in the dark. We haven't confirmed 1. there is a problem 2.
>>>that this is the problem nor 3. that this patch will fix the problem.
>>
>>I disagree. I think that there is a clear mistake in my original patch
>>that this patch fixes.
>
>
> I agree with you on that. The real concern is that we were just about to push
> it upstream. So where does this leave us? I propose we delay merging the
> "improved smp nice handling" patch into mainline pending your further
> changes.
I think that they're already in 2.6.15, minus my "move load not tasks"
modification, which I was expecting to go into 2.6.16. Is that what you
meant?
If so, I think this is a small and obvious fix that shouldn't delay the
merging of "move load not tasks" into the mainline. But it's not my call.
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
Peter Williams wrote:
> Con Kolivas wrote:
>
>> On Thu, 12 Jan 2006 12:29 pm, Peter Williams wrote:
>>
>>> Con Kolivas wrote:
>>>
>>>> This is a shot in the dark. We haven't confirmed 1. there is a
>>>> problem 2.
>>>> that this is the problem nor 3. that this patch will fix the problem.
>>>
>>>
>>> I disagree. I think that there is a clear mistake in my original patch
>>> that this patch fixes.
>>
>>
>>
>> I agree with you on that. The real concern is that we were just about
>> to push it upstream. So where does this leave us? I propose we delay
>> merging the
>> "improved smp nice handling" patch into mainline pending your further
>> changes.
>
>
> I think that they're already in 2.6.15 minus my "move load not tasks"
> modification which I was expecting to go into 2.6.16. Is that what you
> meant?
>
> If so I think this is a small and obvious fix that shouldn't delay the
> merging of "move load not tasks" into the mainline. But it's not my call.
BTW, in reference to a question from last night ... -git7 seems fine. So
if you merged it already, it wasn't that ;-)
M.
On Thu, 12 Jan 2006 01:23 pm, Peter Williams wrote:
> Con Kolivas wrote:
> > On Thu, 12 Jan 2006 12:29 pm, Peter Williams wrote:
> >>Con Kolivas wrote:
> >>>This is a shot in the dark. We haven't confirmed 1. there is a problem
> >>> 2. that this is the problem nor 3. that this patch will fix the
> >>> problem.
> >>
> >>I disagree. I think that there is a clear mistake in my original patch
> >>that this patch fixes.
> >
> > I agree with you on that. The real concern is that we were just about to
> > push it upstream. So where does this leave us? I propose we delay
> > merging the "improved smp nice handling" patch into mainline pending your
> > further changes.
>
> I think that they're already in 2.6.15 minus my "move load not tasks"
> modification which I was expecting to go into 2.6.16. Is that what you
> meant?
Yes, akpm included your patch under the name of something like "improved smp
nice handling". My smp nice patches went into mainline 2.6.15 and were
extensively tested for performance regressions. This is about your "move load
not tasks" modification, which is the patch in question that might be affecting
performance.
Con
> If so I think this is a small and obvious fix that shouldn't delay the
> merging of "move load not tasks" into the mainline. But it's not my call.
>
> Peter
Martin Bligh wrote:
>
>> This is a shot in the dark. We haven't confirmed 1. there is a
>> problem 2. that this is the problem nor 3. that this patch will fix
>> the problem. I say we wait for the results of 1. If the improved smp
>> nice handling patch ends up being responsible then it should not be
>> merged upstream, and then this patch can be tested on top.
>>
>> Martin I know your work move has made it not your responsibility to
>> test backing out this change, but are you aware of anything being
>> done to test this hypothesis?
>
>
OK, backing out that patch seems to fix it. Thanks Andy ;-)
M.
On Thu, 12 Jan 2006 01:26 pm, Martin Bligh wrote:
> Peter Williams wrote:
> > Con Kolivas wrote:
> >> On Thu, 12 Jan 2006 12:29 pm, Peter Williams wrote:
> >>> Con Kolivas wrote:
> >>>> This is a shot in the dark. We haven't confirmed 1. there is a
> >>>> problem 2.
> >>>> that this is the problem nor 3. that this patch will fix the problem.
> >>>
> >>> I disagree. I think that there is a clear mistake in my original patch
> >>> that this patch fixes.
> >>
> >> I agree with you on that. The real concern is that we were just about
> >> to push it upstream. So where does this leave us? I propose we delay
> >> merging the
> >> "improved smp nice handling" patch into mainline pending your further
> >> changes.
> >
> > I think that they're already in 2.6.15 minus my "move load not tasks"
> > modification which I was expecting to go into 2.6.16. Is that what you
> > meant?
> >
> > If so I think this is a small and obvious fix that shouldn't delay the
> > merging of "move load not tasks" into the mainline. But it's not my
> > call.
>
> BTW, in reference to a question from last night ... -git7 seems fine. So
> if you merged it already, it wasn't that ;-)
Thanks, and it looks like the results are in from 2.6.14-rc2-mm1 with the
patch backed out.
Drumroll....
http://test.kernel.org/perf/kernbench.moe.png
The performance goes back to a range similar to 2.6.14-rc1-mm1 (see
2.6.14-rc2-mm1 + 20328 in blue). Unfortunately this does implicate this
patch. Can we put it back into -mm only and allow Peter's tweaks/fixes to go
on top and have it tested some more before going upstream?
Peter, Ingo?
Cheers,
Con
On Thu, 12 Jan 2006 05:35 pm, Martin J. Bligh wrote:
> Martin Bligh wrote:
> >> This is a shot in the dark. We haven't confirmed 1. there is a
> >> problem 2. that this is the problem nor 3. that this patch will fix
> >> the problem. I say we wait for the results of 1. If the improved smp
> >> nice handling patch ends up being responsible then it should not be
> >> merged upstream, and then this patch can be tested on top.
> >>
> >> Martin I know your work move has made it not your responsibility to
> >> test backing out this change, but are you aware of anything being
> >> done to test this hypothesis?
>
> OK, backing out that patch seems to fix it. Thanks Andy ;-)
Yes thanks!
I just noticed the result myself and emailed on another branch of this email
thread. Gah
Con
Con Kolivas wrote:
> On Thu, 12 Jan 2006 05:35 pm, Martin J. Bligh wrote:
>
>>Martin Bligh wrote:
>>
>>>>This is a shot in the dark. We haven't confirmed 1. there is a
>>>>problem 2. that this is the problem nor 3. that this patch will fix
>>>>the problem. I say we wait for the results of 1. If the improved smp
>>>>nice handling patch ends up being responsible then it should not be
>>>>merged upstream, and then this patch can be tested on top.
>>>>
>>>>Martin I know your work move has made it not your responsibility to
>>>>test backing out this change, but are you aware of anything being
>>>>done to test this hypothesis?
>>
>>OK, backing out that patch seems to fix it. Thanks Andy ;-)
>
>
> Yes thanks!
>
> I just noticed the result myself and emailed on another branch of this email
> thread. Gah
It would be nice to get a run with the patch back in and my fix applied.
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
>>
>> But I was thinking more about the code that (in the original) handled
>> the case where the number of tasks to be moved was less than 1 but
>> more than 0 (i.e. the cases where "imbalance" would have been reduced
>> to zero when divided by SCHED_LOAD_SCALE). I think that I got that
>> part wrong and you can end up with a bias load to be moved which is
>> less than any of the bias_prio values for any queued tasks (in
>> circumstances where the original code would have rounded up to 1 and
>> caused a move). I think that the way to handle this problem is to
>> replace 1 with "average bias prio" within that logic. This would
>> guarantee at least one task with a bias_prio small enough to be moved.
>>
>> I think that this analysis is a strong argument for my original patch
>> being the cause of the problem so I'll go ahead and generate a fix.
>> I'll try to have a patch available later this morning.
>
>
> Attached is a patch that addresses this problem. Unlike the description
> above it does not use "average bias prio" as that solution would be very
> complicated. Instead it makes the assumption that NICE_TO_BIAS_PRIO(0)
> is a "good enough" for this purpose as this is highly likely to be the
> median bias prio and the median is probably better for this purpose than
> the average.
>
> Signed-off-by: Peter Williams <[email protected]>
Doesn't fix the perf issue.
M.
> Peter
>
>
> ------------------------------------------------------------------------
>
> Index: MM-2.6.X/kernel/sched.c
> ===================================================================
> --- MM-2.6.X.orig/kernel/sched.c 2006-01-12 09:23:38.000000000 +1100
> +++ MM-2.6.X/kernel/sched.c 2006-01-12 10:44:50.000000000 +1100
> @@ -2116,11 +2116,11 @@ find_busiest_group(struct sched_domain *
> (avg_load - this_load) * this->cpu_power)
> / SCHED_LOAD_SCALE;
>
> - if (*imbalance < SCHED_LOAD_SCALE) {
> + if (*imbalance < NICE_TO_BIAS_PRIO(0) * SCHED_LOAD_SCALE) {
> unsigned long pwr_now = 0, pwr_move = 0;
> unsigned long tmp;
>
> - if (max_load - this_load >= SCHED_LOAD_SCALE*2) {
> + if (max_load - this_load >= NICE_TO_BIAS_PRIO(0) * SCHED_LOAD_SCALE*2) {
> *imbalance = NICE_TO_BIAS_PRIO(0);
> return busiest;
> }
Martin Bligh wrote:
>
>>>
>>> But I was thinking more about the code that (in the original) handled
>>> the case where the number of tasks to be moved was less than 1 but
>>> more than 0 (i.e. the cases where "imbalance" would have been reduced
>>> to zero when divided by SCHED_LOAD_SCALE). I think that I got that
>>> part wrong and you can end up with a bias load to be moved which is
>>> less than any of the bias_prio values for any queued tasks (in
>>> circumstances where the original code would have rounded up to 1 and
>>> caused a move). I think that the way to handle this problem is to
>>> replace 1 with "average bias prio" within that logic. This would
>>> guarantee at least one task with a bias_prio small enough to be moved.
>>>
>>> I think that this analysis is a strong argument for my original patch
>>> being the cause of the problem so I'll go ahead and generate a fix.
>>> I'll try to have a patch available later this morning.
>>
>>
>>
>> Attached is a patch that addresses this problem. Unlike the
>> description above it does not use "average bias prio" as that solution
>> would be very complicated. Instead it makes the assumption that
>> NICE_TO_BIAS_PRIO(0) is a "good enough" for this purpose as this is
>> highly likely to be the median bias prio and the median is probably
>> better for this purpose than the average.
>>
>> Signed-off-by: Peter Williams <[email protected]>
>
>
> Doesn't fix the perf issue.
OK, thanks. I think there are a few more places where SCHED_LOAD_SCALE
needs to be multiplied by NICE_TO_BIAS_PRIO(0). Basically, anywhere
that it's added to, subtracted from or compared to a load. In those
cases it's being used as a scaled version of 1 and we need a scaled
version of NICE_TO_BIAS_PRIO(0). I'll have another patch later today.
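(An example of the sort of place I mean: the wake-up balancing check in
try_to_wake_up() tests tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE,
which is really "no more than one task's worth of load", so with biased
load that threshold presumably needs to become NICE_TO_BIAS_PRIO(0) *
SCHED_LOAD_SCALE as well.)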
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
Peter Williams wrote:
> Martin Bligh wrote:
>
>>
>>>>
>>>> But I was thinking more about the code that (in the original)
>>>> handled the case where the number of tasks to be moved was less than
>>>> 1 but more than 0 (i.e. the cases where "imbalance" would have been
>>>> reduced to zero when divided by SCHED_LOAD_SCALE). I think that I
>>>> got that part wrong and you can end up with a bias load to be moved
>>>> which is less than any of the bias_prio values for any queued tasks
>>>> (in circumstances where the original code would have rounded up to 1
>>>> and caused a move). I think that the way to handle this problem is
>>>> to replace 1 with "average bias prio" within that logic. This would
>>>> guarantee at least one task with a bias_prio small enough to be moved.
>>>>
>>>> I think that this analysis is a strong argument for my original
>>>> patch being the cause of the problem so I'll go ahead and generate a
>>>> fix. I'll try to have a patch available later this morning.
>>>
>>>
>>>
>>>
>>> Attached is a patch that addresses this problem. Unlike the
>>> description above it does not use "average bias prio" as that
>>> solution would be very complicated. Instead it makes the assumption
>>> that NICE_TO_BIAS_PRIO(0) is a "good enough" for this purpose as this
>>> is highly likely to be the median bias prio and the median is
>>> probably better for this purpose than the average.
>>>
>>> Signed-off-by: Peter Williams <[email protected]>
>>
>>
>>
>> Doesn't fix the perf issue.
>
>
> OK, thanks. I think there's a few more places where SCHED_LOAD_SCALE
> needs to be multiplied by NICE_TO_BIAS_PRIO(0). Basically, anywhere
> that it's added to, subtracted from or compared to a load. In those
> cases it's being used as a scaled version of 1 and we need a scaled
This would have been better said as "the load generated by 1 task"
rather than just "a scaled version of 1". Numerically, they're the same
but one is clearer than the other and makes it more obvious why we need
NICE_TO_BIAS_PRIO(0) * SCHED_LOAD_SCALE and where we need it.
> version of NICE_TO_BIAS_PRIO(0). I'll have another patch later today.
I'm just testing this at the moment.
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
Index: MM-2.6.X/kernel/sched.c
===================================================================
--- MM-2.6.X.orig/kernel/sched.c 2006-01-13 14:53:34.000000000 +1100
+++ MM-2.6.X/kernel/sched.c 2006-01-13 15:11:19.000000000 +1100
@@ -1042,7 +1042,8 @@ void kick_process(task_t *p)
static unsigned long source_load(int cpu, int type)
{
runqueue_t *rq = cpu_rq(cpu);
- unsigned long load_now = rq->prio_bias * SCHED_LOAD_SCALE;
+ unsigned long load_now = (rq->prio_bias * SCHED_LOAD_SCALE) /
+ NICE_TO_BIAS_PRIO(0);
if (type == 0)
return load_now;
@@ -1056,7 +1057,8 @@ static unsigned long source_load(int cpu
static inline unsigned long target_load(int cpu, int type)
{
runqueue_t *rq = cpu_rq(cpu);
- unsigned long load_now = rq->prio_bias * SCHED_LOAD_SCALE;
+ unsigned long load_now = (rq->prio_bias * SCHED_LOAD_SCALE) /
+ NICE_TO_BIAS_PRIO(0);
if (type == 0)
return load_now;
@@ -1322,7 +1324,8 @@ static int try_to_wake_up(task_t *p, uns
* of the current CPU:
*/
if (sync)
- tl -= p->bias_prio * SCHED_LOAD_SCALE;
+ tl -= (p->bias_prio * SCHED_LOAD_SCALE) /
+ NICE_TO_BIAS_PRIO(0);
if ((tl <= load &&
tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
@@ -2159,7 +2162,7 @@ find_busiest_group(struct sched_domain *
}
/* Get rid of the scaling factor, rounding down as we divide */
- *imbalance = *imbalance / SCHED_LOAD_SCALE;
+ *imbalance = (*imbalance * NICE_TO_BIAS_PRIO(0)) / SCHED_LOAD_SCALE;
return busiest;
out_balanced:
@@ -2472,7 +2475,8 @@ static void rebalance_tick(int this_cpu,
struct sched_domain *sd;
int i;
- this_load = this_rq->prio_bias * SCHED_LOAD_SCALE;
+ this_load = (this_rq->prio_bias * SCHED_LOAD_SCALE) /
+ NICE_TO_BIAS_PRIO(0);
/* Update our load */
for (i = 0; i < 3; i++) {
unsigned long new_load = this_load;
Peter Williams wrote:
> Peter Williams wrote:
>
>> Peter Williams wrote:
>>
>>> Martin Bligh wrote:
>>>
>>>>
>>>>>>
>>>>>> But I was thinking more about the code that (in the original)
>>>>>> handled the case where the number of tasks to be moved was less
>>>>>> than 1 but more than 0 (i.e. the cases where "imbalance" would
>>>>>> have been reduced to zero when divided by SCHED_LOAD_SCALE). I
>>>>>> think that I got that part wrong and you can end up with a bias
>>>>>> load to be moved which is less than any of the bias_prio values
>>>>>> for any queued tasks (in circumstances where the original code
>>>>>> would have rounded up to 1 and caused a move). I think that the
>>>>>> way to handle this problem is to replace 1 with "average bias
>>>>>> prio" within that logic. This would guarantee at least one task
>>>>>> with a bias_prio small enough to be moved.
>>>>>>
>>>>>> I think that this analysis is a strong argument for my original
>>>>>> patch being the cause of the problem so I'll go ahead and generate
>>>>>> a fix. I'll try to have a patch available later this morning.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Attached is a patch that addresses this problem. Unlike the
>>>>> description above it does not use "average bias prio" as that
>>>>> solution would be very complicated. Instead it makes the
>>>>> assumption that NICE_TO_BIAS_PRIO(0) is a "good enough" for this
>>>>> purpose as this is highly likely to be the median bias prio and the
>>>>> median is probably better for this purpose than the average.
>>>>>
>>>>> Signed-off-by: Peter Williams <[email protected]>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Doesn't fix the perf issue.
>>>
>>>
>>>
>>>
>>> OK, thanks. I think there's a few more places where SCHED_LOAD_SCALE
>>> needs to be multiplied by NICE_TO_BIAS_PRIO(0). Basically, anywhere
>>> that it's added to, subtracted from or compared to a load. In those
>>> cases it's being used as a scaled version of 1 and we need a scaled
>>
>>
>>
>> This would have been better said as "the load generated by 1 task"
>> rather than just "a scaled version of 1". Numerically, they're the
>> same but one is clearer than the other and makes it more obvious why
>> we need NICE_TO_BIAS_PRIO(0) * SCHED_LOAD_SCALE and where we need it.
>>
>>> version of NICE_TO_BIAS_PRIO(0). I'll have another patch later today.
>>
>>
>>
>> I'm just testing this at the moment.
>
>
> Attached is a new patch to fix the excessive idle problem. This patch
> takes a new approach to the problem as it was becoming obvious that
> trying to alter the load balancing code to cope with biased load was
> harder than it seemed.
>
> This approach reverts to the old load values but weights them according
> to tasks' bias_prio values. This means that any assumptions by the load
> balancing code that the load generated by a single task is
> SCHED_LOAD_SCALE will still hold. Then, in find_busiest_group(), the
> imbalance is scaled back up to bias_prio scale so that move_tasks() can
> move biased load rather than tasks.
>
> One advantage of this is that when there are no non zero niced tasks the
> processing will be mathematically the same as the original code.
> Kernbench results from a 2 CPU Celeron 550Mhz system are:
>
> Average Optimal -j 8 Load Run:
> Elapsed Time 1056.16 (0.831102)
> User Time 1906.54 (1.38447)
> System Time 182.086 (0.973386)
> Percent CPU 197 (0)
> Context Switches 48727.2 (249.351)
> Sleeps 27623.4 (413.913)
>
> This indicates that, on average, 98.9% of the total available CPU was
> used by the build.
Here are the numbers for the same machine with the "improved smp nice
handling" completely removed, i.e. back to the 2.6.15 version.
Average Optimal -j 8 Load Run:
Elapsed Time 1059.95 (1.19324)
User Time 1914.94 (1.11102)
System Time 181.846 (0.916695)
Percent CPU 197.4 (0.547723)
Context Switches 40917.4 (469.184)
Sleeps 26654 (320.824)
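By the same calculation as above, that's (1914.94 + 181.846) / (2 *
1059.95), or roughly 98.9% of the available CPU, i.e. essentially the same
utilisation as with the patch applied, and the elapsed times differ by
well under half a percent.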
>
> Signed-off-by: Peter Williams <[email protected]>
>
> BTW I think that we need to think about a slightly more complex nice to
> bias mapping function. The current one gives a nice==19 1/20 of the
> bias of a nice=0 task but only gives nice=-20 tasks twice the bias of a
> nice=0 task. I don't think this is a big problem as the majority of non
> nice==0 tasks will have positive nice but should be looked at for a
> future enhancement.
>
> Peter
>
>
> ------------------------------------------------------------------------
>
> Index: MM-2.6.X/kernel/sched.c
> ===================================================================
> --- MM-2.6.X.orig/kernel/sched.c 2006-01-13 14:53:34.000000000 +1100
> +++ MM-2.6.X/kernel/sched.c 2006-01-13 15:11:19.000000000 +1100
> @@ -1042,7 +1042,8 @@ void kick_process(task_t *p)
> static unsigned long source_load(int cpu, int type)
> {
> runqueue_t *rq = cpu_rq(cpu);
> - unsigned long load_now = rq->prio_bias * SCHED_LOAD_SCALE;
> + unsigned long load_now = (rq->prio_bias * SCHED_LOAD_SCALE) /
> + NICE_TO_BIAS_PRIO(0);
>
> if (type == 0)
> return load_now;
> @@ -1056,7 +1057,8 @@ static unsigned long source_load(int cpu
> static inline unsigned long target_load(int cpu, int type)
> {
> runqueue_t *rq = cpu_rq(cpu);
> - unsigned long load_now = rq->prio_bias * SCHED_LOAD_SCALE;
> + unsigned long load_now = (rq->prio_bias * SCHED_LOAD_SCALE) /
> + NICE_TO_BIAS_PRIO(0);
>
> if (type == 0)
> return load_now;
> @@ -1322,7 +1324,8 @@ static int try_to_wake_up(task_t *p, uns
> * of the current CPU:
> */
> if (sync)
> - tl -= p->bias_prio * SCHED_LOAD_SCALE;
> + tl -= (p->bias_prio * SCHED_LOAD_SCALE) /
> + NICE_TO_BIAS_PRIO(0);
>
> if ((tl <= load &&
> tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
> @@ -2159,7 +2162,7 @@ find_busiest_group(struct sched_domain *
> }
>
> /* Get rid of the scaling factor, rounding down as we divide */
> - *imbalance = *imbalance / SCHED_LOAD_SCALE;
> + *imbalance = (*imbalance * NICE_TO_BIAS_PRIO(0)) / SCHED_LOAD_SCALE;
> return busiest;
>
> out_balanced:
> @@ -2472,7 +2475,8 @@ static void rebalance_tick(int this_cpu,
> struct sched_domain *sd;
> int i;
>
> - this_load = this_rq->prio_bias * SCHED_LOAD_SCALE;
> + this_load = (this_rq->prio_bias * SCHED_LOAD_SCALE) /
> + NICE_TO_BIAS_PRIO(0);
> /* Update our load */
> for (i = 0; i < 3; i++) {
> unsigned long new_load = this_load;
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
>
> Attached is a new patch to fix the excessive idle problem. This patch
> takes a new approach to the problem as it was becoming obvious that
> trying to alter the load balancing code to cope with biased load was
> harder than it seemed.
>
OK, well the last one didn't work.
(peterw-fix-excessive-idling-with-smp-move-load-not-tasks)
If Andy is feeling very kind, perhaps he will kick this one too.
This one is for real, right? ;-)
M.
Peter Williams wrote:
> Attached is a new patch to fix the excessive idle problem. This patch
> takes a new approach to the problem as it was becoming obvious that
> trying to alter the load balancing code to cope with biased load was
> harder than it seemed.
Ok. Tried testing different-approach-to-smp-nice-problem against the
transition release 2.6.14-rc2-mm1 but it doesn't apply. Am testing
against 2.6.15-mm3 right now. Will let you know.
-apw
Andy Whitcroft wrote:
> Peter Williams wrote:
>
>
>>Attached is a new patch to fix the excessive idle problem. This patch
>>takes a new approach to the problem as it was becoming obvious that
>>trying to alter the load balancing code to cope with biased load was
>>harder than it seemed.
>
>
> Ok. Tried testing different-approach-to-smp-nice-problem against the
> transition release 2.6.14-rc2-mm1 but it doesn't apply. Am testing
> against 2.6.15-mm3 right now. Will let you know.
Doesn't appear to help if I am analysing the graphs right. Martin?
Also tried to back out the original patch against the 2.6.15-mm3 tree
but that also won't apply (2 rejects). Peter?
-apw
Andy Whitcroft wrote:
> Andy Whitcroft wrote:
>
>>Peter Williams wrote:
>>
>>
>>
>>>Attached is a new patch to fix the excessive idle problem. This patch
>>>takes a new approach to the problem as it was becoming obvious that
>>>trying to alter the load balancing code to cope with biased load was
>>>harder than it seemed.
>>
>>
>>Ok. Tried testing different-approach-to-smp-nice-problem against the
>>transition release 2.6.14-rc2-mm1 but it doesn't apply. Am testing
>>against 2.6.15-mm3 right now. Will let you know.
>
>
> Doesn't appear to help if I am analysing the graphs right. Martin?
Nope. still broken.
> Also tried to back out the original patch against the 2.6.15-mm3 tree
> but that also won't apply (2 rejects). Peter?
>
> -apw
Index: MM-2.6.X/kernel/sched.c
===================================================================
--- MM-2.6.X.orig/kernel/sched.c 2006-01-13 18:09:49.000000000 +1100
+++ MM-2.6.X/kernel/sched.c 2006-01-13 18:21:38.000000000 +1100
@@ -63,16 +63,6 @@
#define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20)
#define TASK_NICE(p) PRIO_TO_NICE((p)->static_prio)
-#ifdef CONFIG_SMP
-/*
- * Priority bias for load balancing ranges from 1 (nice==19) to 139 (RT
- * priority of 100).
- */
-#define NICE_TO_BIAS_PRIO(nice) (20 - (nice))
-#define PRIO_TO_BIAS_PRIO(prio) NICE_TO_BIAS_PRIO(PRIO_TO_NICE(prio))
-#define RTPRIO_TO_BIAS_PRIO(rp) (40 + (rp))
-#endif
-
/*
* 'User priority' is the nice value converted to something we
* can work with better when scaling various scheduler parameters,
@@ -695,53 +685,46 @@ static int effective_prio(task_t *p)
}
#ifdef CONFIG_SMP
-static inline void set_bias_prio(task_t *p)
-{
- if (rt_task(p)) {
- if (p == task_rq(p)->migration_thread)
- /*
- * The migration thread does the actual balancing. Do
- * not bias by its priority as the ultra high priority
- * will skew balancing adversely.
- */
- p->bias_prio = 0;
- else
- p->bias_prio = RTPRIO_TO_BIAS_PRIO(p->rt_priority);
- } else
- p->bias_prio = PRIO_TO_BIAS_PRIO(p->static_prio);
-}
-
-static inline void inc_prio_bias(runqueue_t *rq, const task_t *p)
+static inline void inc_prio_bias(runqueue_t *rq, int prio)
{
- rq->prio_bias += p->bias_prio;
+ rq->prio_bias += MAX_PRIO - prio;
}
-static inline void dec_prio_bias(runqueue_t *rq, const task_t *p)
+static inline void dec_prio_bias(runqueue_t *rq, int prio)
{
- rq->prio_bias -= p->bias_prio;
+ rq->prio_bias -= MAX_PRIO - prio;
}
static inline void inc_nr_running(task_t *p, runqueue_t *rq)
{
rq->nr_running++;
- inc_prio_bias(rq, p);
+ if (rt_task(p)) {
+ if (p != rq->migration_thread)
+ /*
+ * The migration thread does the actual balancing. Do
+ * not bias by its priority as the ultra high priority
+ * will skew balancing adversely.
+ */
+ inc_prio_bias(rq, p->prio);
+ } else
+ inc_prio_bias(rq, p->static_prio);
}
static inline void dec_nr_running(task_t *p, runqueue_t *rq)
{
rq->nr_running--;
- dec_prio_bias(rq, p);
+ if (rt_task(p)) {
+ if (p != rq->migration_thread)
+ dec_prio_bias(rq, p->prio);
+ } else
+ dec_prio_bias(rq, p->static_prio);
}
#else
-static inline void set_bias_prio(task_t *p)
-{
-}
-
-static inline void inc_prio_bias(runqueue_t *rq, const task_t *p)
+static inline void inc_prio_bias(runqueue_t *rq, int prio)
{
}
-static inline void dec_prio_bias(runqueue_t *rq, const task_t *p)
+static inline void dec_prio_bias(runqueue_t *rq, int prio)
{
}
@@ -1039,29 +1022,61 @@ void kick_process(task_t *p)
* We want to under-estimate the load of migration sources, to
* balance conservatively.
*/
-static unsigned long source_load(int cpu, int type)
+static inline unsigned long __source_load(int cpu, int type, enum idle_type idle)
{
runqueue_t *rq = cpu_rq(cpu);
- unsigned long load_now = rq->prio_bias * SCHED_LOAD_SCALE;
+ unsigned long running = rq->nr_running;
+ unsigned long source_load, cpu_load = rq->cpu_load[type-1],
+ load_now = running * SCHED_LOAD_SCALE;
if (type == 0)
- return load_now;
+ source_load = load_now;
+ else
+ source_load = min(cpu_load, load_now);
+
+ if (running > 1 || (idle == NOT_IDLE && running))
+ /*
+ * If we are busy rebalancing the load is biased by
+ * priority to create 'nice' support across cpus. When
+ * idle rebalancing we should only bias the source_load if
+ * there is more than one task running on that queue to
+ * prevent idle rebalance from trying to pull tasks from a
+ * queue with only one running task.
+ */
+ source_load = source_load * rq->prio_bias / running;
+
+ return source_load;
+}
- return min(rq->cpu_load[type-1], load_now);
+static inline unsigned long source_load(int cpu, int type)
+{
+ return __source_load(cpu, type, NOT_IDLE);
}
/*
* Return a high guess at the load of a migration-target cpu
*/
-static inline unsigned long target_load(int cpu, int type)
+static inline unsigned long __target_load(int cpu, int type, enum idle_type idle)
{
runqueue_t *rq = cpu_rq(cpu);
- unsigned long load_now = rq->prio_bias * SCHED_LOAD_SCALE;
+ unsigned long running = rq->nr_running;
+ unsigned long target_load, cpu_load = rq->cpu_load[type-1],
+ load_now = running * SCHED_LOAD_SCALE;
if (type == 0)
- return load_now;
+ target_load = load_now;
+ else
+ target_load = max(cpu_load, load_now);
+
+ if (running > 1 || (idle == NOT_IDLE && running))
+ target_load = target_load * rq->prio_bias / running;
+
+ return target_load;
+}
- return max(rq->cpu_load[type-1], load_now);
+static inline unsigned long target_load(int cpu, int type)
+{
+ return __target_load(cpu, type, NOT_IDLE);
}
/*
@@ -1322,7 +1337,7 @@ static int try_to_wake_up(task_t *p, uns
* of the current CPU:
*/
if (sync)
- tl -= p->bias_prio * SCHED_LOAD_SCALE;
+ tl -= SCHED_LOAD_SCALE;
if ((tl <= load &&
tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
@@ -1934,16 +1949,15 @@ int can_migrate_task(task_t *p, runqueue
* Called with both runqueues locked.
*/
static int move_tasks(runqueue_t *this_rq, int this_cpu, runqueue_t *busiest,
- unsigned long max_nr_move, long max_bias_move,
- struct sched_domain *sd, enum idle_type idle,
- int *all_pinned)
+ unsigned long max_nr_move, struct sched_domain *sd,
+ enum idle_type idle, int *all_pinned)
{
prio_array_t *array, *dst_array;
struct list_head *head, *curr;
int idx, pulled = 0, pinned = 0;
task_t *tmp;
- if (max_nr_move == 0 || max_bias_move == 0)
+ if (max_nr_move == 0)
goto out;
pinned = 1;
@@ -1986,8 +2000,7 @@ skip_queue:
curr = curr->prev;
- if (tmp->bias_prio > max_bias_move ||
- !can_migrate_task(tmp, busiest, this_cpu, sd, idle, &pinned)) {
+ if (!can_migrate_task(tmp, busiest, this_cpu, sd, idle, &pinned)) {
if (curr != head)
goto skip_queue;
idx++;
@@ -2001,13 +2014,9 @@ skip_queue:
pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
pulled++;
- max_bias_move -= tmp->bias_prio;
- /*
- * We only want to steal up to the prescribed number of tasks
- * and the prescribed amount of biased load.
- */
- if (pulled < max_nr_move && max_bias_move > 0) {
+ /* We only want to steal up to the prescribed number of tasks. */
+ if (pulled < max_nr_move) {
if (curr != head)
goto skip_queue;
idx++;
@@ -2028,7 +2037,7 @@ out:
/*
* find_busiest_group finds and returns the busiest CPU group within the
- * domain. It calculates and returns the amount of biased load which should be
+ * domain. It calculates and returns the number of tasks which should be
* moved to restore balance via the imbalance parameter.
*/
static struct sched_group *
@@ -2064,9 +2073,9 @@ find_busiest_group(struct sched_domain *
/* Bias balancing toward cpus of our domain */
if (local_group)
- load = target_load(i, load_idx);
+ load = __target_load(i, load_idx, idle);
else
- load = source_load(i, load_idx);
+ load = __source_load(i, load_idx, idle);
avg_load += load;
}
@@ -2121,7 +2130,7 @@ find_busiest_group(struct sched_domain *
unsigned long tmp;
if (max_load - this_load >= SCHED_LOAD_SCALE*2) {
- *imbalance = NICE_TO_BIAS_PRIO(0);
+ *imbalance = 1;
return busiest;
}
@@ -2154,7 +2163,7 @@ find_busiest_group(struct sched_domain *
if (pwr_move <= pwr_now)
goto out_balanced;
- *imbalance = NICE_TO_BIAS_PRIO(0);
+ *imbalance = 1;
return busiest;
}
@@ -2171,14 +2180,15 @@ out_balanced:
/*
* find_busiest_queue - find the busiest runqueue among the cpus in group.
*/
-static runqueue_t *find_busiest_queue(struct sched_group *group)
+static runqueue_t *find_busiest_queue(struct sched_group *group,
+ enum idle_type idle)
{
unsigned long load, max_load = 0;
runqueue_t *busiest = NULL;
int i;
for_each_cpu_mask(i, group->cpumask) {
- load = source_load(i, 0);
+ load = __source_load(i, 0, idle);
if (load > max_load) {
max_load = load;
@@ -2195,7 +2205,6 @@ static runqueue_t *find_busiest_queue(st
*/
#define MAX_PINNED_INTERVAL 512
-#define minus_1_or_zero(n) ((n) > 0 ? (n) - 1 : 0)
/*
* Check this_cpu to ensure it is balanced within domain. Attempt to move
* tasks if there is an imbalance.
@@ -2223,7 +2232,7 @@ static int load_balance(int this_cpu, ru
goto out_balanced;
}
- busiest = find_busiest_queue(group);
+ busiest = find_busiest_queue(group, idle);
if (!busiest) {
schedstat_inc(sd, lb_nobusyq[idle]);
goto out_balanced;
@@ -2243,7 +2252,6 @@ static int load_balance(int this_cpu, ru
*/
double_rq_lock(this_rq, busiest);
nr_moved = move_tasks(this_rq, this_cpu, busiest,
- minus_1_or_zero(busiest->nr_running),
imbalance, sd, idle, &all_pinned);
double_rq_unlock(this_rq, busiest);
@@ -2347,7 +2355,7 @@ static int load_balance_newidle(int this
goto out_balanced;
}
- busiest = find_busiest_queue(group);
+ busiest = find_busiest_queue(group, NEWLY_IDLE);
if (!busiest) {
schedstat_inc(sd, lb_nobusyq[NEWLY_IDLE]);
goto out_balanced;
@@ -2362,7 +2370,6 @@ static int load_balance_newidle(int this
/* Attempt to move tasks */
double_lock_balance(this_rq, busiest);
nr_moved = move_tasks(this_rq, this_cpu, busiest,
- minus_1_or_zero(busiest->nr_running),
imbalance, sd, NEWLY_IDLE, NULL);
spin_unlock(&busiest->lock);
}
@@ -2443,8 +2450,7 @@ static void active_load_balance(runqueue
schedstat_inc(sd, alb_cnt);
- if (move_tasks(target_rq, target_cpu, busiest_rq, 1,
- RTPRIO_TO_BIAS_PRIO(100), sd, SCHED_IDLE, NULL))
+ if (move_tasks(target_rq, target_cpu, busiest_rq, 1, sd, SCHED_IDLE, NULL))
schedstat_inc(sd, alb_pushed);
else
schedstat_inc(sd, alb_failed);
@@ -2472,7 +2478,7 @@ static void rebalance_tick(int this_cpu,
struct sched_domain *sd;
int i;
- this_load = this_rq->prio_bias * SCHED_LOAD_SCALE;
+ this_load = this_rq->nr_running * SCHED_LOAD_SCALE;
/* Update our load */
for (i = 0; i < 3; i++) {
unsigned long new_load = this_load;
@@ -3612,19 +3618,18 @@ void set_user_nice(task_t *p, long nice)
array = p->array;
if (array) {
dequeue_task(p, array);
- dec_prio_bias(rq, p);
+ dec_prio_bias(rq, p->static_prio);
}
old_prio = p->prio;
new_prio = NICE_TO_PRIO(nice);
delta = new_prio - old_prio;
p->static_prio = NICE_TO_PRIO(nice);
- set_bias_prio(p);
p->prio += delta;
if (array) {
enqueue_task(p, array);
- inc_prio_bias(rq, p);
+ inc_prio_bias(rq, p->static_prio);
/*
* If the task increased its priority or is running and
* lowered its priority, then reschedule its CPU:
@@ -3767,7 +3772,6 @@ static void __setscheduler(struct task_s
if (policy == SCHED_BATCH)
p->sleep_avg = 0;
}
- set_bias_prio(p);
}
/**
@@ -6309,7 +6313,6 @@ void __init sched_init(void)
}
}
- set_bias_prio(&init_task);
/*
* The boot idle thread does lazy MMU switching as well:
*/
Martin Bligh wrote:
> Andy Whitcroft wrote:
>
>> Andy Whitcroft wrote:
>>
>>> Peter Williams wrote:
>>>
>>>
>>>
>>>> Attached is a new patch to fix the excessive idle problem. This patch
>>>> takes a new approach to the problem as it was becoming obvious that
>>>> trying to alter the load balancing code to cope with biased load was
>>>> harder than it seemed.
>>>
>>>
>>>
>>> Ok. Tried testing different-approach-to-smp-nice-problem against the
>>> transition release 2.6.14-rc2-mm1 but it doesn't apply. Am testing
>>> against 2.6.15-mm3 right now. Will let you know.
>>
>>
>>
>> Doesn't appear to help if I am analysing the graphs right. Martin?
>
>
> Nope. still broken.
Interesting. The only real difference between this and Con's original
patch is the stuff that he did in source_load() and target_load() to
nobble the bias when nr_running is 1 or less. With this new model it
should be possible to do something similar in those functions but I'll
hold off doing anything until a comparison against 2.6.15-mm3 with the
patch removed is available (as there are other scheduler changes in -mm3).
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
Peter Williams wrote:
> Martin Bligh wrote:
>
>> Andy Whitcroft wrote:
>>
>>> Andy Whitcroft wrote:
>>>
>>>> Peter Williams wrote:
>>>>
>>>>
>>>>
>>>>> Attached is a new patch to fix the excessive idle problem. This patch
>>>>> takes a new approach to the problem as it was becoming obvious that
>>>>> trying to alter the load balancing code to cope with biased load was
>>>>> harder than it seemed.
>>>>
>>>>
>>>>
>>>>
>>>> Ok. Tried testing different-approach-to-smp-nice-problem against the
>>>> transition release 2.6.14-rc2-mm1 but it doesn't apply. Am testing
>>>> against 2.6.15-mm3 right now. Will let you know.
>>>
>>>
>>>
>>>
>>> Doesn't appear to help if I am analysing the graphs right. Martin?
>>
>>
>>
>> Nope. still broken.
>
>
> Interesting. The only real difference between this and Con's original
> patch is the stuff that he did in source_load() and target_load() to
> nobble the bias when nr_running is 1 or less. With this new model it
> should be possible to do something similar in those functions but I'll
> hold off doing anything until a comparison against 2.6.15-mm3 with the
> patch removed is available (as there are other scheduler changes in -mm3).
>
Ideally, balancing should be completely unaffected when all tasks are
of priority 0 which is what I thought yours did, and why I think the
current system is not great.
I'll probably end up taking a look at it one day, if it doesn't get fixed.
I think your patch is pretty close but I didn't quite look close enough to
work out what's going wrong.
--
SUSE Labs, Novell Inc.
On Saturday 14 January 2006 16:03, Nick Piggin wrote:
> Ideally, balancing should be completely unaffected when all tasks are
> of priority 0 which is what I thought yours did, and why I think the
> current system is not great.
The current smp nice in mainline 2.6.15 is performance neutral when all are
the same priority. It's only this improved version in -mm that is not.
Con
Nick Piggin wrote:
> Peter Williams wrote:
>
>> Martin Bligh wrote:
>>
>>> Andy Whitcroft wrote:
>>>
>>>> Andy Whitcroft wrote:
>>>>
>>>>> Peter Williams wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Attached is a new patch to fix the excessive idle problem. This
>>>>>> patch
>>>>>> takes a new approach to the problem as it was becoming obvious that
>>>>>> trying to alter the load balancing code to cope with biased load was
>>>>>> harder than it seemed.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Ok. Tried testing different-approach-to-smp-nice-problem against the
>>>>> transition release 2.6.14-rc2-mm1 but it doesn't apply. Am testing
>>>>> against 2.6.15-mm3 right now. Will let you know.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Doesn't appear to help if I am analysing the graphs right. Martin?
>>>
>>>
>>>
>>>
>>> Nope. still broken.
>>
>>
>>
>> Interesting. The only real difference between this and Con's original
>> patch is the stuff that he did in source_load() and target_load() to
>> nobble the bias when nr_running is 1 or less. With this new model it
>> should be possible to do something similar in those functions but I'll
>> hold off doing anything until a comparison against 2.6.15-mm3 with the
>> patch removed is available (as there are other scheduler changes in
>> -mm3).
>>
>
> Ideally, balancing should be completely unaffected when all tasks are
> of priority 0 which is what I thought yours did, and why I think the
> current system is not great.
This latest change should make that happen as the weight for nice=0
tasks is unity.
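A quick standalone check of that claim (a sketch only, not kernel code;
SCHED_LOAD_SCALE == 128 and the NICE_TO_BIAS_PRIO() mapping are taken from
the patches in this thread): with nothing but nice==0 tasks the weighted
load collapses back to nr_running * SCHED_LOAD_SCALE, so the balancer sees
exactly what it saw before the patch.

#include <assert.h>

#define SCHED_LOAD_SCALE	128UL
#define NICE_TO_BIAS_PRIO(nice)	(20 - (nice))

/* the conversion used by weighted_load() in the patch */
static unsigned long weighted_load(unsigned long bias)
{
	return (bias * SCHED_LOAD_SCALE) / NICE_TO_BIAS_PRIO(0);
}

int main(void)
{
	unsigned long nr_running;

	for (nr_running = 0; nr_running < 1024; nr_running++) {
		/* every runnable task is nice==0, so each adds a bias of 20 */
		unsigned long prio_bias = nr_running * NICE_TO_BIAS_PRIO(0);

		/* weighted load equals the old nr_running * SCHED_LOAD_SCALE */
		assert(weighted_load(prio_bias) ==
		       nr_running * SCHED_LOAD_SCALE);
	}
	return 0;
}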
>
> I'll probably end up taking a look at it one day, if it doesn't get fixed.
> I think your patch is pretty close
I thought that it was there :-).
I figured using the weights (which go away for nice=0 tasks) would make
it behave nicely with the rest of the load balancing code.
> but I didn't quite look close enough to
> work out what's going wrong.
My testing (albeit on an old 2 CPU Celeron) showed no statistically
significant difference between runs with and without the patch. If you
ignored the standard deviations and statistical practice and just looked
at the raw numbers you'd think it was better with the patch than without
it. :-)
I assume that Andy Whitcroft is doing a kernbench with the patch removed
from 2.6.15-mm3 (otherwise why would he ask for a patch to do that) and
I'm waiting to see how that compares with the run he did with it in.
There were other scheduling changes in 2.6.15-mm3 so I think this
comparison is needed in order to be sure that any degradation is still
due to my patch.
Peter
PS A question for you as load balancing maintainer: is the value 128 set
in cement for SCHED_LOAD_SCALE? The reason I ask is that if it were
changed to be a multiple of NICE_TO_BIAS_PRIO(0) (i.e. 20) my
modification could be made slightly more efficient.
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
Con Kolivas wrote:
> On Saturday 14 January 2006 16:03, Nick Piggin wrote:
>
>>Ideally, balancing should be completely unaffected when all tasks are
>>of priority 0 which is what I thought yours did, and why I think the
>>current system is not great.
>
>
> The current smp nice in mainline 2.6.15 is performance neutral when all are
> the same priority. It's only this improved version in -mm that is not.
>
Oh sure. I wasn't talking about performance though.
--
SUSE Labs, Novell Inc.
Peter Williams wrote:
> Nick Piggin wrote:
>
> I figured using the weights (which go away for nice=0 tasks) would make
> it behave nicely with the rest of the load balancing code.
>
OK, if you keep working on it that would be great.
>> but I didn't quite look close enough to
>> work out what's going wrong.
>
>
> My testing (albeit on an old 2 CPU Celeron) showed no statistically
> significant difference between runs with and without the patch. If you
> ignored the standard deviations and statistical practice and just looked
> at the raw numbers you'd think it was better with the patch than without
> it. :-)
>
> I assume that Andy Whitcroft is doing a kernbench with the patch removed
> from 2.6.15-mm3 (otherwise why would he ask for a patch to do that) and
> I'm waiting to see how that compares with the run he did with it in.
> There were other scheduling changes in 2.6.15-mm3 so I think this
> comparison is needed in order to be sure that any degradation is still
> due to my patch.
>
> Peter
> PS A question for you as load balancing maintainer: is the value 128 set
> in cement for SCHED_LOAD_SCALE? The reason I ask is that if it were
> changed to be a multiple of NICE_TO_BIAS_PRIO(0) (i.e. 20) my
> modification could be made slightly more efficient.
Not set in stone but it should really be a power of two because there are
quite a lot of multiplies and divides done with it.
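A toy illustration of that point (not kernel code): a divide by a power of
two such as 128 compiles down to a shift, while a divide by an arbitrary
constant like 20 needs a real division instruction or a
multiply-by-reciprocal sequence.

#include <stdio.h>

int main(void)
{
	unsigned long load = 2560;

	/* power-of-two scale: the compiler can emit a shift */
	printf("load / 128 = %lu (same as load >> 7 = %lu)\n",
	       load / 128, load >> 7);

	/* scale of 20: no shift possible, a real divide is needed */
	printf("load / 20  = %lu\n", load / 20);
	return 0;
}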
--
SUSE Labs, Novell Inc.
>
> Attached is a new patch to fix the excessive idle problem. This patch
> takes a new approach to the problem as it was becoming obvious that
> trying to alter the load balancing code to cope with biased load was
> harder than it seemed.
>
> This approach reverts to the old load values but weights them
> according to tasks' bias_prio values. This means that any assumptions
> by the load balancing code that the load generated by a single task is
> SCHED_LOAD_SCALE will still hold. Then, in find_busiest_group(), the
> imbalance is scaled back up to bias_prio scale so that move_tasks()
> can move biased load rather than tasks.
>
OK, this one seems to fix the issue that I had, AFAICS. Congrats, and
thanks,
M.
> One advantage of this is that when there are no tasks with a non-zero
> nice value the processing will be mathematically the same as the
> original code.
> Kernbench results from a 2 CPU Celeron 550Mhz system are:
>
> Average Optimal -j 8 Load Run:
> Elapsed Time 1056.16 (0.831102)
> User Time 1906.54 (1.38447)
> System Time 182.086 (0.973386)
> Percent CPU 197 (0)
> Context Switches 48727.2 (249.351)
> Sleeps 27623.4 (413.913)
>
> This indicates that, on average, 98.9% of the total available CPU was
> used by the build.
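(For reference, that figure is just user plus system time divided by the
total CPU time available over the run on a 2 CPU box:
(1906.54 + 182.086) / (2 * 1056.16) = 0.989.)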
>
> Signed-off-by: Peter Williams <[email protected]>
>
> BTW I think that we need to think about a slightly more complex nice
> to bias mapping function. The current one gives a nice==19 task 1/20 of
> the bias of a nice==0 task but only gives a nice==-20 task twice the bias
> of a nice==0 task. I don't think this is a big problem, as the majority
> of non nice==0 tasks will have positive nice, but it should be looked at
> for a future enhancement.
>
> Peter
>
>------------------------------------------------------------------------
>
>Index: MM-2.6.X/kernel/sched.c
>===================================================================
>--- MM-2.6.X.orig/kernel/sched.c 2006-01-13 14:53:34.000000000 +1100
>+++ MM-2.6.X/kernel/sched.c 2006-01-13 15:11:19.000000000 +1100
>@@ -1042,7 +1042,8 @@ void kick_process(task_t *p)
> static unsigned long source_load(int cpu, int type)
> {
> runqueue_t *rq = cpu_rq(cpu);
>- unsigned long load_now = rq->prio_bias * SCHED_LOAD_SCALE;
>+ unsigned long load_now = (rq->prio_bias * SCHED_LOAD_SCALE) /
>+ NICE_TO_BIAS_PRIO(0);
>
> if (type == 0)
> return load_now;
>@@ -1056,7 +1057,8 @@ static unsigned long source_load(int cpu
> static inline unsigned long target_load(int cpu, int type)
> {
> runqueue_t *rq = cpu_rq(cpu);
>- unsigned long load_now = rq->prio_bias * SCHED_LOAD_SCALE;
>+ unsigned long load_now = (rq->prio_bias * SCHED_LOAD_SCALE) /
>+ NICE_TO_BIAS_PRIO(0);
>
> if (type == 0)
> return load_now;
>@@ -1322,7 +1324,8 @@ static int try_to_wake_up(task_t *p, uns
> * of the current CPU:
> */
> if (sync)
>- tl -= p->bias_prio * SCHED_LOAD_SCALE;
>+ tl -= (p->bias_prio * SCHED_LOAD_SCALE) /
>+ NICE_TO_BIAS_PRIO(0);
>
> if ((tl <= load &&
> tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>@@ -2159,7 +2162,7 @@ find_busiest_group(struct sched_domain *
> }
>
> /* Get rid of the scaling factor, rounding down as we divide */
>- *imbalance = *imbalance / SCHED_LOAD_SCALE;
>+ *imbalance = (*imbalance * NICE_TO_BIAS_PRIO(0)) / SCHED_LOAD_SCALE;
> return busiest;
>
> out_balanced:
>@@ -2472,7 +2475,8 @@ static void rebalance_tick(int this_cpu,
> struct sched_domain *sd;
> int i;
>
>- this_load = this_rq->prio_bias * SCHED_LOAD_SCALE;
>+ this_load = (this_rq->prio_bias * SCHED_LOAD_SCALE) /
>+ NICE_TO_BIAS_PRIO(0);
> /* Update our load */
> for (i = 0; i < 3; i++) {
> unsigned long new_load = this_load;
>
>
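To put rough numbers on the nice-to-bias mapping asymmetry Peter notes
above (an illustrative aside; NICE_TO_BIAS_PRIO() is the mapping defined
in these patches, the surrounding program is just a throwaway test):

#include <stdio.h>

#define NICE_TO_BIAS_PRIO(nice)	(20 - (nice))

int main(void)
{
	/* end points of the current linear nice -> bias mapping */
	printf("nice  19 -> bias %d (1/20 of a nice==0 task's 20)\n",
	       NICE_TO_BIAS_PRIO(19));
	printf("nice   0 -> bias %d\n", NICE_TO_BIAS_PRIO(0));
	printf("nice -20 -> bias %d (only 2x a nice==0 task's 20)\n",
	       NICE_TO_BIAS_PRIO(-20));
	return 0;
}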
Index: MM-2.6.X/kernel/sched.c
===================================================================
--- MM-2.6.X.orig/kernel/sched.c 2006-01-15 09:54:27.000000000 +1100
+++ MM-2.6.X/kernel/sched.c 2006-01-15 10:50:11.000000000 +1100
@@ -681,6 +681,13 @@ static int effective_prio(task_t *p)
}
#ifdef CONFIG_SMP
+/*
+ * To aid in avoiding the subversion of "niceness" due to uneven distribution
+ * of tasks with abnormal "nice" values accross CPUs the contribution that
+ * each task makes to its run queue's load is weighted according to its
+ * scheduling class and "nice" value. The bias_prio field holds the value
+ * used to calculate the weight for each task.
+ */
static inline void set_bias_prio(task_t *p)
{
if (rt_task(p)) {
@@ -718,6 +725,18 @@ static inline void dec_nr_running(task_t
rq->nr_running--;
dec_prio_bias(rq, p);
}
+
+/* convert biased priority to scaled weighted load */
+static inline unsigned long weighted_load(unsigned long bias)
+{
+ return (bias * SCHED_LOAD_SCALE) / NICE_TO_BIAS_PRIO(0);
+}
+
+/* convert scaled weighted load to unscaled biased load */
+static inline unsigned long biased_load(unsigned long wload)
+{
+ return (wload * NICE_TO_BIAS_PRIO(0)) / SCHED_LOAD_SCALE;
+}
#else
static inline void set_bias_prio(task_t *p)
{
@@ -1011,7 +1030,8 @@ void kick_process(task_t *p)
}
/*
- * Return a low guess at the load of a migration-source cpu.
+ * Return a low guess at the load of a migration-source cpu weighted
+ * according to the scheduling class and "nice" value.
*
* We want to under-estimate the load of migration sources, to
* balance conservatively.
@@ -1019,7 +1039,7 @@ void kick_process(task_t *p)
static unsigned long source_load(int cpu, int type)
{
runqueue_t *rq = cpu_rq(cpu);
- unsigned long load_now = rq->prio_bias * SCHED_LOAD_SCALE;
+ unsigned long load_now = weighted_load(rq->prio_bias);
if (type == 0)
return load_now;
@@ -1028,12 +1048,13 @@ static unsigned long source_load(int cpu
}
/*
- * Return a high guess at the load of a migration-target cpu
+ * Return a high guess at the load of a migration-target cpu weighted
+ * according to the scheduling class and "nice" value.
*/
static inline unsigned long target_load(int cpu, int type)
{
runqueue_t *rq = cpu_rq(cpu);
- unsigned long load_now = rq->prio_bias * SCHED_LOAD_SCALE;
+ unsigned long load_now = weighted_load(rq->prio_bias);
if (type == 0)
return load_now;
@@ -1299,7 +1320,7 @@ static int try_to_wake_up(task_t *p, uns
* of the current CPU:
*/
if (sync)
- tl -= p->bias_prio * SCHED_LOAD_SCALE;
+ tl -= weighted_load(p->bias_prio);
if ((tl <= load &&
tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
@@ -1903,9 +1924,9 @@ int can_migrate_task(task_t *p, runqueue
}
/*
- * move_tasks tries to move up to max_nr_move tasks from busiest to this_rq,
- * as part of a balancing operation within "domain". Returns the number of
- * tasks moved.
+ * move_tasks tries to move up to max_nr_move tasks and max_bias_move biased
+ * load from busiest to this_rq, as part of a balancing operation within
+ * "domain". Returns the number of tasks moved.
*
* Called with both runqueues locked.
*/
@@ -2134,8 +2155,11 @@ find_busiest_group(struct sched_domain *
return busiest;
}
- /* Get rid of the scaling factor, rounding down as we divide */
- *imbalance = *imbalance / SCHED_LOAD_SCALE;
+ /*
+ * Get rid of the scaling factor, rounding down as we divide and
+ * converting to biased load for use by move_tasks()
+ */
+ *imbalance = biased_load(*imbalance);
return busiest;
out_balanced:
@@ -2448,7 +2472,8 @@ static void rebalance_tick(int this_cpu,
struct sched_domain *sd;
int i;
- this_load = this_rq->prio_bias * SCHED_LOAD_SCALE;
+ /* weight load according to scheduling class and "nice" value */
+ this_load = weighted_load(this_rq->prio_bias);
/* Update our load */
for (i = 0; i < 3; i++) {
unsigned long new_load = this_load;
Index: MM-2.6.X/include/linux/sched.h
===================================================================
--- MM-2.6.X.orig/include/linux/sched.h 2006-01-15 09:54:27.000000000 +1100
+++ MM-2.6.X/include/linux/sched.h 2006-01-15 10:14:42.000000000 +1100
@@ -714,7 +714,7 @@ struct task_struct {
#endif
int prio, static_prio;
#ifdef CONFIG_SMP
- int bias_prio;
+ int bias_prio; /* load "weight" factor for load balancing purposes */
#endif
struct list_head run_list;
prio_array_t *array;
On Sunday 15 January 2006 11:05, Peter Williams wrote:
> Martin J. Bligh wrote:
> >> Attached is a new patch to fix the excessive idle problem. This patch
> >> takes a new approach to the problem as it was becoming obvious that
> >> trying to alter the load balancing code to cope with biased load was
> >> harder than it seemed.
> >>
> >> This approach reverts to the old load values but weights them
> >> according to tasks' bias_prio values. This means that any assumptions
> >> by the load balancing code that the load generated by a single task is
> >> SCHED_LOAD_SCALE will still hold. Then, in find_busiest_group(), the
> >> imbalance is scaled back up to bias_prio scale so that move_tasks()
> >> can move biased load rather than tasks.
> >
> > OK, this one seems to fix the issue that I had, AFAICS. Congrats, and
> > thanks,
>
> Terrific, thanks for testing.
>
> Con,
> Attached is a cleaned up version of this patch against 2.6.15-mm4 with
> some (hopefully helpful) comments added.
Great! Well done.
> Signed-off-by: Peter Williams <[email protected]>
Acked-by: Con Kolivas <[email protected]>
On Sunday 15 January 2006 11:05, Peter Williams wrote:
> Martin J. Bligh wrote:
> >> Attached is a new patch to fix the excessive idle problem. This patch
> >> takes a new approach to the problem as it was becoming obvious that
> >> trying to alter the load balancing code to cope with biased load was
> >> harder than it seemed.
> >>
> >> This approach reverts to the old load values but weights them
> >> according to tasks' bias_prio values. This means that any assumptions
> >> by the load balancing code that the load generated by a single task is
> >> SCHED_LOAD_SCALE will still hold. Then, in find_busiest_group(), the
> >> imbalance is scaled back up to bias_prio scale so that move_tasks()
> >> can move biased load rather than tasks.
> >
> > OK, this one seems to fix the issue that I had, AFAICS. Congrats, and
> > thanks,
>
> Terrific, thanks for testing.
>
> Con,
> Attached is a cleaned up version of this patch against 2.6.15-mm4 with
> some (hopefully helpful) comments added.
One minor quibble.
Cheers,
Con
---
These #defines have no cost on !CONFIG_SMP without the ifdefs.
Remove the unnecessary ifdefs.
Signed-off-by: Con Kolivas <[email protected]>
kernel/sched.c | 2 --
1 files changed, 2 deletions(-)
Index: linux-2.6.15-mm4/kernel/sched.c
===================================================================
--- linux-2.6.15-mm4.orig/kernel/sched.c
+++ linux-2.6.15-mm4/kernel/sched.c
@@ -63,7 +63,6 @@
#define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20)
#define TASK_NICE(p) PRIO_TO_NICE((p)->static_prio)
-#ifdef CONFIG_SMP
/*
* Priority bias for load balancing ranges from 1 (nice==19) to 139 (RT
* priority of 100).
@@ -71,7 +70,6 @@
#define NICE_TO_BIAS_PRIO(nice) (20 - (nice))
#define PRIO_TO_BIAS_PRIO(prio) NICE_TO_BIAS_PRIO(PRIO_TO_NICE(prio))
#define RTPRIO_TO_BIAS_PRIO(rp) (40 + (rp))
-#endif
/*
* 'User priority' is the nice value converted to something we
* Peter Williams <[email protected]> wrote:
> Signed-off-by: Peter Williams <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
nice work! There's a small typo here:
> + * To aid in avoiding the subversion of "niceness" due to uneven distribution
> + * of tasks with abnormal "nice" values accross CPUs the contribution that
s/accross/across
Ingo
> Thanks and looks like the results are in from 2.6.14-rc2-mm1 with the patch
> backed out.
>
> Drumroll....
>
> http://test.kernel.org/perf/kernbench.moe.png
>
> The performance goes back to a range similar to 2.6.14-rc1-mm1 (see
> 2.6.14-rc2-mm1 + 20328 in blue). Unfortunately this does implicate this
> patch. Can we put it back into -mm only and allow Peter's tweaks/fixes to go
> on top and have it tested some more before going upstream?
Hmm. Looks like we didn't get this as fixed up as I thought. Moe seems
to be fixed (16x NUMA-Q), but elm3b132 is not (it's 4x, flat SMP ia32).
Look at the latest graphs ....
Is it possible it only got fixed for NUMA boxes?
M.
Martin Bligh wrote:
>
>> Thanks and looks like the results are in from 2.6.14-rc2-mm1 with the
>> patch backed out.
>>
>> Drumroll....
>>
>> http://test.kernel.org/perf/kernbench.moe.png
>>
>> The performance goes back to a range similar to 2.6.14-rc1-mm1 (see
>> 2.6.14-rc2-mm1 + 20328 in blue). Unfortunately this does implicate
>> this patch. Can we put it back into -mm only and allow Peter's
>> tweaks/fixes to go on top and have it tested some more before going
>> upstream?
>
>
> Hmm. Looks like we didn't get this as fixed up as I thought. Moe seems
> to be fixed (16x NUMA-Q), but elm3b132 is not (it's 4x, flat SMP ia32).
> Look at the latest graphs ....
>
> Is it possible it only got fixed for NUMA boxes?
It should be totally independent.
But anyhow, I can't see the problem. The only numbers that I can see
for 2.6.16-rc1-mm[1|2] (which are the ones with the latest fix) on this
graph are approx. 101 which is much the same as the best of the rest.
Or have I missed something?
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
Peter Williams wrote:
> Martin Bligh wrote:
>
>>
>>> Thanks and looks like the results are in from 2.6.14-rc2-mm1 with the
>>> patch backed out.
>>>
>>> Drumroll....
>>>
>>> http://test.kernel.org/perf/kernbench.moe.png
>>>
>>> The performance goes back to a range similar to 2.6.14-rc1-mm1 (see
>>> 2.6.14-rc2-mm1 + 20328 in blue). Unfortunately this does implicate
>>> this patch. Can we put it back into -mm only and allow Peter's
>>> tweaks/fixes to go on top and have it tested some more before going
>>> upstream?
>>
>>
>>
>> Hmm. Looks like we didn't get this as fixed up as I thought. Moe seems
>> to be fixed (16x NUMA-Q), but elm3b132 is not (it's 4x, flat SMP ia32).
>> Look at the latest graphs ....
>>
>> Is it possible it only got fixed for NUMA boxes?
>
>
> It should be totally independent.
>
> But anyhow, I can't see the problem. The only numbers that I can see
> for 2.6.16-rc1-mm[1|2] (which are the ones with the latest fix) on this
> graph are approx. 101 which is much the same as the best of the rest. Or
> have I missed something?
Oops. I was looking at the graphs for Moe but
<http://test.kernel.org/perf/dbench.elm3b132.png> doesn't appear to be
demonstrating a problem either. Given the fluctuation in the 2.6.16-rc1
results (235, 234, 211, 228.5 and 237.5), the results for 2.6.16-rc1-mm1
(229) and 2.6.16-rc1-mm2 (219) aren't significantly different.
Peter
PS I have a modification for kernbench that calculates and displays the
standard deviations for the various averages if you're interested. This
would enable you to display 95% (say) confidence bars on the graphed
results which in turn makes it easier to spot significant differences.
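For what it's worth, the kind of summary being offered might look roughly
like the sketch below (made-up numbers, not the actual kernbench
modification); with five runs a 95% interval uses Student's t for 4
degrees of freedom rather than 1.96.

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* made-up elapsed times for five kernbench runs */
	double t[] = { 1056.9, 1055.4, 1057.2, 1054.8, 1056.5 };
	int i, n = sizeof(t) / sizeof(t[0]);
	double sum = 0.0, sumsq = 0.0, mean, sd, ci95;

	for (i = 0; i < n; i++) {
		sum += t[i];
		sumsq += t[i] * t[i];
	}
	mean = sum / n;
	/* sample standard deviation */
	sd = sqrt((sumsq - n * mean * mean) / (n - 1));
	/* 2.776 = Student's t, 4 degrees of freedom, 95% two-sided */
	ci95 = 2.776 * sd / sqrt(n);

	printf("Elapsed Time %.2f (%.3f), 95%% CI +/- %.3f\n", mean, sd, ci95);
	return 0;
}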
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
> Oops. I was looking at the graphs for Moe but
> <http://test.kernel.org/perf/dbench.elm3b132.png> doesn't appear to be
> demonstrating a problem either. Given the fluctuation in the
> 2.6.16-rc1 results (235, 234, 211, 228.5 and 237.5), the results for
> 2.6.16-rc1-mm1 (229) and 2.6.16-rc1-mm2 (219) aren't significantly different.
I disagree. Look at the graph. mm results are consistent and stable, and
significantly worse than mainline.
> Peter
> PS I have a modification for kernbench that calculates and displays
> the standard deviations for the various averages if you're
> interested. This would enable you to display 95% (say) confidence
> bars on the graphed results which in turn makes it easier to spot
> significant differences.
Thanks, but I have that. What do you think those vertical bars on the
graph are for? ;-) They're deviation of 5 runs. I throw away the best
and worst first. If it was just random noise, then you'd get the same
variance *between* runs inside the -mm train and inside mainline. You
don't see that ...
Use the visuals in the graph .. it's very telling. -mm is *broken*.
It may well not be the same issue as last time though, I shouldn't
have jumped to that conclusion.
M.
Martin J. Bligh wrote:
>
>> Oops. I was looking at the graphs for Moe but
>> <http://test.kernel.org/perf/dbench.elm3b132.png> doesn't appear to be
>> demonstrating a problem either. Given the fluctuation in the
>> 2.6.16-rc1 results (235, 234, 211, 228.5 and 237.5), the results for
>> 2.6.16-rc1-mm1 (229) and 2.6.16-mm2 (219) aren't significantly different.
>
>
> I disagree. Look at the graph. mm results are consistent and stable, and
> significantly worse than mainline.
>
>> Peter
>> PS I have a modification for kernbench that calculates and displays
>> the standard deviations for the various averages if you're
>> interested. This would enable you to display 95% (say) confidence
>> bars on the graphed results which in turn makes it easier to spot
>> significant differences.
>
>
> Thanks, but I have that. What do you think those vertical bars on the
> graph are for? ;-) They're deviation of 5 runs. I throw away the best
> and worst first.
Not very good scientific practice :-)
> If it was just random noise, then you'd get the same
> variance *between* runs inside the -mm train and inside mainline. You
> don't see that ...
I was still looking at the wrong graph, this time dbench instead of
kernbench :-( (which explains why I didn't see the error bars). Now I see it.
Looking at the other 6 kernbench graphs, I see that it also occurs for
elm3b70 but no others (including elm3b6 and elm3b67). Are there any
differences between the various elm3b systems that could explain this?
> Use the visuals in the graph .. it's very telling. -mm is *broken*.
> It may well not be the same issue as last time though, I shouldn't
> have jumped to that conclusion.
It's very hard to understand how it could be an issue on a system that
doesn't have a lot of abnormally niced (i.e. non zero) tasks that are
fairly active as it's now mathematically equivalent to the original in
the absence of such tasks. Do these two systems have many such tasks
running?
Would it be possible to get a run with the following patches backed out:
+sched-modified-nice-support-for-smp-load-balancing-fix.patch
+sched-modified-nice-support-for-smp-load-balancing-fix-fix.patch
Thanks
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
>> Thanks, but I have that. What do you think those vertical bars on the
>> graph are for? ;-) They're deviation of 5 runs. I throw away the best
>> and worst first.
>
>
> Not very good scientific practice :-)
Bollocks to scientific practice ;-) It works, it produces very stable
results, has done for a few years now. Things like cron or sunspots can
kick in. Yes, yes, I did stats at University ... but the real world
doesn't work like that. The visuals in the graph speak for it.
>
> Looking at the other 6 kernbench graphs, I see that it also occurs for
> elm3b70 but no others (including elm3b6 and elm3b67). Are there any
> differences between the various elm3b systems that could explain this?
>
Yes. They're all completely different architectures - there's a brief
description at the top of the main page. elm3b67 should be ignored, nay
thrown out of the window. It's an unstable POS that randomly loses
processors. I've removed it from the pages.
elm3b70 is PPC64 (8 cpu)
elm3b6 is x86_64.
elm3b132 is a 4x SMP ia32 Pentium 3.
moe is 16x NUMA-Q (ia32).
gekko-lp1 is a 2x PPC blade.
>> Use the visuals in the graph .. it's very telling. -mm is *broken*.
>> It may well not be the same issue as last time though, I shouldn't
>> have jumped to that conclusion.
>
>
> It's very hard to understand how it could be an issue on a system that
> doesn't have a lot of abnormally niced (i.e. non zero) tasks that are
> fairly active as it's now mathematically equivalent to the original in
> the absence of such tasks. Do these two systems have many such tasks
> running?
>
> Would it be possible to get a run with the following patches backed out:
>
> +sched-modified-nice-support-for-smp-load-balancing-fix.patch
> +sched-modified-nice-support-for-smp-load-balancing-fix-fix.patch
Yup, that should prove or disprove it. It's probably something
completely un-scheduler-related ;-)
M.
Martin J. Bligh wrote:
>
>>> Thanks, but I have that. What do you think those vertical bars on the
>>> graph are for? ;-) They're deviation of 5 runs. I throw away the best
>>> and worst first.
>>
>>
>>
>> Not very good scientific practice :-)
>
>
>
> Bollocks to scientific practice ;-) It works, it produces very stable
> results, has done for a few years now. Things like cron or sunspots can
> kick in. Yes, yes, I did stats at University ... but the real world
> doesn't work like that. The visuals in the graph speak for it.
>
>>
>> Looking at the other 6 kernbench graphs, I see that it also occurs for
>> elm3b70 but no others (including elm3b6 and elm3b67). Are there any
>> differences between the various elm3b systems that could explain this?
>>
> Yes. They're all completely different architectures - there's a brief
> description at the top of the main page. elm3b67 should be ignored, nay
> thrown out of the window. It's an unstable POS that randomly loses
> processors. I've removed it from the pages.
>
> elm3b70 is PPC64 (8 cpu)
> elm3b6 is x86_64.
> elm3b132 is a 4x SMP ia32 Pentium 3.
> moe is 16x NUMA-Q (ia32).
> gekko-lp1 is a 2x PPC blade.
>
>>> Use the visuals in the graph .. it's very telling. -mm is *broken*.
>>> It may well not be the same issue as last time though, I shouldn't
>>> have jumped to that conclusion.
>>
>>
>>
>> It's very hard to understand how it could be an issue on a system that
>> doesn't have a lot of abnormally niced (i.e. non zero) tasks that are
>> fairly active as it's now mathematically equivalent to the original in
>> the absence of such tasks. Do these two systems have many such tasks
>> running?
>>
>> Would it be possible to get a run with the following patches backed out:
>>
>> +sched-modified-nice-support-for-smp-load-balancing-fix.patch
>> +sched-modified-nice-support-for-smp-load-balancing-fix-fix.patch
>
>
>
> Yup, that should prove or disprove it. It's probably something
> completely un-scheduler-related ;-)
>
> M.
Looking at the latest results for 2.6.16-rc1-mm3, it appears to me that
this is no longer an issue. Do you agree?
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
>
> Looking at the latest results for 2.6.16-rc1-mm3, it appears to me
> that this is no longer an issue. Do you agree?
>
> Peter
Yes, -mm3 looks OK ... not sure what happened, but as long as it works
... ;-)
M.