2004-06-12 00:30:41

by Con Kolivas

[permalink] [raw]
Subject: [PATCH] Performance regression in 2.6.7-rc3

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all

The OSDL robot monkeys revealed a massive reproducible regression in the
dbt3-pgsql benchmark which could be related to MBligh's measure regression.


2.6.7-rc2: http://khack.osdl.org/stp/293625/
Composite Query Processing Power Throughput Numerical Quantity
199.38 152.52 260.63

vs

2.6.7-rc3: http://khack.osdl.org/stp/293704/
Composite Query Processing Power Throughput Numerical Quantity
152.13 146.36 158.12


with a little bit of detective work and help from Wli we tracked down that
this patch caused it:
[PATCH] sched: improve wakeup-affinity
A massive increase in idle time was observed and the throughput dropped by 40%
Reversing this patch gave these results:

backsched1: http://khack.osdl.org/stp/293865/
Composite Query Processing Power Throughput Numerical Quantity
193.93 145.95 257.67


It may be best to reverse this patch until the regression is better
understood.
Here is a patch reversing it:

Con
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFAyk4SZUg7+tp6mRURAk5RAKCKIxzmRS3u+gs8W5gVyGgvL6glUgCeNggF
8mBUT4HtP4g5d/MViraZ9ds=
=vj+s
-----END PGP SIGNATURE-----


Attachments:
(No filename) (1.15 kB)
backsched1.patch (824.00 B)
Download all attachments

2004-06-12 07:58:33

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] Performance regression in 2.6.7-rc3



Con Kolivas wrote:

>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>Hi all
>
>The OSDL robot monkeys revealed a massive reproducible regression in the
>dbt3-pgsql benchmark which could be related to MBligh's measure regression.
>
>

OK thanks, I'm looking into it.

2004-06-15 04:55:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] Performance regression in 2.6.7-rc3


* Con Kolivas <[email protected]> wrote:

> with a little bit of detective work and help from Wli we tracked down that
> this patch caused it:
> [PATCH] sched: improve wakeup-affinity

> A massive increase in idle time was observed and the throughput
> dropped by 40% Reversing this patch gave these results:

> backsched1: http://khack.osdl.org/stp/293865/
> Composite Query Processing Power Throughput Numerical Quantity
> 193.93 145.95 257.67
>
> It may be best to reverse this patch until the regression is better
> understood.

agreed. It is weird because Nick said that pgsql was tested with the
patch - and we applied the patch based on those good results. Nick?

Anyway, does the patch below fix the pgsql problem? It reverts to the
more agressive idle-balancing variant (which isnt strictly necessary for
the bw_pipe problem).

Ingo

Signed-off-by: Ingo Molnar <[email protected]>

--- linux/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1625,8 +1626,7 @@
return busiest;

out_balanced:
- if (busiest && idle != NOT_IDLE && max_load > SCHED_LOAD_SCALE) {
+ if (busiest && (idle == NEWLY_IDLE ||
+ (idle == IDLE && max_load > SCHED_LOAD_SCALE)) ) {
*imbalance = 1;
return busiest;
}

2004-06-15 13:12:48

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] Performance regression in 2.6.7-rc3

On Tue, 15 Jun 2004 14:56, Ingo Molnar wrote:
> * Con Kolivas <[email protected]> wrote:
> > with a little bit of detective work and help from Wli we tracked down
> > that this patch caused it:
> > [PATCH] sched: improve wakeup-affinity
> >
> > A massive increase in idle time was observed and the throughput
> > dropped by 40% Reversing this patch gave these results:
> >
> > backsched1: http://khack.osdl.org/stp/293865/
> > Composite Query Processing Power Throughput Numerical Quantity
> > 193.93 145.95 257.67
> >
> > It may be best to reverse this patch until the regression is better
> > understood.
>
> agreed. It is weird because Nick said that pgsql was tested with the
> patch - and we applied the patch based on those good results. Nick?
>
> Anyway, does the patch below fix the pgsql problem? It reverts to the
> more agressive idle-balancing variant (which isnt strictly necessary for
> the bw_pipe problem).

Better than the patch backed out but still worse than it was before:
http://khack.osdl.org/stp/293958/
Composite Query Processing Power Throughput Numerical Quantity
161.42 152.90 170.41

Con

2004-06-15 15:04:02

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH] Performance regression in 2.6.7-rc3

Well, I found where at least half my regression on kernel compiles came
from. It's sched_balance_context. Which makes no sense to me ... given
that it seems to switch on CLONE_VM, and otherwise make no changes.
But I went and double-checked my results ... so ... confused.

The patch adds find_idlest_cpu and wake_up_forked_thread, but
otherwise just seems to do:

- if (!(clone_flags & CLONE_STOPPED))
- wake_up_forked_process(p); /* do this last */
- else
+ if (!(clone_flags & CLONE_STOPPED)) {
+ /*
+ * Do the wakeup last. On SMP we treat fork() and
+ * CLONE_VM separately, because fork() has already
+ * created cache footprint on this CPU (due to
+ * copying the pagetables), hence migration would
+ * probably be costy. Threads on the other hand
+ * have less traction to the current CPU, and if
+ * there's an imbalance then the scheduler can
+ * migrate this fresh thread now, before it
+ * accumulates a larger cache footprint:
+ */
+ if (clone_flags & CLONE_VM)
+ wake_up_forked_thread(p);
+ else
+ wake_up_forked_process(p);
+ } else
p->state = TASK_STOPPED;
++total_forks;

How the hell can that have any effect on non-threaded workloads? Perhaps
some part of kernel compile *is* multi-threaded. It does seem to get
called somehow ... from the profile:

129 find_idlest_cpu
83 wake_up_forked_thread

Here's the diffprofile between the two:


5835 4.0% total
1100 27.1% __copy_from_user_ll
627 3.6% do_anonymous_page
363 6.8% page_add_rmap
357 79.5% strnlen_user
338 43.3% finish_task_switch
308 3080.0% flush_signal_handlers
272 7.0% zap_pte_range
239 2.1% page_remove_rmap
230 16.9% free_hot_cold_page
224 9.7% buffered_rmqueue
196 44.9% pte_alloc_one
171 40.4% copy_process
162 450.0% complete
155 9.2% do_no_page
147 19.5% set_page_dirty
133 17.9% clear_page_tables
131 13.8% do_wp_page
129 0.0% find_idlest_cpu
121 3.4% find_trylock_page
...
-113 -7.0% atomic_dec_and_lock
-1062 -2.2% default_idle

Which looks to me just like worse task affinity.

I still think balance on clone is the wrong thing to do by default. On
anything but a benchmark, you have more than one process running on the
system, and you WANT to keep threads of that process on the same node,
not scatter them to the winds.

M.

2004-06-16 02:14:57

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH] Performance regression in 2.6.7-rc3

Martin J. Bligh <[email protected]> wrote:
>
> How the hell can that have any effect on non-threaded workloads? Perhaps
> some part of kernel compile *is* multi-threaded. It does seem to get

make(1) with vfork(2) perhaps?
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2004-06-16 03:04:21

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] Performance regression in 2.6.7-rc3



Ingo Molnar wrote:

>* Con Kolivas <[email protected]> wrote:
>
>
>>with a little bit of detective work and help from Wli we tracked down that
>>this patch caused it:
>>[PATCH] sched: improve wakeup-affinity
>>
>
>>A massive increase in idle time was observed and the throughput
>>dropped by 40% Reversing this patch gave these results:
>>
>
>>backsched1: http://khack.osdl.org/stp/293865/
>>Composite Query Processing Power Throughput Numerical Quantity
>>193.93 145.95 257.67
>>
>>It may be best to reverse this patch until the regression is better
>>understood.
>>
>
>agreed. It is weird because Nick said that pgsql was tested with the
>patch - and we applied the patch based on those good results. Nick?
>
>

Sigh, yes, Mark did run a test for me, but I think it was dbt2-pgsql.
This one is dbt3-pgsql. Also, his system was a 4 logical CPU Xeon.

Strangely enough, Mark's setup was showing a fairly large too-much-idle
regression not long ago, while these 8-ways weren't.

Anyway, Linus has reverted my patch now, which is the right thing to
do. Your sync wakeup change is still in there, so that will hopefully
help bw_pipe scores.

I have some changes which bring throughput up to 240, however I'm not
sure if it would be wise to try to fix everything before 2.6.7.

I'd be happy for 2.6.7 to be released with kernel/sched.c as it is now.
What are your thoughts?

2004-06-16 03:08:24

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] Performance regression in 2.6.7-rc3

Herbert Xu wrote:
> Martin J. Bligh <[email protected]> wrote:
>
>>How the hell can that have any effect on non-threaded workloads? Perhaps
>>some part of kernel compile *is* multi-threaded. It does seem to get
>
>
> make(1) with vfork(2) perhaps?

I think balance on clone probably needs to be turned off by default
presently.

It slows down a simple thread creation test by a factor of 7(!) here,
and has quite a few not too difficult to imagine performance problems.

2004-06-16 03:12:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Performance regression in 2.6.7-rc3



On Wed, 16 Jun 2004, Herbert Xu wrote:
>
> Martin J. Bligh <[email protected]> wrote:
> >
> > How the hell can that have any effect on non-threaded workloads? Perhaps
> > some part of kernel compile *is* multi-threaded. It does seem to get
>
> make(1) with vfork(2) perhaps?

Very likely. And in the vfork() case it is definitely WRONG to try to
reschedule (either threads _or_ processes), since the parent is going to
go to sleep real soon now.

I think this code:

if (clone_flags & CLONE_VM)
wake_up_forked_thread(p);
else
wake_up_forked_process(p);

is just wrong, and it should be replaced with

wake_up_new_process(p, clone_flags);

and then "wake_up_new_process()" can do the right thing, which is
basically:

if (clone_flags & CLONE_VFORK)
synchronous wakeup, same as pipe-will-block case
else if (clone_flags & CLONE_VM)
thread-wakeup-case
else
process-wakeup-case

No?

Linus

2004-06-16 03:14:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Performance regression in 2.6.7-rc3



On Wed, 16 Jun 2004, Nick Piggin wrote:
>
> I think balance on clone probably needs to be turned off by default
> presently.
>
> It slows down a simple thread creation test by a factor of 7(!) here,
> and has quite a few not too difficult to imagine performance problems.

I agree. However, I still think we should do my suggested
"wake_up_new(p,clone_flags)" thing, and then have the logic on whether to
try to care about threading or not be in schedule.c, not in kernel/fork.c.

The fact is, fork.c shouldn't try to make scheduling decisions. But it
could inform the scheduler about the new process, and THAT can then make
the decisions.

Linus

2004-06-16 03:19:34

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] Performance regression in 2.6.7-rc3

Linus Torvalds wrote:
>
> On Wed, 16 Jun 2004, Nick Piggin wrote:
>
>>I think balance on clone probably needs to be turned off by default
>>presently.
>>
>>It slows down a simple thread creation test by a factor of 7(!) here,
>>and has quite a few not too difficult to imagine performance problems.
>
>
> I agree. However, I still think we should do my suggested
> "wake_up_new(p,clone_flags)" thing, and then have the logic on whether to
> try to care about threading or not be in schedule.c, not in kernel/fork.c.
>
> The fact is, fork.c shouldn't try to make scheduling decisions. But it
> could inform the scheduler about the new process, and THAT can then make
> the decisions.
>

I agree, it is a fine suggestion. Would be a trivial change, and
would clean things up nicely.

2004-06-16 03:44:29

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] Performance regression in 2.6.7-rc3

Quoting Nick Piggin <[email protected]>:
> Ingo Molnar wrote:
> >* Con Kolivas <[email protected]> wrote:
> >>with a little bit of detective work and help from Wli we tracked down that
>
> >>this patch caused it:
> >>[PATCH] sched: improve wakeup-affinity
> >>
> >
> >>A massive increase in idle time was observed and the throughput
> >>dropped by 40% Reversing this patch gave these results:
> >>
> >
> >>backsched1: http://khack.osdl.org/stp/293865/
> >>Composite Query Processing Power Throughput Numerical Quantity
> >>193.93 145.95 257.67
> >>
> >>It may be best to reverse this patch until the regression is better
> >>understood.
> >agreed. It is weird because Nick said that pgsql was tested with the
> >patch - and we applied the patch based on those good results. Nick?
> Sigh, yes, Mark did run a test for me, but I think it was dbt2-pgsql.
> This one is dbt3-pgsql. Also, his system was a 4 logical CPU Xeon.
>
> Strangely enough, Mark's setup was showing a fairly large too-much-idle
> regression not long ago, while these 8-ways weren't.
>
> Anyway, Linus has reverted my patch now, which is the right thing to
> do. Your sync wakeup change is still in there, so that will hopefully
> help bw_pipe scores.

The sync wakeup change was worth about 1% detriment to this benchmark which, if
offset by significant performance gains elsewhere, is worth tolerating.

Con

2004-06-16 07:32:14

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] Performance regression in 2.6.7-rc3


* Linus Torvalds <[email protected]> wrote:

> I agree. However, I still think we should do my suggested
> "wake_up_new(p,clone_flags)" thing, and then have the logic on whether
> to try to care about threading or not be in schedule.c, not in
> kernel/fork.c.
>
> The fact is, fork.c shouldn't try to make scheduling decisions. But it
> could inform the scheduler about the new process, and THAT can then
> make the decisions.

agreed, and i did it in a similar way initially (by adding the clone
flags to wake_up_process()) but went for the smaller patch. The only
reason i pushed it into fork.c initially was to avoid having to change
dozens of other files (most of them in various architectures) that use
wake_up_process(). It wasnt (and still isnt) clear at all whether we
want to do any fork/clone-time balancing.

Ingo

2004-06-16 14:54:56

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH] Performance regression in 2.6.7-rc3

>> > How the hell can that have any effect on non-threaded workloads? Perhaps
>> > some part of kernel compile *is* multi-threaded. It does seem to get
>>
>> make(1) with vfork(2) perhaps?
>
> Very likely. And in the vfork() case it is definitely WRONG to try to
> reschedule (either threads _or_ processes), since the parent is going to
> go to sleep real soon now.
>
> I think this code:
>
> if (clone_flags & CLONE_VM)
> wake_up_forked_thread(p);
> else
> wake_up_forked_process(p);
>
> is just wrong, and it should be replaced with
>
> wake_up_new_process(p, clone_flags);
>
> and then "wake_up_new_process()" can do the right thing, which is
> basically:
>
> if (clone_flags & CLONE_VFORK)
> synchronous wakeup, same as pipe-will-block case
> else if (clone_flags & CLONE_VM)
> thread-wakeup-case
> else
> process-wakeup-case
>
> No?

Looks much better ... but I'd still dispute whether we need to throw
non-vfork threads cross node by default. I'd suggest that's disabled
by default, and is either enabled by a global userspace option, or a
per-process one (or the option of both). Most thing (except benchmarks)
simply don't want this in real life ...

M.