2003-08-04 16:06:00

by Cliff White

Subject: 2.6.0-test2-mm3 osdl-aim-7 regression


I see 2.6.0-test2-mm4 is already in our queue, so this may be
Old News. (serves me right for taking a weekend off)
Performance of -mm3 falls off on the 4-cpu machines.

2-cpu systems
Kernel              JPM
2.6.0-test2-mm3     1313.53
linux-2.6.0-test2   1320.68 (+0.54%)

4-cpu systems
2.6.0-test2-mm3     4824.96
linux-2.6.0-test2   5381.20 (+11.53%)

Full details at
http://developer.osdl.org/cliffw/reaim/index.html
code at
bk://developer.osdl.org/osdl-aim-7

cliffw



2003-08-06 05:22:15

by Andrew Morton

Subject: Re: 2.6.0-test2-mm3 osdl-aim-7 regression

Cliff White <[email protected]> wrote:
>
>
> I see 2.6.0-test2-mm4 is already in our queue, so this may be
> Old News. (serves me right for taking a weekend off)
> Performance of -mm3 falls off on the 4-cpu machines.
>
> 2-cpu systems
> Kernel              JPM
> 2.6.0-test2-mm3     1313.53
> linux-2.6.0-test2   1320.68 (+0.54%)
>
> 4-cpu systems
> 2.6.0-test2-mm3     4824.96
> linux-2.6.0-test2   5381.20 (+11.53%)
>
> Full details at
> http://developer.osdl.org/cliffw/reaim/index.html
> code at
> bk://developer.osdl.org/osdl-aim-7
>

OK, I can reproduce this on a 4-way.

Binary searching (insert gratuitous rant about benchmarks that take more
than two minutes to complete) reveals that the slowdown is due to
sched-2.6.0-test2-mm2-A3.

So mm4 with everything up to but not including sched-2.6.0-test2-mm2-A3:

Max Jobs per Minute 1467.06
Max Jobs per Minute 1478.82
Max Jobs per Minute 1473.36

3853.55s user 264.31s system 370% cpu 18:31.95 total

After adding sched-2.6.0-test2-mm2-A3:

Max Jobs per Minute 1375.63
Max Jobs per Minute 1278.40
Max Jobs per Minute 1293.11

4416.70s user 275.61s system 374% cpu 20:53.58 total

A 10% regression there, mainly user time.


The test is:

- build bk://developer.osdl.org/osdl-aim-7

- cd src

- time ./reaim -s4 -q -t -i4 -f./workfile.new_dbase -r3 -b -l./reaim.config



2003-08-06 19:10:43

by Cliff White

Subject: Re: 2.6.0-test2-mm3 osdl-aim-7 regression

> Cliff White <[email protected]> wrote:
> >
> >
> > I see 2.6.0-test2-mm4 is already in our queue, so this may be
> > Old News. (serves me right for taking a weekend off)
> > Performance of -mm3 falls off on the 4-cpu machines.
> >
> > 2-cpu systems
> > Kernel              JPM
> > 2.6.0-test2-mm3     1313.53
> > linux-2.6.0-test2   1320.68 (+0.54%)
> >
> > 4-cpu systems
> > 2.6.0-test2-mm3     4824.96
> > linux-2.6.0-test2   5381.20 (+11.53%)
> >
> > Full details at
> > http://developer.osdl.org/cliffw/reaim/index.html
> > code at
> > bk://developer.osdl.org/osdl-aim-7
> >
>
> OK, I can reproduce this on a 4-way.
>
> Binary searching (insert gratuitous rant about benchmarks that take more
> than two minutes to complete) reveals that the slowdown is due to
> sched-2.6.0-test2-mm2-A3.

[Rant response]
For a short test run, you can run a small number of iterations like this:
./reaim -s2 -e10 -i2 -f./workfile.new_dbase

(2->10 users, increment by 2)

That takes about 5 minutes on our 4-way.

Or, run one iteration with a large user count:

./reaim -s25 -e25 -f ./workfile.foo

cliffw


>
> So mm4 with everything up to but not including sched-2.6.0-test2-mm2-A3:
>
> Max Jobs per Minute 1467.06
> Max Jobs per Minute 1478.82
> Max Jobs per Minute 1473.36
>
> 3853.55s user 264.31s system 370% cpu 18:31.95 total
>
> After adding sched-2.6.0-test2-mm2-A3:
>
> Max Jobs per Minute 1375.63
> Max Jobs per Minute 1278.40
> Max Jobs per Minute 1293.11
>
> 4416.70s user 275.61s system 374% cpu 20:53.58 total
>
> A 10% regression there, mainly user time.
>
>
> The test is:
>
> - build bk://developer.osdl.org/osdl-aim-7
>
> - cd src
>
> - time ./reaim -s4 -q -t -i4 -f./workfile.new_dbase -r3 -b -l./reaim.config
>
>
>


2003-08-07 02:35:43

by Con Kolivas

Subject: Re: 2.6.0-test2-mm3 osdl-aim-7 regression

On Thu, 7 Aug 2003 05:10, Cliff White wrote:
> > Binary searching (insert gratuitous rant about benchmarks that take more
> > than two minutes to complete) reveals that the slowdown is due to
> > sched-2.6.0-test2-mm2-A3.

This is most likely the round-robinning of tasks every 25ms. I doubt the extra
overhead of nanosecond timing could make a difference of that size (but I
could be wrong). There is some tweaking of this round-robinning in my code
which may help, but I don't believe it will bring performance all the way back
to the original. Two things to try: first, add my patches up to O12.3int to
see how much (if at all!) it helps; second, change TIMESLICE_GRANULARITY in
sched.c to (MAX_TIMESLICE), which basically disables the round-robinning
completely. If there is still a drop in performance after that, the remainder
is the extra locking/overhead of nanosecond timing.
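
For concreteness, the granularity change is a one-line edit in kernel/sched.c.
A sketch, assuming plain-constant definitions like those around the A3 patch
(the exact values there may differ):

/*
 * Sketch of the suggested edit in kernel/sched.c (exact values in the
 * A3 patch may differ). scheduler_tick() requeues a running task each
 * time it has consumed TIMESLICE_GRANULARITY worth of its slice, so
 * defining the granularity as MAX_TIMESLICE means a slice is never
 * split, i.e. the round-robinning is effectively disabled.
 */
#define MIN_TIMESLICE		( 10 * HZ / 1000)
#define MAX_TIMESLICE		(200 * HZ / 1000)

/* was: #define TIMESLICE_GRANULARITY	( 25 * HZ / 1000) */
#define TIMESLICE_GRANULARITY	(MAX_TIMESLICE)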

Con

2003-08-07 05:12:11

by Nick Piggin

Subject: Re: 2.6.0-test2-mm3 osdl-aim-7 regression

Con Kolivas wrote:

>On Thu, 7 Aug 2003 05:10, Cliff White wrote:
>
>>>Binary searching (insert gratuitous rant about benchmarks that take more
>>>than two minutes to complete) reveals that the slowdown is due to
>>>sched-2.6.0-test2-mm2-A3.
>>>
>
>This is most likely the round-robinning of tasks every 25ms. I doubt the extra
>overhead of nanosecond timing could make a difference of that size (but I
>could be wrong). There is some tweaking of this round-robinning in my code
>which may help, but I don't believe it will bring performance all the way back
>to the original. Two things to try: first, add my patches up to O12.3int to
>see how much (if at all!) it helps; second, change TIMESLICE_GRANULARITY in
>sched.c to (MAX_TIMESLICE), which basically disables the round-robinning
>completely. If there is still a drop in performance after that, the remainder
>is the extra locking/overhead of nanosecond timing.
>
>
What is the need for this round-robinning? Don't processes get a calculated
timeslice anyway?


2003-08-07 05:37:13

by Con Kolivas

Subject: Re: 2.6.0-test2-mm3 osdl-aim-7 regression

On Thu, 7 Aug 2003 15:11, Nick Piggin wrote:
> What is the need for this round-robinning? Don't processes get a calculated
> timeslice anyway?

Nice to see you taking an unhealthy interest in the scheduler tweaks, Nick.
This issue has been discussed before but it never hurts to review things.
I've uncc'ed the rest of the people in case we get carried away again. First
let me show you Ingo's comment in the relevant code section:

* Prevent a too long timeslice allowing a task to monopolize
* the CPU. We do this by splitting up the timeslice into
* smaller pieces.
*
* Note: this does not mean the task's timeslices expire or
* get lost in any way, they just might be preempted by
* another task of equal priority. (one with higher
* priority would have preempted this task already.) We
* requeue this task to the end of the list on this priority
* level, which is in essence a round-robin of tasks with
* equal priority.

I was gonna go on to a "second" point, but I think the first paragraph of
that comment explains the issue.

Must we do this? No.

Should we? Probably.

How frequently should we do it? Once again I'll quote Ingo, who said it's a
difficult question to answer.

The more frequently you round-robin, the lower the scheduler latency between
SCHED_OTHER tasks of the same priority. However, the longer the timeslice, the
more benefit you get from the CPU cache. Where is the sweet spot? It depends
on the hardware and your usage requirements, of course, but Ingo empirically
chose 25ms after 50ms seemed too long. Basically, cache thrashing becomes a
real problem with timeslices below ~7ms on modern hardware in my limited
testing. A minor quirk in Ingo's original code means _occasionally_ a task
will be requeued with <3ms to go. It will be interesting to see whether fixing
this (which O12.2+ does) makes a big difference, or whether we need to
reconsider how frequently (if at all) we round-robin tasks.
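
To make the mechanism concrete, here is roughly what the splitting in
scheduler_tick() looks like. A simplified sketch built on the 2.6 scheduler's
helpers (task_timeslice(), dequeue_task(), enqueue_task(), effective_prio()),
not the exact A3 code:

/*
 * Simplified sketch of the timeslice splitting in scheduler_tick()
 * (not the exact A3 code). Each time the running task has consumed
 * another TIMESLICE_GRANULARITY of its slice, it is moved to the back
 * of its priority level so an equal-priority task can run. Its
 * remaining timeslice is preserved; nothing expires or gets lost.
 */
if (!((task_timeslice(p) - p->time_slice) % TIMESLICE_GRANULARITY) &&
    (p->time_slice >= TIMESLICE_GRANULARITY) &&
    (p->array == rq->active)) {
	dequeue_task(p, rq->active);
	set_tsk_need_resched(p);	/* preempt at the next opportunity */
	p->prio = effective_prio(p);	/* refresh the dynamic priority */
	enqueue_task(p, rq->active);	/* back of the same priority list */
}

The (p->time_slice >= TIMESLICE_GRANULARITY) test is the kind of check that
avoids requeueing a task with only a sliver of its slice left, i.e. the <3ms
quirk mentioned above.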

Con

2003-08-07 08:26:12

by Nick Piggin

Subject: Re: 2.6.0-test2-mm3 osdl-aim-7 regression



Con Kolivas wrote:

>On Thu, 7 Aug 2003 15:11, Nick Piggin wrote:
>
>>What is the need for this round-robinning? Don't processes get a calculated
>>timeslice anyway?
>>
>
>Nice to see you taking an unhealthy interest in the scheduler tweaks, Nick.
>This issue has been discussed before but it never hurts to review things.
>I've uncc'ed the rest of the people in case we get carried away again. First
>let me show you Ingo's comment in the relevant code section:
>
> * Prevent a too long timeslice allowing a task to monopolize
> * the CPU. We do this by splitting up the timeslice into
> * smaller pieces.
> *
> * Note: this does not mean the task's timeslices expire or
> * get lost in any way, they just might be preempted by
> * another task of equal priority. (one with higher
> * priority would have preempted this task already.) We
> * requeue this task to the end of the list on this priority
> * level, which is in essence a round-robin of tasks with
> * equal priority.
>
>I was gonna go on to a "second" point, but I think the first paragraph of
>that comment explains the issue.
>
>Must we do this? No.
>
>Should we? Probably.
>
>How frequently should we do it? Once again I'll quote Ingo, who said it's a
>difficult question to answer.
>

OK, I was just thinking it should get done automatically by virtue
of the regular timeslice allocation, dynamic priorities, etc.

It just sounds like another workaround for the scheduler's inability to
properly manage priorities and the large range of timeslice lengths.

>
>
>The more frequently you round-robin, the lower the scheduler latency between
>SCHED_OTHER tasks of the same priority. However, the longer the timeslice, the
>more benefit you get from the CPU cache. Where is the sweet spot? It depends
>on the hardware and your usage requirements, of course, but Ingo empirically
>chose 25ms after 50ms seemed too long. Basically, cache thrashing becomes a
>real problem with timeslices below ~7ms on modern hardware in my limited
>testing. A minor quirk in Ingo's original code means _occasionally_ a task
>will be requeued with <3ms to go. It will be interesting to see whether fixing
>this (which O12.2+ does) makes a big difference, or whether we need to
>reconsider how frequently (if at all) we round-robin tasks.
>

Why not make it dynamic? CPU hogs get longer timeslices (but of course
can be preempted by higher-priority tasks).
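
A hypothetical sketch of what that could look like (not code from any posted
patch), using the 2.6 scheduler's sleep_avg as the estimate of how CPU-bound
a task is:

/*
 * Hypothetical sketch (not from any posted patch): scale the requeue
 * granularity with how CPU-bound the task looks, using sleep_avg as
 * the interactivity estimate. A pure CPU hog (sleep_avg near 0) keeps
 * its whole slice for cache benefit; an interactive task would get
 * round-robinned more often for lower latency.
 */
static unsigned int task_granularity(task_t *p)
{
	return MIN_TIMESLICE + (MAX_TIMESLICE - MIN_TIMESLICE) *
		(MAX_SLEEP_AVG - p->sleep_avg) / MAX_SLEEP_AVG;
}

Higher-priority tasks would still preempt as usual; only the requeue interval
among equal-priority tasks would stretch for CPU hogs.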


2003-08-07 09:56:01

by Con Kolivas

Subject: Re: 2.6.0-test2-mm3 osdl-aim-7 regression

On Thu, 7 Aug 2003 18:25, Nick Piggin wrote:
> > The more frequently you round-robin, the lower the scheduler latency
> > between SCHED_OTHER tasks of the same priority. However, the longer the
> > timeslice, the more benefit you get from the CPU cache. Where is the
> > sweet spot? It depends on the hardware and your usage requirements, of
> > course, but Ingo empirically chose 25ms after 50ms seemed too long.
> > Basically, cache thrashing becomes a real problem with timeslices below
> > ~7ms on modern hardware in my limited testing. A minor quirk in Ingo's
> > original code means _occasionally_ a task will be requeued with <3ms to
> > go. It will be interesting to see whether fixing this (which O12.2+
> > does) makes a big difference, or whether we need to reconsider how
> > frequently (if at all) we round-robin tasks.
>
> Why not make it dynamic? CPU hogs get longer timeslices (but of course
> can be preempted by higher-priority tasks).

Funny you should say that. Before Ingo merged his A3 changes, that's what my
version of them did.

Con

2003-08-07 10:07:05

by Nick Piggin

Subject: Re: 2.6.0-test2-mm3 osdl-aim-7 regression



Con Kolivas wrote:

>On Thu, 7 Aug 2003 18:25, Nick Piggin wrote:
>
>>>The more frequently you round-robin, the lower the scheduler latency
>>>between SCHED_OTHER tasks of the same priority. However, the longer the
>>>timeslice, the more benefit you get from the CPU cache. Where is the sweet
>>>spot? It depends on the hardware and your usage requirements, of course,
>>>but Ingo empirically chose 25ms after 50ms seemed too long. Basically,
>>>cache thrashing becomes a real problem with timeslices below ~7ms on
>>>modern hardware in my limited testing. A minor quirk in Ingo's original
>>>code means _occasionally_ a task will be requeued with <3ms to go. It will
>>>be interesting to see whether fixing this (which O12.2+ does) makes a big
>>>difference, or whether we need to reconsider how frequently (if at all) we
>>>round-robin tasks.
>>>
>>Why not make it dynamic? CPU hogs get longer timeslices (but of course
>>can be preempted by higher-priority tasks).
>>
>
>Funny you should say that. Before Ingo merged his A3 changes, that's what my
>version of them did.
>
>

Between you and me, I think this would be the right way to go if it
could be done right. I don't think wli, mjb and the rest of their
clique appreciate the 25ms reschedule!


2003-08-08 20:58:40

by Cliff White

Subject: Re: 2.6.0-test2-mm3 osdl-aim-7 regression

> On Thu, 7 Aug 2003 05:10, Cliff White wrote:
> > > Binary searching (insert gratuitous rant about benchmarks that take more
> > > than two minutes to complete) reveals that the slowdown is due to
> > > sched-2.6.0-test2-mm2-A3.
>
> This is most likely the round-robinning of tasks every 25ms. I doubt the extra
> overhead of nanosecond timing could make a difference of that size (but I
> could be wrong). There is some tweaking of this round-robinning in my code
> which may help, but I don't believe it will bring performance all the way back
> to the original. Two things to try: first, add my patches up to O12.3int to
> see how much (if at all!) it helps; second, change TIMESLICE_GRANULARITY in
> sched.c to (MAX_TIMESLICE), which basically disables the round-robinning
> completely. If there is still a drop in performance after that, the remainder
> is the extra locking/overhead of nanosecond timing.
>
> Con
>
Added your patches to PLM, from your web site. We've had other issues slowing
down the 4-cpu queue, but the 2-cpu tests ran. On these smaller platforms
we're not seeing a big difference between the patches.

STP id  PLM#  Kernel Name          Workfile   MaxJPM   MaxUser  Host      %Change
277231  2042  CK-O13-O13.1int-1    new_dbase  1333.60  22       stp2-002   0.00
277230  2041  CK-O12.3-O13int-1    new_dbase  1344.23  24       stp2-003   0.80
277228  2040  CK-O12.2-O12.3int-1  new_dbase  1328.86  22       stp2-002  -0.36
All are a bit better than stock:
276572  2020  linux-2.6.0-test2    new_dbase  1320.68  22       stp2-000  -0.96
----
Code location:
bk://developer.osdl.org/osdl-aim-7
More results:
http://developer.osdl.org/cliffw/reaim/index.html

Run parameters:

./reaim -s2 -x -t -i2 -f workfile.new_dbase -r3 -b -l./stp.config

cliffw