2005-12-27 18:08:34

by Paolo Ornati

Subject: [SCHED] Totally WRONG priority calculation with specific test-case (since 2.6.10-bk12)

Hello,

I've found an easy-to-reproduce-for-me test case that shows a totally
wrong priority calculation: basically, a CPU-intensive process gets a
better priority than a disk-intensive one (dd if=bigfile
of=/dev/null ...).

Seems impossible, doesn't it?

---- THE NUMBERS with 2.6.15-rc7 -----

The test-case is the Xvid encoding of a dvd-ripped track with transcode
(using the "dvd::rip" interface). The copied-and-pasted command line is
this:

mkdir -m 0775 -p '/home/paolo/tmp/test/tmp' &&
cd /home/paolo/tmp/test/tmp && dr_exec transcode -H 10 -a 2 -x vob,null
-i /home/paolo/tmp/test/vob/003 -w 1198,50 -b 128,0,0 -s 1.972
--a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 1 -y xvid4,null
-o /dev/null --print_status 20 && echo DVDRIP_SUCCESS mkdir -m 0775 -p
'/home/paolo/tmp/test/tmp' && cd /home/paolo/tmp/test/tmp && dr_exec
transcode -H 10 -a 2 -x vob -i /home/paolo/tmp/test/vob/003 -w 1198,50
-b 128,0,0 -s 1.972 --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 2 -y
xvid4 -o /home/paolo/tmp/test/avi/003/test-003.avi --print_status 20 &&
echo DVDRIP_SUCCESS


Here is a TOP snapshot taken while running it:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5721 paolo 16 0 115m 18m 2428 R 84.4 3.7 0:15.11 transcode
5736 paolo 25 0 50352 4516 1912 R 8.4 0.9 0:01.53 tcdecode
5725 paolo 15 0 115m 18m 2428 S 4.6 3.7 0:00.84 transcode
5738 paolo 18 0 115m 18m 2428 S 0.8 3.7 0:00.15 transcode
5734 paolo 25 0 20356 1140 920 S 0.6 0.2 0:00.12 tcdemux
5731 paolo 25 0 47312 2540 1996 R 0.4 0.5 0:00.08 tcdecode
5319 root 15 0 166m 16m 2584 S 0.2 3.2 0:25.06 X
5444 paolo 16 0 87116 22m 15m R 0.2 4.6 0:04.05 konsole
5716 paolo 16 0 10424 1160 876 R 0.2 0.2 0:00.06 top
5735 paolo 25 0 22364 1436 932 S 0.2 0.3 0:00.01 tcextract


DD running alone:

paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
128+0 records in
128+0 records out

real 0m4.052s
user 0m0.000s
sys 0m0.209s

DD while transcoding:

paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
128+0 records in
128+0 records out

real 0m26.121s
user 0m0.001s
sys 0m0.255s

---------------------------------------

I've tried older kernels and found that 2.6.11 is the first one affected.

Going on with testing...

2.6.11-rc[1-5]:
2.6.11-rc3 bad
2.6.11-rc1 bad

2.6.10-bk[1-14]
2.6.10-bk7 good
2.6.10-bk11 good
2.6.10-bk13 bad
2.6.10-bk12 bad

So the problem was introduced with:
>> 2.6.10-bk12 09-Jan-2005 <<

The exact behaviour differs between 2.6.11/12/13/14... for example:
with 2.6.11 the priority of "transcode" is initially set to ~25 and goes
down to 17/18 when DD is running.

The problem doesn't seem 100% reproducible with every kernel; sometimes
a "BAD" kernel looks "GOOD"... or maybe I was just confused by too
much compile/install/reboot/test work ;)

Other INFO:
- I'm on x86_64
- preemption ON/OFF doesn't make any difference


Can anyone reproduce this?
IOW: is this affecting only my machine?

--
Paolo Ornati
Linux 2.6.15-rc7-gf89f5948 on x86_64


2005-12-27 21:47:27

by Paolo Ornati

Subject: Re: [SCHED] Totally WRONG priority calculation with specific test-case (since 2.6.10-bk12)

On Tue, 27 Dec 2005 19:09:18 +0100
Paolo Ornati <[email protected]> wrote:

> Hello,
>
> I've found an easy-to-reproduce-for-me test case that shows a totally
> wrong priority calculation: basically a CPU-intensitive process gets
> better priority than a disk-intensitive one (dd if=bigfile
> of=/dev/null ...).
>
> Seems impossible, isn't it?
>
> ---- THE NUMBERS with 2.6.15-rc7 -----
>
> The test-case is the Xvid encoding of dvd-ripped track with transcode
> (using "dvd::rip" interface). The copied-and-pasted command line is
> this:
>
> mkdir -m 0775 -p '/home/paolo/tmp/test/tmp' &&
> cd /home/paolo/tmp/test/tmp && dr_exec transcode -H 10 -a 2 -x vob,null
> -i /home/paolo/tmp/test/vob/003 -w 1198,50 -b 128,0,0 -s 1.972
> --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 1 -y xvid4,null
> -o /dev/null --print_status 20 && echo DVDRIP_SUCCESS mkdir -m 0775 -p
> '/home/paolo/tmp/test/tmp' && cd /home/paolo/tmp/test/tmp && dr_exec
> transcode -H 10 -a 2 -x vob -i /home/paolo/tmp/test/vob/003 -w 1198,50
> -b 128,0,0 -s 1.972 --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 2 -y
> xvid4 -o /home/paolo/tmp/test/avi/003/test-003.avi --print_status 20 &&
> echo DVDRIP_SUCCESS
>
>
> Here there is a TOP snapshot while running it:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5721 paolo 16 0 115m 18m 2428 R 84.4 3.7 0:15.11 transcode
> 5736 paolo 25 0 50352 4516 1912 R 8.4 0.9 0:01.53 tcdecode
> 5725 paolo 15 0 115m 18m 2428 S 4.6 3.7 0:00.84 transcode
> 5738 paolo 18 0 115m 18m 2428 S 0.8 3.7 0:00.15 transcode
> 5734 paolo 25 0 20356 1140 920 S 0.6 0.2 0:00.12 tcdemux
> 5731 paolo 25 0 47312 2540 1996 R 0.4 0.5 0:00.08 tcdecode
> 5319 root 15 0 166m 16m 2584 S 0.2 3.2 0:25.06 X
> 5444 paolo 16 0 87116 22m 15m R 0.2 4.6 0:04.05 konsole
> 5716 paolo 16 0 10424 1160 876 R 0.2 0.2 0:00.06 top
> 5735 paolo 25 0 22364 1436 932 S 0.2 0.3 0:00.01 tcextract
>
>
> DD running alone:
>
> paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
> 128+0 records in
> 128+0 records out
>
> real 0m4.052s
> user 0m0.000s
> sys 0m0.209s
>
> DD while transcoding:
>
> paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
> 128+0 records in
> 128+0 records out
>
> real 0m26.121s
> user 0m0.001s
> sys 0m0.255s
>
> ---------------------------------------
>
> I've tried older kernels finding that 2.6.11 is the first affected.
>
> Going on with testing...
>
> 2.6.11-rc[1-5]:
> 2.6.11-rc3 bad
> 2.6.11-rc1 bad
>
> 2.6.10-bk[1-14]
> 2.6.10-bk7 good
> 2.6.10-bk11 good
> 2.6.10-bk13 bad
> 2.6.10-bk12 bad
>
> So the problem was introduced with:
> >> 2.6.10-bk12 09-Jan-2005 <<
>
> The exact behaviour is different with 2.6.11/12/13/14... for example:
> with 2.6.11 the priority of "transcode" is initially set to ~25 and go
> down to 17/18 when running DD.
>
> The problem doesn't seem 100% reproducible with every kernel, sometimes
> a "BAD" kernel looks "GOOD"... or maybe it was me confused by too
> much compile/install/reboot/test work ;)
>
> Other INFO:
> - I'm on x86_64
> - preemption ON/OFF doesn't make any differences
>
>
> Can anyone reproduce this?
> IOW: is this affecting only my machine?
>

Hello Con and Ingo... I've found that the above problem goes away
by reverting this:

http://linux.bkbits.net:8080/linux-2.6/cset@41e054c6pwNQXzErMxvfh4IpLPXA5A?nav=index.html|src/|src/include|src/include/linux|related/include/linux/sched.h

--------------------------------------------------

[PATCH] sched: remove_interactive_credit

Special casing tasks by interactive credit was helpful for preventing fully
cpu bound tasks from easily rising to interactive status.

However it did not select out tasks that had periods of being fully cpu
bound and then sleeping while waiting on pipes, signals etc. This led to a
more disproportionate share of cpu time.

Backing this out will no longer special case only fully cpu bound tasks,
and prevents the variable behaviour that occurs at startup before tasks
declare themseleves interactive or not, and speeds up application startup
slightly under certain circumstances. It does cost in interactivity
slightly as load rises but it is worth it for the fairness gains.

Signed-off-by: Con Kolivas <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

--------------------------------------------------


Maybe this change has revealed a scheduler weakness?

I'm glad to test any patch or give more data :)

Bye,

--
Paolo Ornati
Linux 2.6.10-bk12-int_credit on x86_64

2005-12-27 23:27:18

by Con Kolivas

Subject: Re: [SCHED] Totally WRONG priority calculation with specific test-case (since 2.6.10-bk12)

On Wednesday 28 December 2005 08:48, Paolo Ornati wrote:
> On Tue, 27 Dec 2005 19:09:18 +0100
>
> Paolo Ornati <[email protected]> wrote:
> > Hello,
> >
> > I've found an easy-to-reproduce-for-me test case that shows a totally
> > wrong priority calculation: basically a CPU-intensitive process gets
> > better priority than a disk-intensitive one (dd if=bigfile
> > of=/dev/null ...).
> >
> > Seems impossible, isn't it?
> >
> > ---- THE NUMBERS with 2.6.15-rc7 -----
> >
> > The test-case is the Xvid encoding of dvd-ripped track with transcode
> > (using "dvd::rip" interface). The copied-and-pasted command line is
> > this:
> >
> > mkdir -m 0775 -p '/home/paolo/tmp/test/tmp' &&
> > cd /home/paolo/tmp/test/tmp && dr_exec transcode -H 10 -a 2 -x vob,null
> > -i /home/paolo/tmp/test/vob/003 -w 1198,50 -b 128,0,0 -s 1.972
> > --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 1 -y xvid4,null
> > -o /dev/null --print_status 20 && echo DVDRIP_SUCCESS mkdir -m 0775 -p
> > '/home/paolo/tmp/test/tmp' && cd /home/paolo/tmp/test/tmp && dr_exec
> > transcode -H 10 -a 2 -x vob -i /home/paolo/tmp/test/vob/003 -w 1198,50
> > -b 128,0,0 -s 1.972 --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 2 -y
> > xvid4 -o /home/paolo/tmp/test/avi/003/test-003.avi --print_status 20 &&
> > echo DVDRIP_SUCCESS
> >
> >
> > Here there is a TOP snapshot while running it:
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 5721 paolo 16 0 115m 18m 2428 R 84.4 3.7 0:15.11 transcode
> > 5736 paolo 25 0 50352 4516 1912 R 8.4 0.9 0:01.53 tcdecode
> > 5725 paolo 15 0 115m 18m 2428 S 4.6 3.7 0:00.84 transcode
> > 5738 paolo 18 0 115m 18m 2428 S 0.8 3.7 0:00.15 transcode
> > 5734 paolo 25 0 20356 1140 920 S 0.6 0.2 0:00.12 tcdemux
> > 5731 paolo 25 0 47312 2540 1996 R 0.4 0.5 0:00.08 tcdecode
> > 5319 root 15 0 166m 16m 2584 S 0.2 3.2 0:25.06 X
> > 5444 paolo 16 0 87116 22m 15m R 0.2 4.6 0:04.05 konsole
> > 5716 paolo 16 0 10424 1160 876 R 0.2 0.2 0:00.06 top
> > 5735 paolo 25 0 22364 1436 932 S 0.2 0.3 0:00.01 tcextract
> >
> >
> > DD running alone:
> >
> > paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null
> > bs=1M count=128; umount space/ 128+0 records in
> > 128+0 records out
> >
> > real 0m4.052s
> > user 0m0.000s
> > sys 0m0.209s
> >
> > DD while transcoding:
> >
> > paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null
> > bs=1M count=128; umount space/ 128+0 records in
> > 128+0 records out
> >
> > real 0m26.121s
> > user 0m0.001s
> > sys 0m0.255s

Looking at your top output I see that the transcode command generates 7 processes,
all likely to be using CPU, and your DD slowdown is almost 7-fold... I
assume it all goes away if you nice transcode up by 3 or more.
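
For illustration, the suggested workaround amounts to something like this
(indicative commands only; the PIDs are taken from the top snapshot above,
and the transcode command line is abbreviated):

  nice -n 3 dr_exec transcode ...       # start the whole encode niced by 3
  renice 3 -p 5721 5725 5736 5738       # or renice the already-running PIDs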

> Hello Con and Ingo... I've found that the above problem goes away
> by reverting this:
>
> http://linux.bkbits.net:8080/linux-2.6/cset@41e054c6pwNQXzErMxvfh4IpLPXA5A?
>nav=index.html|src/|src/include|src/include/linux|related/include/linux/sche
>d.h
>
> --------------------------------------------------
>
> [PATCH] sched: remove_interactive_credit

The issue is that the scheduler's interactivity estimator is a state machine and
can be fooled to some degree: a CPU-intensive task that just happens to
sleep a little bit gets a significantly better priority than one that is fully
CPU bound all the time. Reverting that change is not a solution, because the
estimator can still be fooled by the same process sleeping a lot for the first
few seconds after startup and then switching to the mostly-CPU-bound,
slightly-sleeping behaviour. This "fluctuating" behaviour is, in my opinion,
worse, which is why I removed it.
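
As a concrete illustration of the pattern described above (a minimal sketch
only, not code from the scheduler or from any mail in this thread):

#include <unistd.h>

/*
 * Almost fully CPU bound, but sleeps very briefly each cycle.  On the
 * 2.6 interactivity estimator that short sleep is credited as sleep
 * time, so the task can end up with a better priority than a task
 * that never sleeps at all.
 */
int main(void)
{
	volatile unsigned long sink = 0;
	unsigned long i;

	for (;;) {
		/* burn CPU for a while */
		for (i = 0; i < 100000000UL; i++)
			sink += i;

		/* then sleep just a little -- the "fooling" part */
		usleep(1);
	}
	return 0;
}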

Cheers,
Con



2005-12-27 23:59:17

by Peter Williams

Subject: Re: [SCHED] Totally WRONG priority calculation with specific test-case (since 2.6.10-bk12)

Paolo Ornati wrote:
> On Tue, 27 Dec 2005 19:09:18 +0100
> Paolo Ornati <[email protected]> wrote:
>
>
>>Hello,
>>
>>I've found an easy-to-reproduce-for-me test case that shows a totally
>>wrong priority calculation: basically a CPU-intensitive process gets
>>better priority than a disk-intensitive one (dd if=bigfile
>>of=/dev/null ...).
>>
>>Seems impossible, isn't it?
>>
>>---- THE NUMBERS with 2.6.15-rc7 -----
>>
>>The test-case is the Xvid encoding of dvd-ripped track with transcode
>>(using "dvd::rip" interface). The copied-and-pasted command line is
>>this:
>>
>>mkdir -m 0775 -p '/home/paolo/tmp/test/tmp' &&
>>cd /home/paolo/tmp/test/tmp && dr_exec transcode -H 10 -a 2 -x vob,null
>>-i /home/paolo/tmp/test/vob/003 -w 1198,50 -b 128,0,0 -s 1.972
>>--a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 1 -y xvid4,null
>>-o /dev/null --print_status 20 && echo DVDRIP_SUCCESS mkdir -m 0775 -p
>>'/home/paolo/tmp/test/tmp' && cd /home/paolo/tmp/test/tmp && dr_exec
>>transcode -H 10 -a 2 -x vob -i /home/paolo/tmp/test/vob/003 -w 1198,50
>>-b 128,0,0 -s 1.972 --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 2 -y
>>xvid4 -o /home/paolo/tmp/test/avi/003/test-003.avi --print_status 20 &&
>>echo DVDRIP_SUCCESS
>>
>>
>>Here there is a TOP snapshot while running it:
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 5721 paolo 16 0 115m 18m 2428 R 84.4 3.7 0:15.11 transcode
>> 5736 paolo 25 0 50352 4516 1912 R 8.4 0.9 0:01.53 tcdecode
>> 5725 paolo 15 0 115m 18m 2428 S 4.6 3.7 0:00.84 transcode
>> 5738 paolo 18 0 115m 18m 2428 S 0.8 3.7 0:00.15 transcode
>> 5734 paolo 25 0 20356 1140 920 S 0.6 0.2 0:00.12 tcdemux
>> 5731 paolo 25 0 47312 2540 1996 R 0.4 0.5 0:00.08 tcdecode
>> 5319 root 15 0 166m 16m 2584 S 0.2 3.2 0:25.06 X
>> 5444 paolo 16 0 87116 22m 15m R 0.2 4.6 0:04.05 konsole
>> 5716 paolo 16 0 10424 1160 876 R 0.2 0.2 0:00.06 top
>> 5735 paolo 25 0 22364 1436 932 S 0.2 0.3 0:00.01 tcextract
>>
>>
>>DD running alone:
>>
>>paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
>>128+0 records in
>>128+0 records out
>>
>>real 0m4.052s
>>user 0m0.000s
>>sys 0m0.209s
>>
>>DD while transcoding:
>>
>>paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
>>128+0 records in
>>128+0 records out
>>
>>real 0m26.121s
>>user 0m0.001s
>>sys 0m0.255s
>>
>>---------------------------------------
>>
>>I've tried older kernels finding that 2.6.11 is the first affected.
>>
>>Going on with testing...
>>
>> 2.6.11-rc[1-5]:
>>2.6.11-rc3 bad
>>2.6.11-rc1 bad
>>
>> 2.6.10-bk[1-14]
>>2.6.10-bk7 good
>>2.6.10-bk11 good
>>2.6.10-bk13 bad
>>2.6.10-bk12 bad
>>
>>So the problem was introduced with:
>> >> 2.6.10-bk12 09-Jan-2005 <<
>>
>>The exact behaviour is different with 2.6.11/12/13/14... for example:
>>with 2.6.11 the priority of "transcode" is initially set to ~25 and go
>>down to 17/18 when running DD.
>>
>>The problem doesn't seem 100% reproducible with every kernel, sometimes
>>a "BAD" kernel looks "GOOD"... or maybe it was me confused by too
>>much compile/install/reboot/test work ;)
>>
>>Other INFO:
>>- I'm on x86_64
>>- preemption ON/OFF doesn't make any differences
>>
>>
>>Can anyone reproduce this?
>>IOW: is this affecting only my machine?
>>
>
>
> Hello Con and Ingo... I've found that the above problem goes away
> by reverting this:
>
> http://linux.bkbits.net:8080/linux-2.6/cset@41e054c6pwNQXzErMxvfh4IpLPXA5A?nav=index.html|src/|src/include|src/include/linux|related/include/linux/sched.h
>
> --------------------------------------------------
>
> [PATCH] sched: remove_interactive_credit
>
> Special casing tasks by interactive credit was helpful for preventing fully
> cpu bound tasks from easily rising to interactive status.
>
> However it did not select out tasks that had periods of being fully cpu
> bound and then sleeping while waiting on pipes, signals etc. This led to a
> more disproportionate share of cpu time.
>
> Backing this out will no longer special case only fully cpu bound tasks,
> and prevents the variable behaviour that occurs at startup before tasks
> declare themseleves interactive or not, and speeds up application startup
> slightly under certain circumstances. It does cost in interactivity
> slightly as load rises but it is worth it for the fairness gains.
>
> Signed-off-by: Con Kolivas <[email protected]>
> Acked-by: Ingo Molnar <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
>
> --------------------------------------------------
>
>
> Maybe this change has revealed a scheduler weakness ?
>
> I'm glad to test any patch or give more data :)
>
> Bye,
>

Any chance of you applying the PlugSched patches and seeing how the
other schedulers that it contains handle this situation?

The patch at:

<http://prdownloads.sourceforge.net/cpuse/plugsched-6.1.6-for-2.6.15-rc5.patch?download>

should apply without problems to the 2.6.15-rc7 kernel.

Very Brief Documentation:

You can select a default scheduler at kernel build time. If you wish to
boot with a scheduler other than the default it can be selected at boot
time by adding:

cpusched=<scheduler>

to the boot command line where <scheduler> is one of: ingosched,
nicksched, staircase, spa_no_frills, spa_ws, spa_svr or zaphod. If you
don't change the default when you build the kernel the default scheduler
will be ingosched (which is the normal scheduler).

The scheduler in force on a running system can be determined by the
contents of:

/proc/scheduler

Control parameters for the scheduler can be read/set via files in:

/sys/cpusched/<scheduler>/
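
For example, to try nicksched (an indicative session; the bootloader entry
shown is hypothetical, the paths are the ones described above):

  # in the bootloader config, append the parameter to the kernel line:
  kernel /boot/vmlinuz-2.6.15-rc7 root=/dev/sda1 ro cpusched=nicksched

  # after rebooting, confirm which scheduler is in force:
  cat /proc/scheduler

  # and look at its control parameters:
  ls /sys/cpusched/nicksched/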

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2005-12-28 10:21:06

by Paolo Ornati

Subject: Re: [SCHED] Totally WRONG priority calculation with specific test-case (since 2.6.10-bk12)

On Wed, 28 Dec 2005 10:59:13 +1100
Peter Williams <[email protected]> wrote:

> Any chance of you applying the PlugSched patches and seeing how the
> other schedulers that it contains handle this situation?
>
> The patch at:
>
> <http://prdownloads.sourceforge.net/cpuse/plugsched-6.1.6-for-2.6.15-rc5.patch?download>
>
> should apply without problems to the 2.6.15-rc7 kernel.
>
> Very Brief Documentation:
>
> You can select a default scheduler at kernel build time. If you wish to
> boot with a scheduler other than the default it can be selected at boot
> time by adding:
>
> cpusched=<scheduler>
>
> to the boot command line where <scheduler> is one of: ingosched,
> nicksched, staircase, spa_no_frills, spa_ws, spa_svr or zaphod. If you
> don't change the default when you build the kernel the default scheduler
> will be ingosched (which is the normal scheduler).


First of all, this is the "pstree" structure of transcode and friends:

|-kdesktop---perl---sh---transcode-+-2*[sh-+-tccat]
|                                  |       |-tcdecode]
|                                  |       |-tcdemux]
|                                  |       `-tcextract]
|                                  `-transcode---5*[transcode]


Results with various schedulers:

------------------------------------------------------------------------

1) nicksched: perfect! This is the behaviour I want.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5562 paolo 40 0 115m 18m 2428 R 82.2 3.7 0:22.16 transcode
5576 paolo 26 0 50348 4516 1912 S 9.5 0.9 0:02.43 tcdecode
5566 paolo 23 0 115m 18m 2428 S 4.6 3.7 0:01.24 transcode
5573 paolo 21 0 115m 18m 2428 S 0.9 3.7 0:00.22 transcode
5577 paolo 27 0 20356 1140 920 S 0.9 0.2 0:00.21 tcdemux
5295 root 20 0 167m 17m 3624 S 0.6 3.5 0:11.02 X
5579 paolo 20 0 47308 2540 1996 S 0.5 0.5 0:00.14 tcdecode
5574 paolo 20 0 20356 1144 920 S 0.4 0.2 0:00.11 tcdemux
...

transcode gets recognized for what it is, and I/O-bound processes
don't even notice that it is running :)


2) staircase: bad, as you can see:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5582 paolo 26 0 115m 18m 2428 R 82.7 3.7 0:47.63 transcode
5599 paolo 39 0 50352 4516 1912 R 9.6 0.9 0:05.21 tcdecode
5586 paolo 0 0 115m 18m 2428 S 4.5 3.7 0:02.61 transcode
5622 paolo 39 0 4948 1520 412 R 1.1 0.3 0:00.15 dd
5591 paolo 0 0 115m 18m 2428 S 0.6 3.7 0:00.36 transcode
5575 paolo 0 0 98476 37m 9392 S 0.4 7.5 0:01.44 perl
5597 paolo 27 0 20356 1144 920 S 0.4 0.2 0:00.21 tcdemux
5475 paolo 0 0 86556 22m 15m S 0.2 4.5 0:01.24 konsole
5388 root 0 0 167m 17m 3208 S 0.1 3.4 0:03.16 X
5587 paolo 0 0 115m 18m 2428 S 0.1 3.7 0:00.03 transcode
5595 paolo 20 0 47312 2540 1996 S 0.1 0.5 0:00.14 tcdecode
5596 paolo 26 0 22672 1268 1020 S 0.1 0.2 0:00.03 tccat
5598 paolo 28 0 22364 1436 932 S 0.1 0.3 0:00.04 tcextract


And "DD" is affected badly:

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
128+0 records in
128+0 records out

real 0m6.341s
user 0m0.002s
sys 0m0.229s

While transcoding:

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/
256+0 records in
256+0 records out

real 0m15.793s
user 0m0.001s
sys 0m0.374s


3) spa_no_frills: bad, but this is OK since it is Round Robin :)

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5356 paolo 20 0 115m 18m 2428 R 81.1 3.7 0:27.61 transcode
5371 paolo 20 0 50348 4516 1912 R 8.9 0.9 0:02.97 tcdecode
5360 paolo 20 0 115m 18m 2428 S 4.1 3.7 0:01.54 transcode
5378 paolo 20 0 4948 1520 412 D 1.4 0.3 0:00.29 dd
5364 paolo 20 0 20352 1144 920 S 0.9 0.2 0:00.20 tcdemux
5373 paolo 20 0 115m 18m 2428 S 0.7 3.7 0:00.32 transcode
5369 paolo 20 0 20356 1144 920 S 0.5 0.2 0:00.14 tcdemux
5205 root 20 0 165m 15m 2584 R 0.2 3.2 0:01.86 X


4) spa_ws: bad

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5334 paolo 32 0 115m 18m 2428 R 82.7 3.7 0:18.77 transcode
5349 paolo 32 0 50348 4516 1912 R 8.9 0.9 0:02.00 tcdecode
5338 paolo 21 0 115m 18m 2428 S 4.6 3.7 0:01.08 transcode
5356 paolo 32 0 4948 1520 412 D 1.1 0.3 0:00.12 dd
5351 paolo 32 0 115m 18m 2428 S 1.0 3.7 0:00.20 transcode
5199 root 21 0 165m 15m 2584 S 0.4 3.2 0:01.68 X
5347 paolo 32 0 20356 1140 920 S 0.4 0.2 0:00.08 tcdemux
5296 paolo 22 0 98472 37m 9392 S 0.2 7.5 0:01.47 perl
5299 paolo 21 0 86556 22m 15m S 0.2 4.4 0:00.75 konsole
5344 paolo 32 0 47308 2540 1996 S 0.2 0.5 0:00.07 tcdecode
5339 paolo 21 0 115m 18m 2428 S 0.1 3.7 0:00.01 transcode

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/
256+0 records in
256+0 records out

real 0m8.112s
user 0m0.001s
sys 0m0.444s

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/
256+0 records in
256+0 records out

real 0m29.222s
user 0m0.000s
sys 0m0.400s


5) spa_svr: surprise, surprise! Not all that bad. At least DD
gets better priority than transcode... and DD real time is only a bit
affected (8s --> ~9s).


PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5334 paolo 33 0 115m 18m 2428 R 78.1 3.7 0:22.70 transcode
5349 paolo 28 0 50352 4516 1912 S 9.0 0.9 0:02.41 tcdecode
5338 paolo 25 0 115m 18m 2428 S 4.7 3.7 0:01.29 transcode
5363 paolo 27 0 4952 1520 412 R 4.7 0.3 0:00.25 dd
5342 paolo 33 0 20352 1140 920 S 1.6 0.2 0:00.21 tcdemux
5351 paolo 25 0 115m 18m 2428 S 0.8 3.7 0:00.23 transcode
5144 root 22 0 166m 16m 3120 S 0.4 3.3 0:01.85 X
5344 paolo 23 0 47308 2540 1996 S 0.4 0.5 0:00.13 tcdecode
5347 paolo 27 0 20356 1144 920 S 0.4 0.2 0:00.10 tcdemux
5231 paolo 22 0 86660 22m 15m S 0.2 4.5 0:00.95 konsole
5271 paolo 25 0 98476 37m 9396 S 0.2 7.5 0:01.54 perl
5341 paolo 23 0 22672 1268 1020 S 0.2 0.2 0:00.02 tccat


6) zaphod: more or less like spa_svr

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5308 paolo 34 0 115m 18m 2428 R 52.1 3.7 0:49.77 transcode
5323 paolo 32 0 50352 4516 1912 S 6.0 0.9 0:05.61 tcdecode
5356 paolo 28 0 4952 1520 412 D 3.5 0.3 0:00.28 dd
5312 paolo 28 0 115m 18m 2428 S 2.6 3.7 0:02.71 transcode
5325 paolo 31 0 115m 18m 2428 S 0.7 3.7 0:00.55 transcode
5316 paolo 37 0 20352 1140 920 S 0.4 0.2 0:00.33 tcdemux
5202 root 23 0 165m 15m 2584 S 0.2 3.1 0:01.57 X
5318 paolo 31 0 47312 2540 1996 S 0.2 0.5 0:00.28 tcdecode
5321 paolo 33 0 20356 1144 920 S 0.2 0.2 0:00.26 tcdemux
4760 messageb 25 0 13248 1068 848 S 0.1 0.2 0:00.07 dbus-daemon-1
5264 paolo 24 0 93920 17m 10m S 0.1 3.5 0:00.38 kded
5282 paolo 23 0 92712 19m 12m S 0.1 3.9 0:00.36 kdesktop


7) ingosched: bad, as already said in the original post

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5209 paolo 16 0 115m 18m 2428 R 72.0 3.7 0:22.13 transcode
5224 paolo 22 0 50348 4516 1912 R 8.4 0.9 0:02.44 tcdecode
5213 paolo 15 0 115m 18m 2428 S 4.2 3.7 0:01.24 transcode
5243 paolo 18 0 4948 1520 412 R 1.8 0.3 0:00.14 dd
5217 paolo 19 0 20356 1144 920 R 0.8 0.2 0:00.19 tcdemux
5108 root 15 0 165m 15m 2584 S 0.6 3.1 0:01.44 X
5226 paolo 15 0 115m 18m 2428 S 0.6 3.7 0:00.20 transcode
5216 paolo 18 0 22676 1268 1020 S 0.4 0.2 0:00.03 tccat
5219 paolo 18 0 47312 2540 1996 R 0.4 0.5 0:00.12 tcdecode
5222 paolo 18 0 20356 1144 920 S 0.4 0.2 0:00.10 tcdemux
5195 paolo 16 0 98488 37m 9392 S 0.2 7.5 0:01.41 perl
5198 paolo 16 0 86552 22m 15m R 0.2 4.4 0:00.66 konsole

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/
256+0 records in
256+0 records out

real 0m23.393s (instead of 8s)
user 0m0.001s
sys 0m0.418s

------------------------------------------------------------------------


So the winner, by a clear margin, is "nicksched": it looks to me
even better than 2.6.10-bk12 (ingosched) with
"remove_interactive_credit" reverted.

:)

--
Paolo Ornati
Linux 2.6.15-rc5-plugsched on x86_64

2005-12-28 11:00:07

by Paolo Ornati

Subject: Re: [SCHED] Totally WRONG priority calculation with specific test-case (since 2.6.10-bk12)

On Wed, 28 Dec 2005 10:26:58 +1100
Con Kolivas <[email protected]> wrote:

> Looking at your top output I see that transcode command generates 7 processes
> all likely to be using cpu, and your DD slowdown is almost 7 times... I
> assume it all goes away if you nice the transcode up by 3 or more.

Yes, if I nice everything to 3 or more (nice -n 3 dvdrip ...) it works,
but I would prefer a scheduler that isn't fooled so easily (see the other
email, with results for the various schedulers).

Another thing I've noticed is that things tend to get worse when
"transcode" has been running for a long time (some hours). It has
happened to me a few times (with ingosched, and also with staircase if
I remember correctly): after some hours of running transcode (with me
away from the machine) I found a totally UNUSABLE system. Transcode was
the king of the machine and everything else got almost no CPU time.
Switching to a text console took something like 10s. When I was finally
logged in as root I reniced transcode and company to +19 and the system
was usable again ;)

To make things even stranger: another time this happened I did the same
thing, except that I reniced them to "0" (the same nice level they were
already running at) ---> and the system became usable again (with the
usual slowdown, but still usable).

This is what I remember. Now I think we can agree that there is
something wrong... no?

:)

--
Paolo Ornati
Linux 2.6.15-rc5-plugsched on x86_64

2005-12-28 11:19:43

by Con Kolivas

Subject: Re: [SCHED] Totally WRONG priority calculation with specific test-case (since 2.6.10-bk12)

On Wednesday 28 December 2005 22:01, Paolo Ornati wrote:
> after some hours of running transcode (with me away from the
> machine) I've found a totally UNUSABLE system. Transcode was the king
> of the machine and everything else get almost no CPU time. Switching to
> a Text-Console takes something like 10s (or something like that). When
> I was finally logged in as root I've reniced transcode and companyto +19
> and the system was usable again ;)
>
> To get things even more STRANGE: another time that this happened I've
> done the same thing except that I've reniced them to "0" (the same nice
> level they were running) ---> And the system became usable again (with
> the usual slow down but still usable).
>
> This is what I remember. Now I think we can agree that there is
> something wrong... no?

This latter thing sounds more like your transcode job pushed everything out to
swap... You need to instrument this case better.

Con

2005-12-28 11:34:24

by Paolo Ornati

Subject: Re: [SCHED] Totally WRONG priority calculation with specific test-case (since 2.6.10-bk12)

On Wed, 28 Dec 2005 22:19:23 +1100
Con Kolivas <[email protected]> wrote:

> This latter thing sounds more like your transcode job pushed everything out to
> swap... You need to instrument this case better.
>

I don't know. The combination of swapped-out programs + the "normal"
priority strangeness can potentially result in a total disaster... but
then why does renicing transcode to "0" get the system usable again?

Next time I'll grab "/proc/meminfo"... what other info would help in
understanding this?

--
Paolo Ornati
Linux 2.6.15-rc5-plugsched on x86_64

2005-12-28 13:38:12

by Peter Williams

Subject: Re: [SCHED] Totally WRONG priority calculation with specific test-case (since 2.6.10-bk12)

Paolo Ornati wrote:
> On Wed, 28 Dec 2005 10:59:13 +1100
> Peter Williams <[email protected]> wrote:
>
>
>>Any chance of you applying the PlugSched patches and seeing how the
>>other schedulers that it contains handle this situation?
>>
>>The patch at:
>>
>><http://prdownloads.sourceforge.net/cpuse/plugsched-6.1.6-for-2.6.15-rc5.patch?download>
>>
>>should apply without problems to the 2.6.15-rc7 kernel.
>>
>>Very Brief Documentation:
>>
>>You can select a default scheduler at kernel build time. If you wish to
>>boot with a scheduler other than the default it can be selected at boot
>>time by adding:
>>
>>cpusched=<scheduler>
>>
>>to the boot command line where <scheduler> is one of: ingosched,
>>nicksched, staircase, spa_no_frills, spa_ws, spa_svr or zaphod. If you
>>don't change the default when you build the kernel the default scheduler
>>will be ingosched (which is the normal scheduler).
>
>
>
> First of all, this is the "pstree" structure of transcode an friends:
>
> |-kdesktop---perl---sh---transcode-+-2*[sh-+-tccat]
> | | |-tcdecode]
> | | |-tcdemux]
> | | `-tcextract]
> | `-transcode---5*[transcode]
>
>
> Results with various schedulers:

First, thanks for doing this.

>
> ------------------------------------------------------------------------
>
> 1) nicksched: perfect! This is the behaviour I want.
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5562 paolo 40 0 115m 18m 2428 R 82.2 3.7 0:22.16 transcode
> 5576 paolo 26 0 50348 4516 1912 S 9.5 0.9 0:02.43 tcdecode
> 5566 paolo 23 0 115m 18m 2428 S 4.6 3.7 0:01.24 transcode
> 5573 paolo 21 0 115m 18m 2428 S 0.9 3.7 0:00.22 transcode
> 5577 paolo 27 0 20356 1140 920 S 0.9 0.2 0:00.21 tcdemux
> 5295 root 20 0 167m 17m 3624 S 0.6 3.5 0:11.02 X
> 5579 paolo 20 0 47308 2540 1996 S 0.5 0.5 0:00.14 tcdecode
> 5574 paolo 20 0 20356 1144 920 S 0.4 0.2 0:00.11 tcdemux
> ...
>
> transcode get recognized for what it is, and I/O bounded processes
> don't even notice that it is running :)

Interesting. This one's more or less a dead scheduler and hasn't had
any development work done on it for some time. I just keep porting the
original version to new kernels.

>
>
> 2) staircase: bad, as you can see:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5582 paolo 26 0 115m 18m 2428 R 82.7 3.7 0:47.63 transcode
> 5599 paolo 39 0 50352 4516 1912 R 9.6 0.9 0:05.21 tcdecode
> 5586 paolo 0 0 115m 18m 2428 S 4.5 3.7 0:02.61 transcode
> 5622 paolo 39 0 4948 1520 412 R 1.1 0.3 0:00.15 dd
> 5591 paolo 0 0 115m 18m 2428 S 0.6 3.7 0:00.36 transcode
> 5575 paolo 0 0 98476 37m 9392 S 0.4 7.5 0:01.44 perl
> 5597 paolo 27 0 20356 1144 920 S 0.4 0.2 0:00.21 tcdemux
> 5475 paolo 0 0 86556 22m 15m S 0.2 4.5 0:01.24 konsole
> 5388 root 0 0 167m 17m 3208 S 0.1 3.4 0:03.16 X
> 5587 paolo 0 0 115m 18m 2428 S 0.1 3.7 0:00.03 transcode
> 5595 paolo 20 0 47312 2540 1996 S 0.1 0.5 0:00.14 tcdecode
> 5596 paolo 26 0 22672 1268 1020 S 0.1 0.2 0:00.03 tccat
> 5598 paolo 28 0 22364 1436 932 S 0.1 0.3 0:00.04 tcextract
>
>
> And "DD" is affected badly:
>
> paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile
> of=/dev/null bs=1M count=128; umount space/ 128+0 records in
> 128+0 records out
>
> real 0m6.341s
> user 0m0.002s
> sys 0m0.229s
>
> While transcoding:
>
> paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile
> of=/dev/null bs=1M count=256; umount space/ 256+0 records in
> 256+0 records out
>
> real 0m15.793s
> user 0m0.001s
> sys 0m0.374s
>
>
> 3) spa_no_frills: bad, but this is OK since it is Round Robin :)
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5356 paolo 20 0 115m 18m 2428 R 81.1 3.7 0:27.61 transcode
> 5371 paolo 20 0 50348 4516 1912 R 8.9 0.9 0:02.97 tcdecode
> 5360 paolo 20 0 115m 18m 2428 S 4.1 3.7 0:01.54 transcode
> 5378 paolo 20 0 4948 1520 412 D 1.4 0.3 0:00.29 dd
> 5364 paolo 20 0 20352 1144 920 S 0.9 0.2 0:00.20 tcdemux
> 5373 paolo 20 0 115m 18m 2428 S 0.7 3.7 0:00.32 transcode
> 5369 paolo 20 0 20356 1144 920 S 0.5 0.2 0:00.14 tcdemux
> 5205 root 20 0 165m 15m 2584 R 0.2 3.2 0:01.86 X
>

Yes, no surprises there.

>
> 4) spa_ws: bad
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5334 paolo 32 0 115m 18m 2428 R 82.7 3.7 0:18.77 transcode
> 5349 paolo 32 0 50348 4516 1912 R 8.9 0.9 0:02.00 tcdecode
> 5338 paolo 21 0 115m 18m 2428 S 4.6 3.7 0:01.08 transcode
> 5356 paolo 32 0 4948 1520 412 D 1.1 0.3 0:00.12 dd
> 5351 paolo 32 0 115m 18m 2428 S 1.0 3.7 0:00.20 transcode
> 5199 root 21 0 165m 15m 2584 S 0.4 3.2 0:01.68 X
> 5347 paolo 32 0 20356 1140 920 S 0.4 0.2 0:00.08 tcdemux
> 5296 paolo 22 0 98472 37m 9392 S 0.2 7.5 0:01.47 perl
> 5299 paolo 21 0 86556 22m 15m S 0.2 4.4 0:00.75 konsole
> 5344 paolo 32 0 47308 2540 1996 S 0.2 0.5 0:00.07 tcdecode
> 5339 paolo 21 0 115m 18m 2428 S 0.1 3.7 0:00.01 transcode
>
> paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile
> of=/dev/null bs=1M count=256; umount space/ 256+0 records in
> 256+0 records out
>
> real 0m8.112s
> user 0m0.001s
> sys 0m0.444s
>
> paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile
> of=/dev/null bs=1M count=256; umount space/ 256+0 records in
> 256+0 records out
>
> real 0m29.222s
> user 0m0.000s
> sys 0m0.400s

This one is aimed purely at good interactive responsiveness (i.e.
keyboard, mouse, X server and media players such as rhythmbox/xmms), so no
real surprises here either.

>
>
> 5) spa_svr: surprise, surprise! Not all that bad. At least DD
> gets better priority than transcode... and DD real time is only a bit
> affected (8s --> ~9s).
>

This will be the "throughput bonus" in action. Its overall aim is to
reduce the time tasks spend on the runqueue waiting for CPU access,
a.k.a. delay. It does this by using the system load and the average
amount of CPU time that the task uses each scheduling cycle to estimate
the expected delay for the task, and it gives the task a bonus if the
actual average delay being experienced is bigger than this value.

It's intended for server systems, not interactive systems, as reducing
overall delay isn't necessarily good for interactive systems, where the
aim is to quell the user's impatience by giving good latency to the
interactive tasks. These aims aren't always compatible.
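
A rough sketch of the heuristic as described (the names and the exact
formula here are illustrative guesses, not the actual spa_svr code):

#define MAX_THROUGHPUT_BONUS 5	/* illustrative scale, not the real constant */

/*
 * Estimate the delay a task "should" see given the load, and award a
 * bonus when its measured average delay is worse than that.
 */
static int throughput_bonus(unsigned long avg_cpu_ns_per_cycle,
			    unsigned long avg_delay_ns,
			    unsigned int nr_running)
{
	/* with N runnable tasks, waiting for the other N - 1 is "expected" */
	unsigned long expected_delay_ns =
		avg_cpu_ns_per_cycle * (nr_running ? nr_running - 1 : 0);

	if (avg_delay_ns <= expected_delay_ns)
		return 0;

	/* scale the bonus by how far the actual delay exceeds the estimate */
	return (int)(((avg_delay_ns - expected_delay_ns) *
		      MAX_THROUGHPUT_BONUS) / avg_delay_ns);
}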

>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5334 paolo 33 0 115m 18m 2428 R 78.1 3.7 0:22.70 transcode
> 5349 paolo 28 0 50352 4516 1912 S 9.0 0.9 0:02.41 tcdecode
> 5338 paolo 25 0 115m 18m 2428 S 4.7 3.7 0:01.29 transcode
> 5363 paolo 27 0 4952 1520 412 R 4.7 0.3 0:00.25 dd
> 5342 paolo 33 0 20352 1140 920 S 1.6 0.2 0:00.21 tcdemux
> 5351 paolo 25 0 115m 18m 2428 S 0.8 3.7 0:00.23 transcode
> 5144 root 22 0 166m 16m 3120 S 0.4 3.3 0:01.85 X
> 5344 paolo 23 0 47308 2540 1996 S 0.4 0.5 0:00.13 tcdecode
> 5347 paolo 27 0 20356 1144 920 S 0.4 0.2 0:00.10 tcdemux
> 5231 paolo 22 0 86660 22m 15m S 0.2 4.5 0:00.95 konsole
> 5271 paolo 25 0 98476 37m 9396 S 0.2 7.5 0:01.54 perl
> 5341 paolo 23 0 22672 1268 1020 S 0.2 0.2 0:00.02 tccat
>
>
> 6) zaphod: more or less like spa_svr

Zaphod includes the throughput bonus in its armoury, which is why it is
similar in performance to spa_svr.

>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5308 paolo 34 0 115m 18m 2428 R 52.1 3.7 0:49.77 transcode
> 5323 paolo 32 0 50352 4516 1912 S 6.0 0.9 0:05.61 tcdecode
> 5356 paolo 28 0 4952 1520 412 D 3.5 0.3 0:00.28 dd
> 5312 paolo 28 0 115m 18m 2428 S 2.6 3.7 0:02.71 transcode
> 5325 paolo 31 0 115m 18m 2428 S 0.7 3.7 0:00.55 transcode
> 5316 paolo 37 0 20352 1140 920 S 0.4 0.2 0:00.33 tcdemux
> 5202 root 23 0 165m 15m 2584 S 0.2 3.1 0:01.57 X
> 5318 paolo 31 0 47312 2540 1996 S 0.2 0.5 0:00.28 tcdecode
> 5321 paolo 33 0 20356 1144 920 S 0.2 0.2 0:00.26 tcdemux
> 4760 messageb 25 0 13248 1068 848 S 0.1 0.2 0:00.07
> dbus-daemon-1 5264 paolo 24 0 93920 17m 10m S 0.1 3.5
> 0:00.38 kded 5282 paolo 23 0 92712 19m 12m S 0.1 3.9
> 0:00.36 kdesktop
>
>
> 7) ingosched: bad, as already said in the original post
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5209 paolo 16 0 115m 18m 2428 R 72.0 3.7 0:22.13 transcode
> 5224 paolo 22 0 50348 4516 1912 R 8.4 0.9 0:02.44 tcdecode
> 5213 paolo 15 0 115m 18m 2428 S 4.2 3.7 0:01.24 transcode
> 5243 paolo 18 0 4948 1520 412 R 1.8 0.3 0:00.14 dd
> 5217 paolo 19 0 20356 1144 920 R 0.8 0.2 0:00.19 tcdemux
> 5108 root 15 0 165m 15m 2584 S 0.6 3.1 0:01.44 X
> 5226 paolo 15 0 115m 18m 2428 S 0.6 3.7 0:00.20 transcode
> 5216 paolo 18 0 22676 1268 1020 S 0.4 0.2 0:00.03 tccat
> 5219 paolo 18 0 47312 2540 1996 R 0.4 0.5 0:00.12 tcdecode
> 5222 paolo 18 0 20356 1144 920 S 0.4 0.2 0:00.10 tcdemux
> 5195 paolo 16 0 98488 37m 9392 S 0.2 7.5 0:01.41 perl
> 5198 paolo 16 0 86552 22m 15m R 0.2 4.4 0:00.66 konsole
>
> paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/
> 256+0 records in
> 256+0 records out
>
> real 0m23.393s (instead of 8s)
> user 0m0.001s
> sys 0m0.418s
>
> ------------------------------------------------------------------------
>
>
> So the winner for manifest superiority is "nicksched", it looks to me
> even better than 2.6.10-bk12 (ingosched) with
> "remove_interactive_credit" reverted.

Thanks for this data. It will enable me to make some mods to the
spa_xxx and zaphod schedulers.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2005-12-28 17:22:04

by Paolo Ornati

Subject: Re: [SCHED] Totally WRONG priority calculation with specific test-case (since 2.6.10-bk12)

On Wed, 28 Dec 2005 12:35:47 +0100
Paolo Ornati <[email protected]> wrote:

> > This latter thing sounds more like your transcode job pushed everything out to
> > swap... You need to instrument this case better.
> >
>
> I don't know. The combination Swapped Out Programs + "normal" priority
> strangeness can potentially result in a total disaster... but why
> renicing transcode to "0" gets the system usable again?
>
> Next time I'll grab "/proc/meminfo"... and what other info can help to
> understand?

I'm doing some tests with "ingosched" and a somewhat longer transcoding
session, and I've found a strange thing.

DD running time can change a LOT between one run and another (always
while transcode is running):

----------------------------------------------------------------------
paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/
256+0 records in
256+0 records out

real 0m15.754s
user 0m0.000s
sys 0m0.500s

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/
256+0 records in
256+0 records out

real 0m52.966s
user 0m0.000s
sys 0m0.504s

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/
256+0 records in
256+0 records out

real 0m48.928s
user 0m0.004s
sys 0m0.524s
---------------------------------------------------------------------

Looking at top while running these, I've seen that this is related to
how the priority of transcode (the one that eats a lot of CPU) changes.

When only transcoding, its priority is 16, but when the DD test is also
running that value fluctuates between 16 and 18.

DD's priority is always 18 instead.

Usually transcode's prio goes to 17/18 and DD runs in 15-20s, but
sometimes it doesn't fluctuate, staying stuck at 16, and DD runs in ~50s.


PS:
I'm not yet able to reproduce the "totally unusable system" I was
talking about.

--
Paolo Ornati
Linux 2.6.15-rc7-plugsched on x86_64

2005-12-28 17:38:05

by Paolo Ornati

Subject: Re: [SCHED] Totally WRONG priority calculation with specific test-case (since 2.6.10-bk12)

On Wed, 28 Dec 2005 18:23:23 +0100
Paolo Ornati <[email protected]> wrote:

> Usually transcode's prio go to 17/18 and DD runs in 15/20s, but
> sometimes it doesn't fluctuate staying stuck to 16 and DD runs in ~50s.

And now I've noticed that when that priority stops fluctuating, it stops
forever. Running the DD test again and again doesn't change anything!

For some reason (running time?) the transcode priority is stuck at 16
and DD always performs very badly: ~50s (normally it should be 8s).

When I noticed this, the transcode test had been running for about
35-40 minutes.

--
Paolo Ornati
Linux 2.6.15-rc7-plugsched on x86_64

2005-12-28 19:43:47

by Paolo Ornati

Subject: Re: [SCHED] Totally WRONG priority calculation with specific test-case (since 2.6.10-bk12)

On Thu, 29 Dec 2005 00:38:08 +1100
Peter Williams <[email protected]> wrote:

> Thanks for this data. It will enable me to make some mods to the
> spa_xxx and zaphod schedulers.

Ok, but keep in mind that these numbers are just "snapshots". With
almost all schedulers the priority of the CPU-eating transcode and of the
other processes fluctuates a bit (an exception here is nicksched, which
gives priority 40 to transcode and never changes it).

It also seems that small variations in priority can seriously affect my
DD test case's running time (especially, I think, with schedulers that
give "transcode" a better-or-equal priority compared to "dd" -->
ingosched/staircase).

This is another mail, in which you weren't CC'd, that explains it better:

http://www.ussg.iu.edu/hypermail/linux/kernel/0512.3/0647.html

--
Paolo Ornati
Linux 2.6.15-rc7-plugsched on x86_64

2005-12-29 03:13:41

by Nick Piggin

Subject: Re: [SCHED] Totally WRONG priority calculation with specific test-case (since 2.6.10-bk12)

Peter Williams wrote:
> Paolo Ornati wrote:

>> 1) nicksched: perfect! This is the behaviour I want.
>>
...
>>
>> transcode get recognized for what it is, and I/O bounded processes
>> don't even notice that it is running :)
>
>
> Interesting. This one's more or less a dead scheduler and hasn't had
> any development work done on it for some time. I just keep porting the
> original version to new kernels.
>

It isn't a dead scheduler any more than any of the other out of tree
schedulers are (which isn't saying much, unfortunately).

I've probably got a small number of cleanups and microoptimisations
relative to what you have (I can't remember exactly what you sucked up)
... but other than that there hasn't been much development work done for
some time because there is not much wrong with it.

--
SUSE Labs, Novell Inc.


2005-12-29 03:35:38

by Peter Williams

Subject: Re: [SCHED] Totally WRONG priority calculation with specific test-case (since 2.6.10-bk12)

Nick Piggin wrote:
> Peter Williams wrote:
>
>> Paolo Ornati wrote:
>
>
>>> 1) nicksched: perfect! This is the behaviour I want.
>>>
> ...
>
>>>
>>> transcode get recognized for what it is, and I/O bounded processes
>>> don't even notice that it is running :)
>>
>>
>>
>> Interesting. This one's more or less a dead scheduler and hasn't had
>> any development work done on it for some time. I just keep porting
>> the original version to new kernels.
>>
>
> It isn't a dead scheduler any more than any of the other out of tree
> schedulers are (which isn't saying much, unfortunately).

Ingosched, staircase and my SPA schedulers are all evolving slowly.
Are there any out there that I don't have in PlugSched that you think
should be?

>
> I've probably got a small number of cleanups and microoptimisations
> relative to what you have (I can't remember exactly what you sucked up)
> ... but other than that there hasn't been much development work done for
> some time because there is not much wrong with it.
>

I was starting to think that you'd lost interest in this which is why I
said it was more or less dead. Sorry.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2005-12-29 08:11:23

by Nick Piggin

Subject: Re: [SCHED] Totally WRONG priority calculation with specific test-case (since 2.6.10-bk12)

Peter Williams wrote:
> Nick Piggin wrote:

>> It isn't a dead scheduler any more than any of the other out of tree
>> schedulers are (which isn't saying much, unfortunately).
>
>
> Ingosched, staircase and my SPA schedulers are all evolving slowly.
> Are there any out there that I don't have in PlugSched that you think
> should be?
>

Not that I know of...

>>
>> I've probably got a small number of cleanups and microoptimisations
>> relative to what you have (I can't remember exactly what you sucked up)
>> ... but other than that there hasn't been much development work done for
>> some time because there is not much wrong with it.
>>
>
> I was starting to think that you'd lost interest in this which is why I
> said it was more or less dead. Sorry.
>

No worries. I haven't lost interest so much as people seem to be fairly
happy with the current scheduler and at least aren't busting my door down
for updates to nicksched ;)

I'll do a resynch for 2.6.15 though.

--
SUSE Labs, Novell Inc.


2005-12-30 13:53:46

by Paolo Ornati

Subject: [SCHED] wrong priority calc - SIMPLE test case

WAS: [SCHED] Totally WRONG priority calculation with specific test-case
(since 2.6.10-bk12)
http://lkml.org/lkml/2005/12/27/114/index.html

On Wed, 28 Dec 2005 10:26:58 +1100
Con Kolivas <[email protected]> wrote:

> The issue is that the scheduler interactivity estimator is a state machine and
> can be fooled to some degree, and a cpu intensive task that just happens to
> sleep a little bit gets significantly better priority than one that is fully
> cpu bound all the time. Reverting that change is not a solution because it
> can still be fooled by the same process sleeping lots for a few seconds or so
> at startup and then changing to the cpu mostly-sleeping slightly behaviour.
> This "fluctuating" behaviour is in my opinion worse which is why I removed
> it.

Trying to find an "as simple as possible" test case for this problem
(which I consider a BUG in the priority calculation), I've come up with
this very simple program:

------ sched_fooler.c -------------------------------
#include <stdlib.h>
#include <unistd.h>

static void burn_cpu(unsigned int x)
{
	static char buf[1024];
	int i;

	for (i = 0; i < x; ++i)
		buf[i % sizeof(buf)] = (x - i) * 3;
}

int main(int argc, char **argv)
{
	unsigned long burn;

	if (argc != 2)
		return 1;
	burn = (unsigned long)atoi(argv[1]);
	while (1) {
		burn_cpu(burn * 1000);
		usleep(1);
	}
	return 0;
}
----------------------------------------------

paolo@tux ~/tmp/sched_fooler $ gcc sched_fooler.c
paolo@tux ~/tmp/sched_fooler $ ./a.out 5000

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5633 paolo 15 0 2392 320 252 S 77.7 0.1 0:18.84 a.out
5225 root 15 0 169m 19m 2956 S 2.0 3.9 0:13.17 X
5307 paolo 15 0 100m 22m 13m S 2.0 4.4 0:04.32 kicker


paolo@tux ~/tmp/sched_fooler $ ./a.out 10000

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5637 paolo 16 0 2396 320 252 R 87.4 0.1 0:13.38 a.out
5312 paolo 16 0 86636 22m 15m R 0.1 4.5 0:02.02 konsole
1 root 16 0 2552 560 468 S 0.0 0.1 0:00.71 init


If I run just 2 of these together, the system becomes anything but
interactive ;)

./a.out 10000 & ./a.out 4000

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5714 paolo 15 0 2392 320 252 S 59.2 0.1 0:12.28 a.out
5713 paolo 16 0 2396 320 252 S 37.1 0.1 0:07.63 a.out


DD TEST: the usual DD test (reading 128 MB from a non-cached file =
disk bound) says it all; numbers with 2.6.15-rc7:

paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
128+0 records in
128+0 records out

real 0m4.052s
user 0m0.004s
sys 0m0.180s

START 2 OF THEM "./a.out 10000 & ./a.out 4000"

paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
128+0 records in
128+0 records out

real 2m4.578s
user 0m0.000s
sys 0m0.244s


This doesn't surprise me anymore, since DD gets priority 18 and these
two CPU eaters get 15/16.

As usual, "nicksched" is NOT affected at all; IOW, it gives these CPU
eaters the priority that they deserve.

--
Paolo Ornati
Linux 2.6.15-rc7-g3603bc8d on x86_64

2005-12-31 02:06:12

by Peter Williams

Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

Paolo Ornati wrote:
> WAS: [SCHED] Totally WRONG prority calculation with specific test-case
> (since 2.6.10-bk12)
> http://lkml.org/lkml/2005/12/27/114/index.html
>
> On Wed, 28 Dec 2005 10:26:58 +1100
> Con Kolivas <[email protected]> wrote:
>
>
>>The issue is that the scheduler interactivity estimator is a state machine and
>>can be fooled to some degree, and a cpu intensive task that just happens to
>>sleep a little bit gets significantly better priority than one that is fully
>>cpu bound all the time. Reverting that change is not a solution because it
>>can still be fooled by the same process sleeping lots for a few seconds or so
>>at startup and then changing to the cpu mostly-sleeping slightly behaviour.
>>This "fluctuating" behaviour is in my opinion worse which is why I removed
>>it.
>
>
> Trying to find a "as simple as possible" test case for this problem
> (that I consider a BUG in priority calculation) I've come up with this
> very simple program:
>
> ------ sched_fooler.c -------------------------------
> #include <stdlib.h>
> #include <unistd.h>
>
> static void burn_cpu(unsigned int x)
> {
> static char buf[1024];
> int i;
>
> for (i=0; i < x; ++i)
> buf[i%sizeof(buf)] = (x-i)*3;
> }
>
> int main(int argc, char **argv)
> {
> unsigned long burn;
> if (argc != 2)
> return 1;
> burn = (unsigned long)atoi(argv[1]);
> while(1) {
> burn_cpu(burn*1000);
> usleep(1);
> }
> return 0;
> }
> ----------------------------------------------
>
> paolo@tux ~/tmp/sched_fooler $ gcc sched_fooler.c
> paolo@tux ~/tmp/sched_fooler $ ./a.out 5000
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5633 paolo 15 0 2392 320 252 S 77.7 0.1 0:18.84 a.out
> 5225 root 15 0 169m 19m 2956 S 2.0 3.9 0:13.17 X
> 5307 paolo 15 0 100m 22m 13m S 2.0 4.4 0:04.32 kicker
>
>
> paolo@tux ~/tmp/sched_fooler $ ./a.out 10000
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5637 paolo 16 0 2396 320 252 R 87.4 0.1 0:13.38 a.out
> 5312 paolo 16 0 86636 22m 15m R 0.1 4.5 0:02.02 konsole
> 1 root 16 0 2552 560 468 S 0.0 0.1 0:00.71 init
>
>
> If I only run 2 of these together the system becomes everything but
> interactive ;)
>
> ./a.out 10000 & ./a.out 4000
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5714 paolo 15 0 2392 320 252 S 59.2 0.1 0:12.28 a.out
> 5713 paolo 16 0 2396 320 252 S 37.1 0.1 0:07.63 a.out
>
>
> DD TEST: the usual DD test (to read 128 MB from a non-cached file =
> disk bounded) says everything, numbers with 2.6.15-rc7:
>
> paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
> 128+0 records in
> 128+0 records out
>
> real 0m4.052s
> user 0m0.004s
> sys 0m0.180s
>
> START 2 OF THEM "./a.out 10000 & ./a.out 4000"
>
> paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
> 128+0 records in
> 128+0 records out
>
> real 2m4.578s
> user 0m0.000s
> sys 0m0.244s
>
>
> This does't surprise me anymore, since DD gets priority 18 and these
> two CPU eaters get 15/16.
>
> As usual "nicksched" is NOT affected at all, IOW it gives to these CPU
> eaters the priority that they deserve.
>

Attached is a testing version of a patch, which I'm working on, for
modifying scheduler bonus calculations. Although these changes aren't
targeted at the problem you are experiencing, I believe that they may
help. My testing shows that sched_fooler instances don't get any
bonuses, but I would appreciate it if you could try it out.

It is a patch against the 2.6.15-rc7 kernel and includes some other
scheduling patches from the -mm kernels.

Thanks
Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce


Attachments:
lial-test.patch (34.49 kB)

2005-12-31 08:13:40

by Mike Galbraith

Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

At 02:52 PM 12/30/2005 +0100, Paolo Ornati wrote:
>WAS: [SCHED] Totally WRONG prority calculation with specific test-case
>(since 2.6.10-bk12)
>http://lkml.org/lkml/2005/12/27/114/index.html
>
>On Wed, 28 Dec 2005 10:26:58 +1100
>Con Kolivas <[email protected]> wrote:
>
> > The issue is that the scheduler interactivity estimator is a state
> machine and
> > can be fooled to some degree, and a cpu intensive task that just
> happens to
> > sleep a little bit gets significantly better priority than one that is
> fully
> > cpu bound all the time. Reverting that change is not a solution because it
> > can still be fooled by the same process sleeping lots for a few seconds
> or so
> > at startup and then changing to the cpu mostly-sleeping slightly
> behaviour.
> > This "fluctuating" behaviour is in my opinion worse which is why I removed
> > it.
>
>Trying to find a "as simple as possible" test case for this problem
>(that I consider a BUG in priority calculation) I've come up with this
>very simple program:
>
>------ sched_fooler.c -------------------------------

Ingo seems to have done something in 2.6.15-rc7-rt1 which defeats your
little proggy. Taking a quick peek at the rt scheduler changes, nothing
poked me in the eye, but by golly, I can't get this kernel to act up,
whereas 2.6.14-virgin does.

-Mike (off to stare harder at the rt patch)

2005-12-31 10:34:49

by Paolo Ornati

Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Sat, 31 Dec 2005 13:06:04 +1100
Peter Williams <[email protected]> wrote:

> Attached is a testing version of a patch for modifying scheduler bonus
> calculations that I'm working on. Although these changes aren't
> targetted at the problem you are experiencing I believe that they may
> help. My testing shows that sched_fooler instances don't get any
> bonuses but I would appreciate it if you could try it out.
>
> It is a patch against the 2.6.15-rc7 kernel and includes some other
> scheduling patches from the -mm kernels.

Yes, this fixes both of my test cases (transcode & the little program):
they get priority 25 instead of ~16.

But the priority of DD is now ~23 and so it still suffers a bit:

paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
128+0 records in
128+0 records out

real 0m8.549s (instead of 4s)
user 0m0.000s
sys 0m0.209s

--
Paolo Ornati
Linux 2.6.15-rc7-lial on x86_64

2005-12-31 10:52:17

by Paolo Ornati

Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Sat, 31 Dec 2005 11:34:46 +0100
Paolo Ornati <[email protected]> wrote:

> > It is a patch against the 2.6.15-rc7 kernel and includes some other
> > scheduling patches from the -mm kernels.
>
> Yes, this fixes both my test-case (transcode & little program), they
> get priority 25 instead of ~16.
>
> But the priority of DD is now ~23 and so it still suffers a bit:

I forgot to mention that the other "interactive" processes don't get a
good priority either.

Xorg, for example, gets priority 23/24 while I'm only moving the cursor
around. And when the CPU eaters are running (at priority 25) it isn't
happy at all: the cursor begins to move in jerks, and so on...

--
Paolo Ornati
Linux 2.6.15-rc7-lial on x86_64

2005-12-31 11:00:49

by Paolo Ornati

Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Sat, 31 Dec 2005 09:13:24 +0100
Mike Galbraith <[email protected]> wrote:

> Ingo seems to have done something in 2.6.15-rc7-rt1 which defeats your
> little proggy. Taking a quick peek at the rt scheduler changes, nothing
> poked me in the eye, but by golly, I can't get this kernel to act up,
> whereas 2.6.14-virgin does.

Mmm... I get an infinite list of init segfaults trying to boot it. I've
tried disabling CONFIG_CC_OPTIMIZE_FOR_SIZE but it didn't help.

I'll try later with a simpler ".config".

--
Paolo Ornati
Linux 2.6.15-rc7-plugsched on x86_64

2005-12-31 11:12:20

by Con Kolivas

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Saturday 31 December 2005 21:52, Paolo Ornati wrote:
> On Sat, 31 Dec 2005 11:34:46 +0100
>
> Paolo Ornati <[email protected]> wrote:
> > > It is a patch against the 2.6.15-rc7 kernel and includes some other
> > > scheduling patches from the -mm kernels.
> >
> > Yes, this fixes both my test-case (transcode & little program), they
> > get priority 25 instead of ~16.
> >
> > But the priority of DD is now ~23 and so it still suffers a bit:
>
> I forgot to mention that even the others "interactive" processes
> don't get a good priority too.
>
> Xorg for example, while only moving the cursor around, gets priority
> 23/24. And when cpu-eaters are running (at priority 25) it isn't happy
> at all, the cursor begins to move in jerks and so on...

This is why Ingo, Nick and I think that a tweak to the heavily field-tested
current cpu scheduler is best for 2.6, rather than any gutting and
replacement of the interactivity estimator (which, even though this scheme is
simple and easy to understand, is clearly what it amounts to). Given that we
have a "2.6 forever" policy, it also means any significant cpu scheduler
rewrite, even just of the interactivity estimator, is nigh on impossible to
implement.

Cheers,
Con

2005-12-31 13:44:18

by Peter Williams

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

Paolo Ornati wrote:
> On Sat, 31 Dec 2005 11:34:46 +0100
> Paolo Ornati <[email protected]> wrote:
>
>
>>>It is a patch against the 2.6.15-rc7 kernel and includes some other
>>>scheduling patches from the -mm kernels.
>>
>>Yes, this fixes both my test-case (transcode & little program), they
>>get priority 25 instead of ~16.
>>
>>But the priority of DD is now ~23 and so it still suffers a bit:
>
>
> I forgot to mention that even the others "interactive" processes
> don't get a good priority too.
>
> Xorg for example, while only moving the cursor around, gets priority
> 23/24. And when cpu-eaters are running (at priority 25) it isn't happy
> at all, the cursor begins to move in jerks and so on...
>

OK. This probably means that the parameters that control the mechanism
need tweaking.

There should be a file /sys/cpusched/attrs/unacceptable_ia_latency which
contains the latency (in nanoseconds) that the scheduler considers
unacceptable for interactive programs. Try changing that value and see
if things improve. Making it smaller should help, but if you make it too
small all the interactive tasks will end up with the same priority and
this could cause them to get in each other's way.
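
For example (just an illustrative sketch, not part of the patch; a shell
echo into that file does the same thing, and the 5 ms value below is only
an arbitrary starting point):

---------------------------------------------
/* set_ia_latency.c -- write a new unacceptable_ia_latency (in ns) */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/cpusched/attrs/unacceptable_ia_latency";
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return 1;
    }
    /* 5 ms expressed in nanoseconds -- an arbitrary value to start from */
    fprintf(f, "%llu\n", 5000000ULL);
    return fclose(f) ? 1 : 0;
}
---------------------------------------------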

Thanks,
Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2005-12-31 15:12:08

by Paolo Ornati

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Sat, 31 Dec 2005 09:13:24 +0100
Mike Galbraith <[email protected]> wrote:

> Ingo seems to have done something in 2.6.15-rc7-rt1 which defeats your
> little proggy. Taking a quick peek at the rt scheduler changes, nothing
> poked me in the eye, but by golly, I can't get this kernel to act up,
> whereas 2.6.14-virgin does.

Ok, I've successfully booted 2.6.15-rc7-rt1 (I think I was
having trouble with Thread Softirqs and/or Thread Hardirqs).

First thing: I have preemption disabled, but it shouldn't matter too much
since we are talking about priority calculation...

1) My program isn't defeated at all. If I start it with the same args as
the previous examples it "seems" defeated, but it isn't.

Lowering the "cpu burn" argument I can reproduce the problem again:

"./a.out 200 & ./a.out 333"

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5607 paolo 15 0 2396 320 252 R 56.1 0.1 0:06.79 a.out
5606 paolo 15 0 2396 324 252 R 38.7 0.1 0:04.55 a.out
1 root 16 0 2556 552 468 S 0.0 0.1 0:00.28 init


2) Priority fluctuation - very interesting: playing with the only arg
my program takes, I've found this:

./a.out 200
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5628 paolo 15 0 2392 320 252 R 48.5 0.1 0:18.34 a.out

./a.out 300
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5633 paolo 15 0 2392 324 252 S 50.1 0.1 0:09.42 a.out

./a.out 400
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5634 paolo 15 0 2392 320 252 S 66.7 0.1 0:06.31 a.out

./a.out 500
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5638 paolo 25 0 2396 320 252 R 67.7 0.1 0:14.78 a.out

./a.out 700
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5640 paolo 15 0 2392 320 252 S 80.1 0.1 0:25.88 a.out

./a.out 800
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5644 paolo 17 0 2396 320 252 R 79.6 0.1 0:26.54 a.out


In the "./a.out 500" case, the priority starts at something like 16 and
then slowly go up to 25 _BUT_ if I start my DD test my cpu-eater
priority goes quickly to 16!

The real-world test case (transcode) is a bit harder to describe: its
priority usually goes up to 25; sometimes I've seen it fluctuating a
bit (like going to 19 and then back to 25).

When I start my DD test I've seen basically 2 different behaviours:

A) good
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5788 paolo 25 0 114m 18m 2440 R 82.2 3.7 0:10.16 transcode
5804 paolo 15 0 49860 4500 1896 S 8.5 0.9 0:00.99 tcdecode
5808 paolo 18 0 4952 1520 412 D 5.0 0.3 0:00.36 dd

B) bad
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5743 paolo 18 0 114m 18m 2440 R 75.0 3.7 0:26.79 transcode
5759 paolo 15 0 49864 4500 1896 S 7.8 0.9 0:02.71 tcdecode
5750 paolo 16 0 19840 1136 916 S 1.5 0.2 0:00.23 tcdemux
5201 root 15 0 167m 17m 3336 S 0.8 3.5 0:19.38 X
5764 paolo 18 0 4948 1520 412 R 0.7 0.3 0:00.04 dd

Sometimes A happens and sometimes B happens...

PS: probably all these numbers aren't 100% reproducible... this is what
happens on my PC.

--
Paolo Ornati
Linux 2.6.15-rc7-rt1 on x86_64

2005-12-31 16:31:35

by Paolo Ornati

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Sun, 01 Jan 2006 00:44:10 +1100
Peter Williams <[email protected]> wrote:

> OK. This probably means that the parameters that control the mechanism
> need tweaking.
>
> There should be a file /sys/cpusched/attrs/unacceptable_ia_latency which
> contains the latency (in nanoseconds) that the scheduler considers
> unacceptable for interactive programs. Try changing that value and see
> if things improve? Making it smaller should help but if you make it too
> small all the interactive tasks will end up with the same priority and
> this could cause them to get in each other's way.

I've tried different values and sometimes I've got a good feeling, BUT
the behaviour is too erratic to say anything definite.

Sometimes I get what I want (dd priority ~17 and CPU eaters prio
25), sometimes I get a total disaster (dd priority 17 and CPU eaters
prio 15/16) and sometimes I get something like DD prio 22 and CPU
eaters 23/24.

All this doesn't correlate well with the "unacceptable_ia_latency" values.

What I think is that the priority calculation in ingosched and other
schedulers is in general too weak, while in others it is rock
solid (read: nicksched).

Maybe it is just that the smarter a scheduler wants to be, the more fragile
it will be.

--
Paolo Ornati
Linux 2.6.15-rc7-lial on x86_64

2005-12-31 16:37:30

by Mike Galbraith

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

At 04:11 PM 12/31/2005 +0100, Paolo Ornati wrote:
>On Sat, 31 Dec 2005 09:13:24 +0100
>Mike Galbraith <[email protected]> wrote:
>
> > Ingo seems to have done something in 2.6.15-rc7-rt1 which defeats your
> > little proggy. Taking a quick peek at the rt scheduler changes, nothing
> > poked me in the eye, but by golly, I can't get this kernel to act up,
> > whereas 2.6.14-virgin does.
>
>Ok, I've sucessfully booted 2.6.15-rc7-rt1 (I think that I was
>having troubles with Thread Softirqs and/or Thread Hardirqs).
>
>First thing: I've preemption disabled, but it shouldn't matter too much
>since we are talking about priority calculation...

Mine is fully preemptible.

>1) My program isn't defeated at all. If I start it with the same args
>of the previous examples it "seems" defeated, but it isn't.
>
>Lowering the "cpu burn argument" I can reproduce the problem again:
>
>"./a.out 200 & ./a.out 333"
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5607 paolo 15 0 2396 320 252 R 56.1 0.1 0:06.79 a.out
> 5606 paolo 15 0 2396 324 252 R 38.7 0.1 0:04.55 a.out
> 1 root 16 0 2556 552 468 S 0.0 0.1 0:00.28 init

Strange. Using the exact same arguments, I do see some odd bouncing up to
high priorities, but they spend the vast majority of their time down at 25.

-Mike

2005-12-31 17:24:38

by Paolo Ornati

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Sat, 31 Dec 2005 17:37:11 +0100
Mike Galbraith <[email protected]> wrote:

> >"./a.out 200 & ./a.out 333"
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 5607 paolo 15 0 2396 320 252 R 56.1 0.1 0:06.79 a.out
> > 5606 paolo 15 0 2396 324 252 R 38.7 0.1 0:04.55 a.out
> > 1 root 16 0 2556 552 468 S 0.0 0.1 0:00.28 init
>
> Strange. Using the exact same arguments, I do see some odd bouncing up to
> high priorities, but they spend the vast majority of their time down at 25.

You shouldn't use "the same exact numbers", you should try different
args and see if you can reproduce the problem. Or maybe preemption
makes some difference... I'll try with PREEMPT enabled and see.

--
Paolo Ornati
Linux 2.6.15-rc7-plugsched on x86_64

2005-12-31 17:42:26

by Paolo Ornati

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Sat, 31 Dec 2005 18:24:40 +0100
Paolo Ornati <[email protected]> wrote:

> You shouldn't use "the same exact numbers", you should try different
> args and see if you can reproduce the problem. Or maybe preemption
> make some difference... I'll try with PREEMPT enabled and see.

Ok, just tried with Complete Preemption: I can easily reproduce the
problem.

For example:

"./a.out 1000 & ./a.out 1239"

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5433 paolo 16 0 2396 324 252 R 50.3 0.1 0:34.67 a.out
5434 paolo 16 0 2392 320 252 R 47.4 0.1 0:30.76 a.out
265 root -48 -5 0 0 0 S 0.6 0.0 0:00.68 IRQ 12
5261 root 15 0 166m 16m 2844 S 0.6 3.3 0:04.81 X
5349 paolo 15 0 86640 22m 15m S 0.6 4.5 0:01.36 konsole
5344 paolo 15 0 98.8m 20m 12m S 0.3 4.1 0:01.64 kicker
5444 paolo 18 0 4948 1520 412 R 0.3 0.3 0:00.05 dd

--
Paolo Ornati
Linux 2.6.15-rc7-rt1 on x86_64

2005-12-31 22:04:55

by Peter Williams

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

Paolo Ornati wrote:
> On Sun, 01 Jan 2006 00:44:10 +1100
> Peter Williams <[email protected]> wrote:
>
>
>>OK. This probably means that the parameters that control the mechanism
>>need tweaking.
>>
>>There should be a file /sys/cpusched/attrs/unacceptable_ia_latency which
>>contains the latency (in nanoseconds) that the scheduler considers
>>unacceptable for interactive programs. Try changing that value and see
>>if things improve? Making it smaller should help but if you make it too
>>small all the interactive tasks will end up with the same priority and
>>this could cause them to get in each other's way.
>
>
> I've tried different values and sometimes I've got a good feeling BUT
> the behaviour is too strange to say something.
>
> Sometimes I get what I want (dd priority ~17 and CPU eaters prio
> 25), sometimes I get a total disaster (dd priority 17 and CPU eaters
> prio 15/16) and sometimes I get something like DD prio 22 and CPU
> eaters 23/24.
>
> All this is not well related to "unacceptable_ia_latency" values.

OK. Thanks for trying it.

The feedback will be helpful in trying to improve the mechanisms.

>
> What I think is that the priority calculation in ingosched and other
> schedulers is in general too weak, while in other schedulers is rock
> solid (read: nicksched).
>
> Maybe is just that the smarter a scheduler want to be, the more fragile
> it will be.
>

Probably but this one is fairly simple.

I think the remaining problem with interactive responsiveness is that
bonuses increase too slowly when a latency problem is detected. I.e. a
task gets just one extra bonus point when an unacceptable latency is
detected, regardless of how big the latency is. This means that it may
take several cycles for the bonus to be big enough to solve the problem.
I'm going to try making the bonus increment proportional to the size
of the latency w.r.t. the limit.
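
Roughly along these lines (a sketch of the idea only, with made-up names,
not the actual code):

---------------------------------------------
/*
 * Sketch: scale the bonus increment by how far the observed latency
 * overshoots the acceptable limit, instead of always adding one point.
 */
static int ia_bonus_increment(unsigned long long latency_ns,
                              unsigned long long limit_ns,
                              int cur_bonus, int max_bonus)
{
    int inc;

    if (latency_ns <= limit_ns)
        return 0;                       /* within limits: no extra bonus */

    /* one point per multiple of the limit that was overshot */
    inc = (int)(latency_ns / limit_ns);

    if (cur_bonus + inc > max_bonus)
        inc = max_bonus - cur_bonus;    /* never exceed the cap */

    return inc;
}
---------------------------------------------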

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-01-01 11:39:32

by Paolo Ornati

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Sat, 31 Dec 2005 17:37:11 +0100
Mike Galbraith <[email protected]> wrote:

> Strange. Using the exact same arguments, I do see some odd bouncing up to
> high priorities, but they spend the vast majority of their time down at 25.

Mmmm... to make it more easily reproducible I've enlarged the sleep time
(1 microsecond is likely to be rounded too much and give different
results on different hardware/kernel/config...).

Compile this _without_ optimizations and try again:

------------------------------------------------
#include <stdlib.h>
#include <unistd.h>

char buf[1024];

/* burn roughly x iterations of cpu time */
static void burn_cpu(unsigned int x)
{
    unsigned int i;

    for (i = 0; i < x; ++i)
        buf[i % sizeof(buf)] = (x - i) * 3;
}

int main(int argc, char **argv)
{
    unsigned long burn;

    if (argc != 2)
        return 1;
    burn = (unsigned long)atoi(argv[1]);
    if (!burn)
        return 1;
    /* burn some cpu, then sleep a bit, forever */
    while (1) {
        burn_cpu(burn * 1000);
        usleep(10000);
    }
    return 0;
}
-----------------------------------------


With "./a.out 3000" (and 2.6.15-rc7-rt1) I get this:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5485 paolo 15 0 2396 320 252 R 62.7 0.1 0:09.77 a.out


Try different values: 1000, 2000, 3000 ... are you able to reproduce it
now?


If yes, try to start 2 of them with something like this:

"./a.out 3000 & ./a.out 3161"

so they are NOT synchronized and they use almost all the CPU time:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5582 paolo 16 0 2396 320 252 S 45.7 0.1 0:05.52 a.out
5583 paolo 15 0 2392 320 252 S 45.7 0.1 0:05.49 a.out

This is the bad situation I hate: some cpu-eaters that eat all the CPU
time BUT have a really good priority only because they sleep a bit.

--
Paolo Ornati
Linux 2.6.15-rc7-rt1 on x86_64

2006-01-02 09:15:52

by Mike Galbraith

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

At 12:39 PM 1/1/2006 +0100, Paolo Ornati wrote:
>On Sat, 31 Dec 2005 17:37:11 +0100
>Mike Galbraith <[email protected]> wrote:
>
> > Strange. Using the exact same arguments, I do see some odd bouncing up to
> > high priorities, but they spend the vast majority of their time down at 25.
>
>Mmmm... to make it more easly reproducible I've enlarged the sleep time
>(1 microsecond is likely to be rounded too much and give different
>results on different hardware/kernel/config...).
>
>Compile this _without_ optimizations and try again:

<snip>

>Try different values: 1000, 2000, 3000 ... are you able to reproduce it
>now?

Yeah. One instance running has to sustain roughly _95%_ cpu before it's
classified as a cpu piggy. Not good.

>If yes, try to start 2 of them with something like this:
>
>"./a.out 3000 & ./a.out 3161"
>
>so they are NOT syncronized and they use almost all the CPU time:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5582 paolo 16 0 2396 320 252 S 45.7 0.1 0:05.52 a.out
> 5583 paolo 15 0 2392 320 252 S 45.7 0.1 0:05.49 a.out
>
>This is the bad situation I hate: some cpu-eaters that eat all the CPU
>time BUT have a really good priority only because they sleeps a bit.

Yup, your proggy fools the interactivity estimator quite well. This
problem was addressed a long time ago, and thought to be more or less
cured. Guess not.

-Mike

2006-01-02 09:50:56

by Paolo Ornati

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Mon, 02 Jan 2006 10:15:43 +0100
Mike Galbraith <[email protected]> wrote:

> >This is the bad situation I hate: some cpu-eaters that eat all the CPU
> >time BUT have a really good priority only because they sleeps a bit.
>
> Yup, your proggy fools the interactivity estimator quite well. This
> problem was addressed a long time ago, and thought to be more or less
> cured. Guess not.

In my original real-life test case (transcode) I found that the problem
started with the removal of "interactive_credit":

http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/6aa5c93c379ae9e1/a9a83db6446edaf7?lnk=st&q=insubject%3Asched+author%3APaolo+author%3AOrnati&rnum=1&hl=en#a9a83db6446edaf7

This is not actually true... in fact that change only exposed the
problem for that particular test-case.

With my little proggy I'm now able to reproduce the problem even with
"interactive_credit" applied (for example with a 2.6.10 kernel).

That said, since "nicksched" doesn't have this problem at all, it
is a problem of ingosched (and others as well).

--
Paolo Ornati
Linux 2.6.15-rc7-plugsched on x86_64

2006-01-09 11:11:52

by Mike Galbraith

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

At 10:15 AM 1/2/2006 +0100, Mike Galbraith wrote:
>At 12:39 PM 1/1/2006 +0100, Paolo Ornati wrote:
>>On Sat, 31 Dec 2005 17:37:11 +0100
>>Mike Galbraith <[email protected]> wrote:
>>
>> > Strange. Using the exact same arguments, I do see some odd bouncing up to
>> > high priorities, but they spend the vast majority of their time down
>> at 25.
>>
>>Mmmm... to make it more easly reproducible I've enlarged the sleep time
>>(1 microsecond is likely to be rounded too much and give different
>>results on different hardware/kernel/config...).
>>
>>Compile this _without_ optimizations and try again:
>
><snip>
>
>>Try different values: 1000, 2000, 3000 ... are you able to reproduce it
>>now?
>
>Yeah. One instance running has to sustain roughly _95%_ cpu before it's
>classified as a cpu piggy. Not good.
>
>>If yes, try to start 2 of them with something like this:
>>
>>"./a.out 3000 & ./a.out 3161"
>>
>>so they are NOT syncronized and they use almost all the CPU time:
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 5582 paolo 16 0 2396 320 252 S 45.7 0.1 0:05.52 a.out
>> 5583 paolo 15 0 2392 320 252 S 45.7 0.1 0:05.49 a.out
>>
>>This is the bad situation I hate: some cpu-eaters that eat all the CPU
>>time BUT have a really good priority only because they sleeps a bit.
>
>Yup, your proggy fools the interactivity estimator quite well. This
>problem was addressed a long time ago, and thought to be more or less
>cured. Guess not.

Care to try an experiment? I'd be very interested in knowing if the
attached patch cures the real-life problem you were investigating.

It attempts to catch tasks which the interactivity logic has misidentified,
and "pull their plug". It maintains a running plausibility check
(slice_avg) against sleep_avg, and if a sustained disparity appears, cuts
off a cpu burning task's supply of bonus points such that it has to "run on
battery" until the disparity decreases to within acceptable limits.

Obviously, anything that affects fairness _will_ affect interactivity to
some degree. This simple bolt-on throttle has delayed initiation and
accelerated release in the hope of keeping its impact acceptable. After
some initial testing, it doesn't _seem_ to suck.
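
The core of the idea, boiled down to a sketch (made-up names, not the code
in the attached patch):

---------------------------------------------
/*
 * Plausibility check sketch: a task whose credited sleep (sleep_avg)
 * stays well above the cpu time it is observed to consume (slice_avg)
 * is implausible, so its supply of bonus points gets cut off until
 * the two agree again within a grace margin.
 */
struct acct {
    unsigned long sleep_avg;    /* credited sleep, ns */
    unsigned long slice_avg;    /* measured cpu consumption, ns */
};

static int bonus_throttled(const struct acct *p, unsigned long grace_ns)
{
    return p->sleep_avg > p->slice_avg + grace_ns;
}
---------------------------------------------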

-Mike


Attachments:
sched_throttle (3.54 kB)

2006-01-09 15:52:46

by Mike Galbraith

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

At 12:11 PM 1/9/2006 +0100, Mike Galbraith wrote:

>Care to try an experiment?...

Oops. I guess I should send one that's not mixed p1 and p0. Sorry about
that :-/

Anyway, if anyone wants to see a functional demonstration, just try
this. Remove the TASK_NONINTERACTIVE in fs/pipe.c in both the stock kernel
and this modified one so Davide Libenzi's excellent sleep pattern exploit
(irman2) can work [1], and do the below all at the same time ...

make -j4 bzImage
irman2
thud 3

With the stock kernel, I got bored after half an hour and stopped the
kernel build. It had produced 40 .o files. The modified kernel finished
in 20 minutes, vs the 8 minutes it takes to produce the same 504 .o files
when not under load.

-Mike

1. It just so happens that Davide wrote irman2 using pipes... he could
have done something else. If anyone doesn't think this is a fair test,
just use Paolo's much simpler exploit instead. The result will be about
the same.


Attachments:
sched_throttle (3.57 kB)
irman2.c (4.01 kB)
thud.c (1.25 kB)

2006-01-09 16:08:26

by Con Kolivas

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Tuesday 10 January 2006 02:52, Mike Galbraith wrote:
> At 12:11 PM 1/9/2006 +0100, Mike Galbraith wrote:
> >Care to try an experiment?...
>
> Oops. I guess I should send one that's not mixed p1 and p0. Sorry about
> that :-/
>
> Anyway, if anyone wants to see a functional demonstration, just try
> this. Remove the TASK_NONINTERACTIVE in fs/pipe.c in both the stock kernel
> and this modified one so Davide Libenzi's excellent sleep pattern exploit
> (irman2) can work [1], and do the below all at the same time ...
>
> make -j4 bzImage
> irman2
> thud 3
>
> With the stock kernel, I got bored after a half an hour, and stopped the
> kernel build. It had produced 40 .o files. The modified kernel finished
> in 20 minutes vs the 8 minutes it took to produce the same 504 .o files if
> not under load.
>
> -Mike
>
> 1. it just so happens that Davide wrote irman2 using pipes... he could
> have done something else. if anyone doesn't think this is a fair test,
> just use Paolo's much simpler exploit instead. the result will be about
> the same.

Want to try with a few other schedulers using plugsched?

Con

2006-01-09 18:14:28

by Mike Galbraith

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

At 03:08 AM 1/10/2006 +1100, Con Kolivas wrote:
>On Tuesday 10 January 2006 02:52, Mike Galbraith wrote:
> >
> > Anyway, if anyone wants to see a functional demonstration, just try
> > this. Remove the TASK_NONINTERACTIVE in fs/pipe.c in both the stock kernel
> > and this modified one so Davide Libenzi's excellent sleep pattern exploit
> > (irman2) can work [1], and do the below all at the same time ...
> >
> > make -j4 bzImage
> > irman2
> > thud 3
> >...
>
>
>Want to try with a few other schedulers using plugsched?

Not really. These exploits were specifically aimed at the current
scheduler's Achilles Heel. Besides, that carries implications of replacing
this one, and I'm pretty fond of it.

-Mike

2006-01-09 20:01:14

by Paolo Ornati

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Mon, 09 Jan 2006 16:52:17 +0100
Mike Galbraith <[email protected]> wrote:

> >Care to try an experiment?...

Yes.

With my simple proggy things improve a bit:

./a.out 7000 & ./a.out 6537 &
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5579 paolo 19 0 2392 288 228 R 51.9 0.1 0:22.38 a.out
5578 paolo 20 0 2392 288 228 R 43.9 0.1 0:22.50 a.out

As you can see they don't get priority 15/16 anymore :)

DD test: 256 MB, ~8.5s (instead of 8)

In practice, the more CPU they use, the more their priority is penalized...


BUT if I start more of them (3/4) I'm able to fool it.

"./a.out 7000 & ./a.out 6537 & ./a.out 6347 & ./a.out 5873"

2 TOP's snapshots:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5625 paolo 17 0 2392 288 228 R 31.6 0.1 0:10.74 a.out
5626 paolo 17 0 2392 288 228 R 28.8 0.1 0:09.16 a.out
5627 paolo 17 0 2392 288 228 R 22.2 0.1 0:07.59 a.out
5624 paolo 17 0 2392 288 228 R 17.4 0.1 0:08.67 a.out

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5626 paolo 16 0 2392 288 228 R 30.1 0.1 0:39.95 a.out
5627 paolo 16 0 2392 288 228 R 24.1 0.1 0:34.93 a.out
5625 paolo 18 0 2392 288 228 R 23.5 0.1 0:37.53 a.out
5624 paolo 18 0 2392 288 228 R 21.9 0.1 0:37.60 a.out
5193 root 15 0 167m 17m 2916 S 0.2 3.5 0:09.67 X
5638 paolo 18 0 4952 1468 372 R 0.2 0.3 0:00.15 dd

DD test (256MB): real 3m37.122s (instead of 8s)



REAL LIFE TEST (transcode)

While running only transcode it gets priority 25:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5857 paolo 25 0 114m 18m 2424 R 90.9 3.7 0:14.28 transcode
5873 paolo 19 0 49860 4452 1860 S 8.6 0.9 0:01.40 tcdecode
5308 paolo 16 0 86796 22m 15m R 0.2 4.4 0:06.26 konsole
5687 paolo 16 0 98648 37m 9348 S 0.2 7.5 0:02.11 perl
5872 paolo 24 0 21864 1064 600 S 0.2 0.2 0:00.01 tcextract


But if I also run the DD test, "transcode" priority starts fluctuating
and can go down to 18/19 (from time to time), interfering with DD:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5694 paolo 19 0 114m 18m 2424 R 75.1 3.7 0:42.29 transcode
5710 paolo 25 0 49856 4452 1860 R 8.0 0.9 0:04.36 tcdecode
5726 paolo 18 0 4952 1468 372 R 4.0 0.3 0:00.77 dd


This seems to happen because transcode is also reading from disk (not
directly but through pipes), so the massive disk usage of DD interferes
with it; this leads to transcode using less CPU and getting a better
priority.

The exact behaviour changes from time to time... but it seems to confirm
my theory.

I don't know how "nicksched" can keep transcode priority always at 40
even when I'm running the DD test... I should retry and see.


PS: yes, transcode is reading from disk, but SLOWLY... I think that good
read-ahead should fulfill its needs even while doing the HD-stressing
DD test, no?

--
Paolo Ornati
Linux 2.6.15-sched_trottle on x86_64

2006-01-09 20:23:13

by Paolo Ornati

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Mon, 9 Jan 2006 21:00:35 +0100
Paolo Ornati <[email protected]> wrote:

> I don't know how can "nicksched" keep transcode priority always to 40
> even when I'm running the DD test... I should retry and see.

I was wrong... the interfering effect is present with nicksched too.

The problem is that Linux is smarter than me --> sometimes I forget the
effect of the disk-cache ;)

For the DD test I always umount/(re)mount the partition where the
BIGFILE is stored to discard the associated cache... but I've forgotten
to do the same for the file that transcode is reading from ;)

--
Paolo Ornati
Linux 2.6.15-plugsched on x86_64

2006-01-10 07:08:15

by Mike Galbraith

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

At 09:00 PM 1/9/2006 +0100, Paolo Ornati wrote:
>On Mon, 09 Jan 2006 16:52:17 +0100
>Mike Galbraith <[email protected]> wrote:
>
> > >Care to try an experiment?...
>
>Yes.

Thanks.

>With my simple proggy things improve a bit:

<snip rest of good news>

>BUT if I start more of them (3/4) I'm able to fool it.
>
>"./a.out 7000 & ./a.out 6537 & ./a.out 6347 & ./a.out 5873"
>
>2 TOP's snapshots:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5625 paolo 17 0 2392 288 228 R 31.6 0.1 0:10.74 a.out
> 5626 paolo 17 0 2392 288 228 R 28.8 0.1 0:09.16 a.out
> 5627 paolo 17 0 2392 288 228 R 22.2 0.1 0:07.59 a.out
> 5624 paolo 17 0 2392 288 228 R 17.4 0.1 0:08.67 a.out
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5626 paolo 16 0 2392 288 228 R 30.1 0.1 0:39.95 a.out
> 5627 paolo 16 0 2392 288 228 R 24.1 0.1 0:34.93 a.out
> 5625 paolo 18 0 2392 288 228 R 23.5 0.1 0:37.53 a.out
> 5624 paolo 18 0 2392 288 228 R 21.9 0.1 0:37.60 a.out
> 5193 root 15 0 167m 17m 2916 S 0.2 3.5 0:09.67 X
> 5638 paolo 18 0 4952 1468 372 R 0.2 0.3 0:00.15 dd
>
>DD test (256MB): real 3m37.122s (instead of 8s)

Ok, I'll take another look. Those should be being throttled.

>REAL LIFE TEST (transcode)
>
>While running only transcode it gets priority 25:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5857 paolo 25 0 114m 18m 2424 R 90.9 3.7 0:14.28 transcode
> 5873 paolo 19 0 49860 4452 1860 S 8.6 0.9 0:01.40 tcdecode
> 5308 paolo 16 0 86796 22m 15m R 0.2 4.4 0:06.26 konsole
> 5687 paolo 16 0 98648 37m 9348 S 0.2 7.5 0:02.11 perl
> 5872 paolo 24 0 21864 1064 600 S 0.2 0.2 0:00.01 tcextract
>
>
>But if I run also the DD test, "transcode" priority start fluctuating
>and can go down to 18/19 (from time to time) interfering with DD:

19 shouldn't interfere from the cpu side, but 18 will.

> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5694 paolo 19 0 114m 18m 2424 R 75.1 3.7 0:42.29 transcode
> 5710 paolo 25 0 49856 4452 1860 R 8.0 0.9 0:04.36 tcdecode
> 5726 paolo 18 0 4952 1468 372 R 4.0 0.3 0:00.77 dd
>
>
>This seems to happen because also transcode is reading (not directly but
>through pipes) from disk so the massive disk usage of DD interferes
>with it, this leads to transcode using less CPU and getting better
>priority.

It can't be pipe waits, they're disabled in the kernel. Most likely it's
the credit we get for being activated without having yet been selected.

>The exact behaviour changes time to time... but seems to confirm my
>teory.
>
>I don't know how can "nicksched" keep transcode priority always to 40
>even when I'm running the DD test... I should retry and see.
>
>
>PS: yes, transcode is reading from disk, but SLOWLY... i think that a
>good read-ahead should fullfill his needs even when doing the HD
>stressing DD test, no?

Dunno.

-Mike

2006-01-10 12:07:41

by Mike Galbraith

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

At 08:08 AM 1/10/2006 +0100, Mike Galbraith wrote:

>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 5626 paolo 16 0 2392 288 228 R 30.1 0.1 0:39.95 a.out
>> 5627 paolo 16 0 2392 288 228 R 24.1 0.1 0:34.93 a.out
>> 5625 paolo 18 0 2392 288 228 R 23.5 0.1 0:37.53 a.out
>> 5624 paolo 18 0 2392 288 228 R 21.9 0.1 0:37.60 a.out
>> 5193 root 15 0 167m 17m 2916 S 0.2 3.5 0:09.67 X
>> 5638 paolo 18 0 4952 1468 372 R 0.2 0.3 0:00.15 dd
>>
>>DD test (256MB): real 3m37.122s (instead of 8s)
>
>Ok, I'll take another look. Those should be being throttled.

Can you please try this version? It tries harder to correct any
imbalance... hopefully not too hard. In case you didn't notice, you need
to let your exploits run for a bit before the throttling will take
effect. I intentionally start tasks off at 0 cpu usage, so it takes a bit
to come up to its real usage.

-Mike

2006-01-10 12:56:08

by Paolo Ornati

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Tue, 10 Jan 2006 13:07:33 +0100
Mike Galbraith <[email protected]> wrote:

> Can you please try this version? It tries harder to correct any

It seems that you have forgotten to attach the patch...

> imbalance... hopefully not too hard. In case you didn't notice, you need
> to let your exploits run for a bit before the throttling will take
> effect. I intentionally start tasks off at 0 cpu usage, so it takes a bit
> to come up to it's real usage.

Yes, I've seen it.

--
Paolo Ornati
Linux 2.6.15-plugsched on x86_64

2006-01-10 13:01:49

by Mike Galbraith

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

At 01:56 PM 1/10/2006 +0100, Paolo Ornati wrote:
>On Tue, 10 Jan 2006 13:07:33 +0100
>Mike Galbraith <[email protected]> wrote:
>
> > Can you please try this version? It tries harder to correct any
>
>It seems that you have forgotten the to attach the patch...

Drat. At least I'm not the first to ever do so :)

(double checks) It's now attached.

-Mike


Attachments:
sched_throttle2 (4.32 kB)

2006-01-10 13:55:27

by Paolo Ornati

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Tue, 10 Jan 2006 14:01:36 +0100
Mike Galbraith <[email protected]> wrote:

> > > Can you please try this version? It tries harder to correct any
> >
> >It seems that you have forgotten the to attach the patch...
>
> Drat. At least I'm not the first to ever do so :)

This version basically works like the previous one, except that it makes
the priority adjustment faster (that is fine).

However I can fool it the same way.

"./a.out 7000"
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5459 paolo 22 0 2392 288 228 S 71.3 0.1 0:09.47 a.out

"./a.out 3000"
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5493 paolo 19 0 2396 292 228 R 49.8 0.1 0:14.42 a.out

"./a.out 1500"
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5495 paolo 18 0 2396 288 228 S 33.4 0.1 0:09.60 a.out


Fooling it:

"./a.out 7000 & ./a.out 6537 & ./a.out 6347 & ./a.out 5873 &"
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5502 paolo 19 0 2392 288 228 R 27.0 0.1 0:05.64 a.out
5503 paolo 19 0 2396 288 228 R 26.0 0.1 0:07.50 a.out
5505 paolo 19 0 2396 292 228 R 25.6 0.1 0:07.24 a.out
5504 paolo 18 0 2392 288 228 R 21.0 0.1 0:06.78 a.out

(priorities fluctuate between 18/19)


Again with more of them:

./a.out 7000 & ./a.out 6537 & ./a.out 6347 & ./a.out 5873& ./a.out 6245 & ./a.out 5467 &

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5525 paolo 18 0 2396 288 228 R 26.4 0.1 0:07.48 a.out
5521 paolo 19 0 2396 288 228 R 22.0 0.1 0:09.00 a.out
5524 paolo 19 0 2392 288 228 R 19.6 0.1 0:07.21 a.out
5523 paolo 19 0 2392 288 228 R 13.0 0.1 0:10.60 a.out
5520 paolo 19 0 2392 288 228 R 11.0 0.1 0:08.46 a.out
5522 paolo 19 0 2396 288 228 R 7.8 0.1 0:07.14 a.out

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5528 paolo 18 0 2392 288 228 R 19.7 0.1 0:18.15 a.out
5533 paolo 15 0 2396 288 228 S 19.3 0.1 0:19.12 a.out
5531 paolo 18 0 2396 288 228 R 18.5 0.1 0:19.23 a.out
5532 paolo 17 0 2392 288 228 R 15.1 0.1 0:18.55 a.out
5529 paolo 18 0 2396 288 228 R 14.7 0.1 0:13.05 a.out
5530 paolo 18 0 2392 288 228 R 12.5 0.1 0:20.42 a.out

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5530 paolo 18 0 2392 288 228 R 21.0 0.1 0:25.42 a.out
5533 paolo 18 0 2396 288 228 R 20.2 0.1 0:24.75 a.out
5529 paolo 18 0 2396 288 228 R 16.2 0.1 0:17.68 a.out
5532 paolo 18 0 2392 288 228 R 14.8 0.1 0:23.33 a.out
5531 paolo 18 0 2396 288 228 R 14.4 0.1 0:23.96 a.out
5528 paolo 18 0 2392 288 228 R 13.6 0.1 0:23.03 a.out


--
Paolo Ornati
Linux 2.6.15-sched_trottle2 on x86_64

2006-01-10 15:18:23

by Mike Galbraith

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

At 02:53 PM 1/10/2006 +0100, Paolo Ornati wrote:
>On Tue, 10 Jan 2006 14:01:36 +0100
>Mike Galbraith <[email protected]> wrote:
>
> > > > Can you please try this version? It tries harder to correct any
> > >
> > >It seems that you have forgotten the to attach the patch...
> >
> > Drat. At least I'm not the first to ever do so :)
>
>This version basically works like the the previous, except that it makes
>the priority adjustment faster (that is fine).
>
>However I can fool it the same way.
>
>"./a.out 7000"
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5459 paolo 22 0 2392 288 228 S 71.3 0.1 0:09.47 a.out
>
>"./a.out 3000"
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5493 paolo 19 0 2396 292 228 R 49.8 0.1 0:14.42 a.out
>
>"./a.out 1500"
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5495 paolo 18 0 2396 288 228 S 33.4 0.1 0:09.60 a.out
>
>
>Fooling it:
>
>"./a.out 7000 & ./a.out 6537 & ./a.out 6347 & ./a.out 5873 &"
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5502 paolo 19 0 2392 288 228 R 27.0 0.1 0:05.64 a.out
> 5503 paolo 19 0 2396 288 228 R 26.0 0.1 0:07.50 a.out
> 5505 paolo 19 0 2396 292 228 R 25.6 0.1 0:07.24 a.out
> 5504 paolo 18 0 2392 288 228 R 21.0 0.1 0:06.78 a.out
>
>(priorities fluctuate between 18/19)
>
>
>Again with more of them:
>
>./a.out 7000 & ./a.out 6537 & ./a.out 6347 & ./a.out 5873& ./a.out 6245 &
>./a.out 5467 &
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5525 paolo 18 0 2396 288 228 R 26.4 0.1 0:07.48 a.out
> 5521 paolo 19 0 2396 288 228 R 22.0 0.1 0:09.00 a.out
> 5524 paolo 19 0 2392 288 228 R 19.6 0.1 0:07.21 a.out
> 5523 paolo 19 0 2392 288 228 R 13.0 0.1 0:10.60 a.out
> 5520 paolo 19 0 2392 288 228 R 11.0 0.1 0:08.46 a.out
> 5522 paolo 19 0 2396 288 228 R 7.8 0.1 0:07.14 a.out
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5528 paolo 18 0 2392 288 228 R 19.7 0.1 0:18.15 a.out
> 5533 paolo 15 0 2396 288 228 S 19.3 0.1 0:19.12 a.out
> 5531 paolo 18 0 2396 288 228 R 18.5 0.1 0:19.23 a.out
> 5532 paolo 17 0 2392 288 228 R 15.1 0.1 0:18.55 a.out
> 5529 paolo 18 0 2396 288 228 R 14.7 0.1 0:13.05 a.out
> 5530 paolo 18 0 2392 288 228 R 12.5 0.1 0:20.42 a.out
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5530 paolo 18 0 2392 288 228 R 21.0 0.1 0:25.42 a.out
> 5533 paolo 18 0 2396 288 228 R 20.2 0.1 0:24.75 a.out
> 5529 paolo 18 0 2396 288 228 R 16.2 0.1 0:17.68 a.out
> 5532 paolo 18 0 2392 288 228 R 14.8 0.1 0:23.33 a.out
> 5531 paolo 18 0 2396 288 228 R 14.4 0.1 0:23.96 a.out
> 5528 paolo 18 0 2392 288 228 R 13.6 0.1 0:23.03 a.out

I don't think it's being fooled; it's pulling these guys down. It's just
that the plausible limit goes up the more tasks you're sharing the cpu
with, so they hit the throttle less. If you start a bunch of tasks that
only do a short burn, such that their individual cpu usage isn't much but
together they use all available cpu, you can still severely starve
non-sleeping == low-priority tasks. This patch won't help with that...
that's an entirely different problem, and fortunately one that doesn't seem
to happen much in real life, otherwise folks would be bitching like
crazy. To guard against it, there are a lot of different things you can
do. All of the ways that I know of are very bad for the desktop user, and,
fortunately for me, beyond the intended scope of this patch :)
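
Schematically (purely illustrative, not code from the patch): with N tasks
sharing the cpu, the amount of apparent sleep that still looks plausible
per task grows with N, so each individual short-burner slips under the
check more easily.

---------------------------------------------
/* Sketch: the plausible-sleep limit scales with the runqueue length. */
static unsigned long plausible_sleep_limit(unsigned long base_ns,
                                           unsigned int nr_running)
{
    return base_ns * (nr_running ? nr_running : 1);
}
---------------------------------------------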

-Mike

2006-01-13 01:13:37

by Con Kolivas

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Saturday 31 December 2005 00:52, Paolo Ornati wrote:
> WAS: [SCHED] Totally WRONG prority calculation with specific test-case
> (since 2.6.10-bk12)
> http://lkml.org/lkml/2005/12/27/114/index.html
>
> On Wed, 28 Dec 2005 10:26:58 +1100
>
> Con Kolivas <[email protected]> wrote:
> > The issue is that the scheduler interactivity estimator is a state
> > machine and can be fooled to some degree, and a cpu intensive task that
> > just happens to sleep a little bit gets significantly better priority
> > than one that is fully cpu bound all the time. Reverting that change is
> > not a solution because it can still be fooled by the same process
> > sleeping lots for a few seconds or so at startup and then changing to the
> > cpu mostly-sleeping slightly behaviour. This "fluctuating" behaviour is
> > in my opinion worse which is why I removed it.
>
> Trying to find a "as simple as possible" test case for this problem
> (that I consider a BUG in priority calculation) I've come up with this
> very simple program:

Hi Paolo.

Can you try the following patch on 2.6.15 please? I'm interested in how
adversely this affects interactive performance as well as whether it helps
your test case.

Thanks,
Con



---
include/linux/sched.h | 9 +++++-
kernel/sched.c | 72 ++++++++++++++++++++++----------------------------
2 files changed, 41 insertions(+), 40 deletions(-)

Index: linux-2.6.15/include/linux/sched.h
===================================================================
--- linux-2.6.15.orig/include/linux/sched.h
+++ linux-2.6.15/include/linux/sched.h
@@ -683,6 +683,13 @@ static inline void prefetch_stack(struct
struct audit_context; /* See audit.c */
struct mempolicy;

+enum sleep_type {
+ SLEEP_NORMAL,
+ SLEEP_NONINTERACTIVE,
+ SLEEP_INTERACTIVE,
+ SLEEP_INTERRUPTED,
+};
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
struct thread_info *thread_info;
@@ -704,7 +711,7 @@ struct task_struct {
unsigned long sleep_avg;
unsigned long long timestamp, last_ran;
unsigned long long sched_time; /* sched_clock time spent running */
- int activated;
+ enum sleep_type sleep_type;

unsigned long policy;
cpumask_t cpus_allowed;
Index: linux-2.6.15/kernel/sched.c
===================================================================
--- linux-2.6.15.orig/kernel/sched.c
+++ linux-2.6.15/kernel/sched.c
@@ -751,31 +751,22 @@ static int recalc_task_prio(task_t *p, u
* prevent them suddenly becoming cpu hogs and starving
* other processes.
*/
- if (p->mm && p->activated != -1 &&
+ if (p->mm && p->sleep_type != SLEEP_NONINTERACTIVE &&
sleep_time > INTERACTIVE_SLEEP(p)) {
p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG -
DEF_TIMESLICE);
} else {
+
/*
* The lower the sleep avg a task has the more
- * rapidly it will rise with sleep time.
+ * rapidly it will rise with sleep time. This enables
+ * tasks to rapidly recover to a low latency priority.
+ * If a task was sleeping with the noninteractive
+ * label do not apply this non-linear boost
*/
- sleep_time *= (MAX_BONUS - CURRENT_BONUS(p)) ? : 1;
-
- /*
- * Tasks waking from uninterruptible sleep are
- * limited in their sleep_avg rise as they
- * are likely to be waiting on I/O
- */
- if (p->activated == -1 && p->mm) {
- if (p->sleep_avg >= INTERACTIVE_SLEEP(p))
- sleep_time = 0;
- else if (p->sleep_avg + sleep_time >=
- INTERACTIVE_SLEEP(p)) {
- p->sleep_avg = INTERACTIVE_SLEEP(p);
- sleep_time = 0;
- }
- }
+ if (p->sleep_type != SLEEP_NONINTERACTIVE || p->mm)
+ sleep_time *=
+ (MAX_BONUS - CURRENT_BONUS(p)) ? : 1;

/*
* This code gives a bonus to interactive tasks.
@@ -818,11 +809,7 @@ static void activate_task(task_t *p, run
if (!rt_task(p))
p->prio = recalc_task_prio(p, now);

- /*
- * This checks to make sure it's not an uninterruptible task
- * that is now waking up.
- */
- if (!p->activated) {
+ if (p->sleep_type != SLEEP_NONINTERACTIVE) {
/*
* Tasks which were woken up by interrupts (ie. hw events)
* are most likely of interactive nature. So we give them
@@ -831,13 +818,13 @@ static void activate_task(task_t *p, run
* on a CPU, first time around:
*/
if (in_interrupt())
- p->activated = 2;
+ p->sleep_type = SLEEP_INTERRUPTED;
else {
/*
* Normal first-time wakeups get a credit too for
* on-runqueue time, but it will be weighted down:
*/
- p->activated = 1;
+ p->sleep_type = SLEEP_INTERACTIVE;
}
}
p->timestamp = now;
@@ -1356,22 +1343,23 @@ out_activate:
if (old_state == TASK_UNINTERRUPTIBLE) {
rq->nr_uninterruptible--;
/*
- * Tasks on involuntary sleep don't earn
- * sleep_avg beyond just interactive state.
+ * Tasks waking from uninterruptible sleep are likely
+ * to be sleeping involuntarily on I/O and are otherwise
+ * cpu bound so label them as noninteractive.
*/
- p->activated = -1;
- }
+ p->sleep_type = SLEEP_NONINTERACTIVE;
+ } else

/*
* Tasks that have marked their sleep as noninteractive get
- * woken up without updating their sleep average. (i.e. their
- * sleep is handled in a priority-neutral manner, no priority
- * boost and no penalty.)
+ * woken up with their sleep average not weighted in an
+ * interactive way.
*/
- if (old_state & TASK_NONINTERACTIVE)
- __activate_task(p, rq);
- else
- activate_task(p, rq, cpu == this_cpu);
+ if (old_state & TASK_NONINTERACTIVE)
+ p->sleep_type = SLEEP_NONINTERACTIVE;
+
+
+ activate_task(p, rq, cpu == this_cpu);
/*
* Sync wakeups (i.e. those types of wakeups where the waker
* has indicated that it will leave the CPU in short order)
@@ -2938,6 +2926,12 @@ EXPORT_SYMBOL(sub_preempt_count);

#endif

+static inline int interactive_sleep(enum sleep_type sleep_type)
+{
+ return (sleep_type == SLEEP_INTERACTIVE ||
+ sleep_type == SLEEP_INTERRUPTED);
+}
+
/*
* schedule() is the main scheduler function.
*/
@@ -3063,12 +3057,12 @@ go_idle:
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);

- if (!rt_task(next) && next->activated > 0) {
+ if (!rt_task(next) && interactive_sleep(next->sleep_type)) {
unsigned long long delta = now - next->timestamp;
if (unlikely((long long)(now - next->timestamp) < 0))
delta = 0;

- if (next->activated == 1)
+ if (next->sleep_type == SLEEP_INTERACTIVE)
delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;

array = next->array;
@@ -3081,7 +3075,7 @@ go_idle:
} else
requeue_task(next, array);
}
- next->activated = 0;
+ next->sleep_type = SLEEP_NORMAL;
switch_tasks:
if (next == rq->idle)
schedstat_inc(rq, sched_goidle);


Attachments:
(No filename) (6.57 kB)
(No filename) (189.00 B)

2006-01-13 01:32:20

by Con Kolivas

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Friday 13 January 2006 12:13, Con Kolivas wrote:
> On Saturday 31 December 2005 00:52, Paolo Ornati wrote:
> > WAS: [SCHED] Totally WRONG prority calculation with specific test-case
> > (since 2.6.10-bk12)
> > http://lkml.org/lkml/2005/12/27/114/index.html
> >
> > On Wed, 28 Dec 2005 10:26:58 +1100
> >
> > Con Kolivas <[email protected]> wrote:
> > > The issue is that the scheduler interactivity estimator is a state
> > > machine and can be fooled to some degree, and a cpu intensive task that
> > > just happens to sleep a little bit gets significantly better priority
> > > than one that is fully cpu bound all the time. Reverting that change is
> > > not a solution because it can still be fooled by the same process
> > > sleeping lots for a few seconds or so at startup and then changing to
> > > the cpu mostly-sleeping slightly behaviour. This "fluctuating"
> > > behaviour is in my opinion worse which is why I removed it.
> >
> > Trying to find a "as simple as possible" test case for this problem
> > (that I consider a BUG in priority calculation) I've come up with this
> > very simple program:
>
> Hi Paolo.
>
> Can you try the following patch on 2.6.15 please? I'm interested in how
> adversely this affects interactive performance as well as whether it helps
> your test case.

I should make it clear. This patch _will_ adversely affect interactivity
because your test case desires that I/O bound tasks get higher priority, and
this patch will do that. This means that I/O bound tasks will be more
noticeable now. The question is how much we trade off one for the other.
We almost certainly are biased a little too much on the interactive side on
the mainline kernel at the moment.

Cheers,
Con

2006-01-13 10:45:40

by Paolo Ornati

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Fri, 13 Jan 2006 12:13:11 +1100
Con Kolivas <[email protected]> wrote:

> Can you try the following patch on 2.6.15 please? I'm interested in how
> adversely this affects interactive performance as well as whether it helps
> your test case.

"./a.out 5000 & ./a.out 5237 & ./a.out 5331 &"
"mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null
bs=1M count=256; umount space/"

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5445 paolo 16 0 2396 288 228 R 34.8 0.1 0:05.84 a.out
5446 paolo 15 0 2396 288 228 S 32.8 0.1 0:05.53 a.out
5444 paolo 16 0 2392 288 228 R 31.3 0.1 0:05.99 a.out
5443 paolo 16 0 10416 1104 848 R 0.2 0.2 0:00.01 top
5451 paolo 15 0 4948 1468 372 D 0.2 0.3 0:00.01 dd

DD test takes ~20 s (instead of 8s).

As you can see DD priority is now very good (15) but it still suffers
because my test programs also get a good priority (15/16).


Things are BETTER on the real test case (transcode): this is because
transcode usually gets priority 16 and "dd" gets 15... so dd is quite
happy.

BUT what is STRANGE is this: usually transcode is stuck at priority 16
using about 88% of the CPU, but sometimes (don't know how to reproduce
it) its priority grows up to 25 and then stays at 25.

When transcode priority is 25 the DD test is obviously happy: in
particular 2 things can happen (this is expected because I've observed
this thing before):

1) the priority of transcode stays at 25 (when the file transcode is
reading from, through pipes, IS cached).

2) CPU usage and priority of transcode go down (the file transcode is
reading from ISN'T cached and DD's massive disk usage interferes with
this reading). When DD finishes, transcode priority goes back to 25.

--
Paolo Ornati
Linux 2.6.15-kolivasPatch on x86_64

2006-01-13 10:52:15

by Con Kolivas

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Friday 13 January 2006 21:46, Paolo Ornati wrote:
> On Fri, 13 Jan 2006 12:13:11 +1100
>
> Con Kolivas <[email protected]> wrote:
> > Can you try the following patch on 2.6.15 please? I'm interested in how
> > adversely this affects interactive performance as well as whether it
> > helps your test case.
>
> "./a.out 5000 & ./a.out 5237 & ./a.out 5331 &"
> "mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null
> bs=1M count=256; umount space/"
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5445 paolo 16 0 2396 288 228 R 34.8 0.1 0:05.84 a.out
> 5446 paolo 15 0 2396 288 228 S 32.8 0.1 0:05.53 a.out
> 5444 paolo 16 0 2392 288 228 R 31.3 0.1 0:05.99 a.out
> 5443 paolo 16 0 10416 1104 848 R 0.2 0.2 0:00.01 top
> 5451 paolo 15 0 4948 1468 372 D 0.2 0.3 0:00.01 dd
>
> DD test takes ~20 s (instead of 8s).
>
> As you can see DD priority is now very good (15) but it still suffers
> because also my test programs get good priority (15/16).
>
>
> Things are BETTER on the real test case (transcode): this is because
> transcode usually gets priority 16 and "dd" gets 15... so dd is quite
> happy.

This seems a reasonable compromise. In your "test app" case you are using
quirky code to reproduce the worst-case scenario. Given that with your quirky
setup you are (effectively) using 3 cpu hogs, slowing down dd from 8s to
20s seems an appropriate slowdown (as opposed to the many minutes you were
getting previously).

See my followup patches that I have posted following "[PATCH 0/5] sched -
interactivity updates". The first 3 patches are what you tested. These
patches are being put up for testing hopefully in -mm.

> BUT what is STRANGE is this: usually transcode is stuck to priority 16
> using about 88% of the CPU, but sometimes (don't know how to reproduce
> it) its priority grows up to 25 and then stay to 25.
>
> When transcode priority is 25 the DD test is obviously happy: in
> particular 2 things can happen (this is expected because I've observed
> this thing before):
>
> 1) priority of transcode stay to 25 (when the file transcode is
> reading from, through pipes, IS cached).
>
> 2) CPU usage and priority of transcode go down (the file transcode is
> reading from ISN'T cached and DD massive disk usage interferes with
> this reading). When DD finish trancode priority go back to 25.

I suspect this is entirely dependent on the balance between time spent reading
on disk, waiting on pipe and so on.

Thanks for your test case and testing!

Cheers,
Con


Attachments:
(No filename) (2.54 kB)
(No filename) (189.00 B)

2006-01-13 13:02:06

by Mike Galbraith

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote:
>On Friday 13 January 2006 21:46, Paolo Ornati wrote:
> > On Fri, 13 Jan 2006 12:13:11 +1100
> >
> > Con Kolivas <[email protected]> wrote:
> > > Can you try the following patch on 2.6.15 please? I'm interested in how
> > > adversely this affects interactive performance as well as whether it
> > > helps your test case.
> >
> > "./a.out 5000 & ./a.out 5237 & ./a.out 5331 &"
> > "mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null
> > bs=1M count=256; umount space/"
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 5445 paolo 16 0 2396 288 228 R 34.8 0.1 0:05.84 a.out
> > 5446 paolo 15 0 2396 288 228 S 32.8 0.1 0:05.53 a.out
> > 5444 paolo 16 0 2392 288 228 R 31.3 0.1 0:05.99 a.out
> > 5443 paolo 16 0 10416 1104 848 R 0.2 0.2 0:00.01 top
> > 5451 paolo 15 0 4948 1468 372 D 0.2 0.3 0:00.01 dd
> >
> > DD test takes ~20 s (instead of 8s).
> >
> > As you can see DD priority is now very good (15) but it still suffers
> > because also my test programs get good priority (15/16).
> >
> >
> > Things are BETTER on the real test case (transcode): this is because
> > transcode usually gets priority 16 and "dd" gets 15... so dd is quite
> > happy.
>
>This seems a reasonable compromise. In your "test app" case you are using
>quirky code to reproduce the worst case scenario. Given that with your quirky
>setup you are using 3 cpu hogs (effectively) then slowing down dd from 8s to
>20s seems an appropriate slowdown (as opposed to the many minutes you were
>getting previously).

I'm sorry, but I heartily disagree. It's not a quirky setup, it's just
code that exposes a weakness, just like thud, starve, irman and
irman2. Selectively bumping dd up into the upper tier won't do one bit of
good for the other things that get trivially starved to death. On a more
positive note, I agree that dd should not be punished for waiting on disk.

>See my followup patches that I have posted following "[PATCH 0/5] sched -
>interactivity updates". The first 3 patches are what you tested. These
>patches are being put up for testing hopefully in -mm.

Then the (buggy) version of my simple throttling patch will need to come
out. (which is OK, I have a debugged potent++ version)

> > BUT what is STRANGE is this: usually transcode is stuck to priority 16
> > using about 88% of the CPU, but sometimes (don't know how to reproduce
> > it) its priority grows up to 25 and then stay to 25.
> >
> > When transcode priority is 25 the DD test is obviously happy: in
> > particular 2 things can happen (this is expected because I've observed
> > this thing before):
> >
> > 1) priority of transcode stay to 25 (when the file transcode is
> > reading from, through pipes, IS cached).
> >
> > 2) CPU usage and priority of transcode go down (the file transcode is
> > reading from ISN'T cached and DD massive disk usage interferes with
> > this reading). When DD finish trancode priority go back to 25.
>
>I suspect this is entirely dependent on the balance between time spent
>reading
>on disk, waiting on pipe and so on.

Grumble. Pipe sleep. That's another pet peeve of mine. Sleep is sleep
whether it's spelled interruptible_pipe or uninterruptible_semaphore.

-Mike

2006-01-13 14:35:12

by Con Kolivas

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Saturday 14 January 2006 00:01, Mike Galbraith wrote:
> At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote:
> >See my followup patches that I have posted following "[PATCH 0/5] sched -
> >interactivity updates". The first 3 patches are what you tested. These
> >patches are being put up for testing hopefully in -mm.
>
> Then the (buggy) version of my simple throttling patch will need to come
> out. (which is OK, I have a debugged potent++ version)

Your code need not be mutually exclusive with mine. I've simply damped the
current behaviour. Your sanity throttling is a good idea.

Con

2006-01-13 16:15:22

by Mike Galbraith

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

At 01:34 AM 1/14/2006 +1100, Con Kolivas wrote:
>On Saturday 14 January 2006 00:01, Mike Galbraith wrote:
> > At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote:
> > >See my followup patches that I have posted following "[PATCH 0/5] sched -
> > >interactivity updates". The first 3 patches are what you tested. These
> > >patches are being put up for testing hopefully in -mm.
> >
> > Then the (buggy) version of my simple throttling patch will need to come
> > out. (which is OK, I have a debugged potent++ version)
>
>Your code need not be mutually exclusive with mine. I've simply damped the
>current behaviour. Your sanity throttling is a good idea.

I didn't mean to imply that they're mutually exclusive, and after doing
some testing, I concluded that it (or something like it) is definitely
still needed. The version that's in mm2 _is_ buggy however, so ripping it
back out wouldn't hurt my delicate little feelings one bit. In fact, it
would give me some more time to instrument and test integration with your
changes. (Which I think are good btw because they remove what I considered
to be warts; the pipe and uninterruptible sleep barriers. Um... try irman2
now... pure evilness)

-Mike

2006-01-14 02:06:13

by Con Kolivas

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Saturday 14 January 2006 03:15, Mike Galbraith wrote:
> At 01:34 AM 1/14/2006 +1100, Con Kolivas wrote:
> >On Saturday 14 January 2006 00:01, Mike Galbraith wrote:
> > > At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote:
> > > >See my followup patches that I have posted following "[PATCH 0/5]
> > > > sched - interactivity updates". The first 3 patches are what you
> > > > tested. These patches are being put up for testing hopefully in -mm.
> > >
> > > Then the (buggy) version of my simple throttling patch will need to
> > > come out. (which is OK, I have a debugged potent++ version)
> >
> >Your code need not be mutually exclusive with mine. I've simply damped the
> >current behaviour. Your sanity throttling is a good idea.
>
> I didn't mean to imply that they're mutually exclusive, and after doing
> some testing, I concluded that it (or something like it) is definitely
> still needed. The version that's in mm2 _is_ buggy however, so ripping it
> back out wouldn't hurt my delicate little feelings one bit. In fact, it
> would give me some more time to instrument and test integration with your
> changes.

Ok I've communicated this to Andrew (cc'ed here too) so he should remove your
patch pending a new version from you.

> (Which I think are good btw because they remove what I considered
> to be warts; the pipe and uninterruptible sleep barriers.

Yes, I felt your abuse wrt these in an earlier email...

> Um... try irman2
> now... pure evilness)

Hrm I've been using staircase which is immune for so long I'd all but
forgotten about this test case. Looking at your code I assume your changes
should help this?

Con

2006-01-14 02:56:41

by Mike Galbraith

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

At 01:05 PM 1/14/2006 +1100, Con Kolivas wrote:
>On Saturday 14 January 2006 03:15, Mike Galbraith wrote:
>
> > Um... try irman2
> > now... pure evilness)
>
>Hrm I've been using staircase which is immune for so long I'd all but
>forgotten about this test case. Looking at your code I assume your changes
>should help this?

Yes. How much it helps very much depends on how strictly I try to enforce
it. In my experimental tree, I have four stages of throttling: 1, a
threshold to begin trying to consume the difference between measured
slice_avg and sleep_avg (kid gloves); 2, begin treating all new sleep as
noninteractive (stern talking to); 3, cut off new sleep entirely (you're
grounded); and 4, start using slice_avg instead of the out-of-balance
sleep_avg for the priority calculation (um, bitch-slap?). Levels 1 and 2
won't stop irman2, 3 will, and especially 4 will.

These are all /proc settings at the moment, so I can set my starvation
pain threshold from super duper desktop (all off) to just as fair as a
running slice completion time average can possibly make it (all at 1ns
differential), and anywhere in between.
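
To make those four stages concrete, a tiny userspace model of the escalation
might look like the following. The stage names, thresholds and credit rules
here are invented purely for illustration; the real logic lives in Mike's
experimental tree and is tuned through the /proc settings he mentions.

#include <stdio.h>

/* Illustrative only: invented stage names and credit rules. */
enum stage { STAGE_OFF, STAGE_CONSUME, STAGE_STERN, STAGE_GROUNDED, STAGE_SLAP };

/*
 * How much of a new sleep period would be credited to sleep_avg at each
 * stage (all values in arbitrary "sleep units").
 */
static long sleep_credit(enum stage s, long sleep, long sleep_avg, long slice_avg)
{
        long deficit = sleep_avg - slice_avg;   /* how far out of balance we are */

        switch (s) {
        case STAGE_CONSUME:     /* 1: kid gloves - work off the imbalance first */
                return sleep > deficit ? sleep - deficit : 0;
        case STAGE_STERN:       /* 2: stern talking to - new sleep is noninteractive */
                return sleep / 2;
        case STAGE_GROUNDED:    /* 3: you're grounded - new sleep earns nothing */
        case STAGE_SLAP:        /* 4: bitch-slap - prio comes from slice_avg instead */
                return 0;
        default:
                return sleep;
        }
}

int main(void)
{
        /* A task 50 units out of balance sleeps for 100 units. */
        printf("stage 1 credit: %ld\n", sleep_credit(STAGE_CONSUME, 100, 650, 600));
        printf("stage 3 credit: %ld\n", sleep_credit(STAGE_GROUNDED, 100, 650, 600));
        return 0;
}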

-Mike

2006-01-27 16:55:46

by Con Kolivas

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Saturday 14 January 2006 03:15, Mike Galbraith wrote:
> At 01:34 AM 1/14/2006 +1100, Con Kolivas wrote:
> >On Saturday 14 January 2006 00:01, Mike Galbraith wrote:
> > > At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote:
> > > >See my followup patches that I have posted following "[PATCH 0/5]
> > > > sched - interactivity updates". The first 3 patches are what you
> > > > tested. These patches are being put up for testing hopefully in -mm.
> > >
> > > Then the (buggy) version of my simple throttling patch will need to
> > > come out. (which is OK, I have a debugged potent++ version)
> >
> >Your code need not be mutually exclusive with mine. I've simply damped the
> >current behaviour. Your sanity throttling is a good idea.
>
> I didn't mean to imply that they're mutually exclusive, and after doing
> some testing, I concluded that it (or something like it) is definitely
> still needed. The version that's in mm2 _is_ buggy however, so ripping it
> back out wouldn't hurt my delicate little feelings one bit. In fact, it
> would give me some more time to instrument and test integration with your
> changes.

Ok I've communicated this to Andrew (cc'ed here too) so he should remove your
patch pending a new version from you.

> (Which I think are good btw because they remove what I considered
> to be warts; the pipe and uninterruptible sleep barriers.

Yes, I felt your abuse wrt these in an earlier email...

> Um... try irman2
> now... pure evilness)

Hrm I've been using staircase which is immune for so long I'd all but
forgotten about this test case. Looking at your code I assume your changes
should help this?

Con

2006-01-27 20:04:19

by Mike Galbraith

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Fri, 2006-01-27 at 17:57 +0100, Con Kolivas wrote:
> On Saturday 14 January 2006 03:15, Mike Galbraith wrote:
> > At 01:34 AM 1/14/2006 +1100, Con Kolivas wrote:
> > >On Saturday 14 January 2006 00:01, Mike Galbraith wrote:
> > > > At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote:
> > > > >See my followup patches that I have posted following "[PATCH 0/5]
> > > > > sched - interactivity updates". The first 3 patches are what you
> > > > > tested. These patches are being put up for testing hopefully in -mm.
> > > >
> > > > Then the (buggy) version of my simple throttling patch will need to
> > > > come out. (which is OK, I have a debugged potent++ version)
> > >
> > >Your code need not be mutually exclusive with mine. I've simply damped the
> > >current behaviour. Your sanity throttling is a good idea.
> >
> > I didn't mean to imply that they're mutually exclusive, and after doing
> > some testing, I concluded that it (or something like it) is definitely
> > still needed. The version that's in mm2 _is_ buggy however, so ripping it
> > back out wouldn't hurt my delicate little feelings one bit. In fact, it
> > would give me some more time to instrument and test integration with your
> > changes.
>
> Ok I've communicated this to Andrew (cc'ed here too) so he should remove your
> patch pending a new version from you.

(Ok, it took a while what with setting up a new test box, testing this
and that, RL etc etc.)

What do you think of the below as an evaluation patch? It leaves the
bits I'd really like to change (INTERACTIVE_SLEEP() for one), so it can
be switched on and off for easy comparison and regression testing.

I really didn't want to add more to the task struct, but after trying
different things, a timeout was the most effective means of keeping the
nice burst aspect of the interactivity logic but still making sure that a
burst doesn't turn into starvation.

The workings are dirt simple just as before. The goal is to keep
sleep_avg and slice_avg balanced. When an imbalance starts, immediately
cut off interactive bonus points. If the imbalance doesn't correct
itself through normal sleep_avg usage, we'll soon hit the (1 dynamic
prio) trigger point, which starts a countdown toward active
intervention. The default setting is that a task can run at higher
dynamic priority than its cpu usage can justify for 5 seconds. After
that, we start trying to work off the deficit, and if we don't succeed
within another second (ie it was a big deficit), we demote the offender
to the rank its cpu usage indicates.
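
Sketched as plain arithmetic (throttle_secs = 5 and HZ = 1000 are assumptions
for illustration; the patch below does the same jiffies math via its
THROTTLE_ENGAGE/SOFT/HARD tests):

#include <stdio.h>

#define HZ              1000    /* assumed tick rate, for illustration */
#define THROTTLE_SECS   5       /* default sched_throttle_secs         */

int main(void)
{
        /*
         * The interactive bonus is cut as soon as sleep_avg exceeds
         * slice_avg; the countdown below starts once the imbalance
         * reaches one full dynamic priority level.
         */
        unsigned long jiffies = 0;
        unsigned long stamp = jiffies + THROTTLE_SECS * HZ;

        printf("trigger point hit at jiffy             %lu\n", jiffies);
        printf("soft throttle (work off deficit) after %lu\n", stamp);
        printf("hard throttle (prio from usage)  after %lu\n", stamp + HZ);
        return 0;
}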

The strategy works well enough to take the wind out of irman2's sails,
and interactive tasks can still do a nice reasonable burst of activity
without being evicted. Down side to starvation control is that X is
sometimes a serious cpu user, and _can_ end up in the expired array (not
nice under load). I personally don't think that's a show stopper
though... all you have to do is tell the scheduler what it already
noticed, that X is a piggy, but an OK piggy, by renicing it. It becomes
immune from active throttling, and all is well. I know that's not going
to be popular, but you just can't let X have free rein without leaving
the barn door wide open. (maybe that switch should stay since the
majority of boxen are workstations, and default to off?).

'nuff words, patch follows.

-Mike

--- 2.6.16-rc1-mm3/include/linux/sched.h.org 2006-01-27 11:28:20.000000000 +0100
+++ 2.6.16-rc1-mm3/include/linux/sched.h 2006-01-27 11:54:48.000000000 +0100
@@ -722,6 +722,7 @@
unsigned short ioprio;

unsigned long sleep_avg;
+ unsigned long slice_avg, last_slice, throttle_stamp;
unsigned long long timestamp, last_ran;
unsigned long long sched_time; /* sched_clock time spent running */
enum sleep_type sleep_type;
--- 2.6.16-rc1-mm3/include/linux/sysctl.h.org 2006-01-27 11:28:20.000000000 +0100
+++ 2.6.16-rc1-mm3/include/linux/sysctl.h 2006-01-27 11:57:13.000000000 +0100
@@ -147,6 +147,8 @@
KERN_SETUID_DUMPABLE=69, /* int: behaviour of dumps for setuid core */
KERN_SPIN_RETRY=70, /* int: number of spinlock retries */
KERN_ACPI_VIDEO_FLAGS=71, /* int: flags for setting up video after ACPI sleep */
+ KERN_SCHED_THROTTLE1=72, /* int: sleep_avg throttling enabled */
+ KERN_SCHED_THROTTLE2=73, /* int: throttling grace period in secs */
};


--- 2.6.16-rc1-mm3/kernel/sched.c.org 2006-01-27 11:28:20.000000000 +0100
+++ 2.6.16-rc1-mm3/kernel/sched.c 2006-01-27 12:08:59.000000000 +0100
@@ -138,6 +138,55 @@
(NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \
MAX_SLEEP_AVG)

+/*
+ * Interactivity boost can lead to serious starvation problems if the
+ * task being boosted turns out to be a cpu hog. To combat this, we
+ * compute a running slice_avg, which is the sane upper limit for the
+ * task's sleep_avg. If an 'interactive' task begins burning cpu, its
+ * slice_avg will decay, making it visible as a problem so corrective
+ * measures can be applied. RT tasks and kernel threads are exempt from
+ * throttling - reniced tasks are only subject to mild throttling.
+ *
+ * /proc/sys/kernel tunables.
+ *
+ * sched_throttle: enable throttling logic.
+ * sched_throttle_secs: throttling grace period in seconds.
+ */
+
+int throttle = 1;
+int throttle_secs = 5;
+int throttle_secs_max = 10;
+
+#define THROTTLE_BASE \
+ (10 * (NS_MAX_SLEEP_AVG / 100))
+
+/* Disable interactive task bias logic. */
+#define THROTTLE_BIAS(p) \
+ (throttle && (p)->mm && (p)->sleep_avg > (p)->slice_avg)
+
+/* Begin adjusting sleep_avg consumption to match slice_avg. */
+#define THROTTLE_SOFT(p) \
+ ((p)->throttle_stamp && time_after(jiffies, (p)->throttle_stamp) && \
+ (p)->sleep_avg >= (p)->slice_avg + THROTTLE_BASE)
+
+/* Begin adjusting priority to match slice_avg. */
+#define THROTTLE_HARD(p) \
+ ((p)->throttle_stamp && time_after(jiffies, (p)->throttle_stamp + HZ) && \
+ (p)->sleep_avg >= (p)->slice_avg + THROTTLE_BASE)
+
+#define THROTTLE_ENGAGE(p) (throttle && !(p)->throttle_stamp && \
+ (p)->mm && PRIO_TO_NICE((p)->static_prio) >= 0 && \
+ (p)->sleep_avg >= (p)->slice_avg + THROTTLE_BASE)
+
+#define THROTTLE_DISENGAGE(p) ((p)->throttle_stamp && \
+ (PRIO_TO_NICE((p)->static_prio) < 0 || !THROTTLE_BIAS((p)) || \
+ time_after(jiffies, (p)->last_slice + (throttle_secs * HZ))))
+
+#define CURRENT_COST(p) \
+ (NS_TO_JIFFIES(THROTTLE_SOFT((p)) ? \
+ min((p)->sleep_avg, (p)->slice_avg) : NS_MAX_SLEEP_AVG) * \
+ MAX_BONUS / MAX_SLEEP_AVG)
+
#define GRANULARITY (10 * HZ / 1000 ? : 1)

#ifdef CONFIG_SMP
@@ -162,6 +211,10 @@
(JIFFIES_TO_NS(MAX_SLEEP_AVG * \
(MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))

+#define INTERACTIVE_SLEEP_AVG(p) \
+ (JIFFIES_TO_NS(MAX_SLEEP_AVG * \
+ (MAX_BONUS / 2 + DELTA((p))) / MAX_BONUS))
+
#define TASK_PREEMPTS_CURR(p, rq) \
((p)->prio < (rq)->curr->prio)

@@ -669,6 +722,8 @@
return p->prio;

bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
+ if (unlikely(THROTTLE_HARD(p)))
+ bonus = CURRENT_COST(p) - MAX_BONUS / 2;

prio = p->static_prio - bonus;
if (prio < MAX_RT_PRIO)
@@ -803,8 +858,15 @@
if (p->mm && sleep_time > INTERACTIVE_SLEEP(p)) {
unsigned long ceiling;

- ceiling = JIFFIES_TO_NS(MAX_SLEEP_AVG -
- DEF_TIMESLICE);
+ if (!throttle)
+ ceiling = JIFFIES_TO_NS(MAX_SLEEP_AVG -
+ DEF_TIMESLICE);
+ else {
+ ceiling = INTERACTIVE_SLEEP_AVG(p);
+ ceiling += JIFFIES_TO_NS(DEF_TIMESLICE);
+ if (p->slice_avg < ceiling)
+ ceiling = p->slice_avg;
+ }
if (p->sleep_avg < ceiling)
p->sleep_avg = ceiling;
} else {
@@ -1367,7 +1429,10 @@

out_activate:
#endif /* CONFIG_SMP */
- if (old_state == TASK_UNINTERRUPTIBLE) {
+ if (unlikely(THROTTLE_DISENGAGE(p)))
+ p->throttle_stamp = 0;
+
+ if (old_state & TASK_UNINTERRUPTIBLE) {
rq->nr_uninterruptible--;
/*
* Tasks waking from uninterruptible sleep are likely
@@ -1382,7 +1447,7 @@
* woken up with their sleep average not weighted in an
* interactive way.
*/
- if (old_state & TASK_NONINTERACTIVE)
+ if (old_state & TASK_NONINTERACTIVE || THROTTLE_BIAS(p))
p->sleep_type = SLEEP_NONINTERACTIVE;


@@ -1471,6 +1536,7 @@
p->first_time_slice = 1;
current->time_slice >>= 1;
p->timestamp = sched_clock();
+ p->throttle_stamp = 0;
if (unlikely(!current->time_slice)) {
/*
* This case is rare, it happens when the parent has only
@@ -1510,6 +1576,9 @@
*/
p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
+ p->slice_avg = NS_MAX_SLEEP_AVG;
+ p->last_slice = jiffies;
+ p->throttle_stamp = 0;

p->prio = effective_prio(p);

@@ -1584,7 +1653,7 @@
* the sleep_avg of the parent as well.
*/
rq = task_rq_lock(p->parent, &flags);
- if (p->first_time_slice && task_cpu(p) == task_cpu(p->parent)) {
+ if ((int)p->first_time_slice > 0 && task_cpu(p) == task_cpu(p->parent)) {
p->parent->time_slice += p->time_slice;
if (unlikely(p->parent->time_slice > task_timeslice(p)))
p->parent->time_slice = task_timeslice(p);
@@ -2665,6 +2734,37 @@
}

/*
+ * Calculate a task's average slice usage rate in terms of sleep_avg.
+ * @p: task for which slice_avg should be computed.
+ * @time_slice: task's current timeslice.
+ */
+static unsigned long task_slice_avg(task_t *p, int time_slice)
+{
+ unsigned long slice_avg = p->slice_avg;
+ int idle, ticks = jiffies - p->last_slice;
+ int w = HZ / DEF_TIMESLICE;
+
+ if (ticks < time_slice)
+ ticks = time_slice;
+ idle = 100 - (100 * time_slice / ticks);
+ slice_avg /= NS_MAX_SLEEP_AVG / 100;
+
+ /*
+ * If the task is lowering its cpu usage, speed up the
+ * effect on slice_avg so we don't over-throttle.
+ */
+ if (idle > slice_avg + 10) {
+ w -= idle / 10;
+ if (!w)
+ w = 1;
+ }
+ slice_avg = (w * (slice_avg ? : 1) + idle) / (w + 1);
+ slice_avg *= NS_MAX_SLEEP_AVG / 100;
+
+ return slice_avg;
+}
+
+/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*
@@ -2710,19 +2810,33 @@
if ((p->policy == SCHED_RR) && !--p->time_slice) {
p->time_slice = task_timeslice(p);
p->first_time_slice = 0;
+ p->slice_avg = task_slice_avg(p, p->time_slice);
+ p->last_slice = jiffies;
set_tsk_need_resched(p);

/* put it at the end of the queue: */
requeue_task(p, rq->active);
}
+ if (unlikely(p->throttle_stamp))
+ p->throttle_stamp = 0;
goto out_unlock;
}
if (!--p->time_slice) {
dequeue_task(p, rq->active);
set_tsk_need_resched(p);
- p->prio = effective_prio(p);
p->time_slice = task_timeslice(p);
+ p->slice_avg = task_slice_avg(p, p->time_slice);
+ p->prio = effective_prio(p);
p->first_time_slice = 0;
+ p->last_slice = jiffies;
+
+ /*
+ * Check to see if we should start/stop throttling;
+ */
+ if (THROTTLE_ENGAGE(p))
+ p->throttle_stamp = jiffies + (throttle_secs * HZ);
+ else if (THROTTLE_DISENGAGE(p))
+ p->throttle_stamp = 0;

if (!rq->expired_timestamp)
rq->expired_timestamp = jiffies;
@@ -3033,7 +3147,7 @@
* Tasks charged proportionately less run_time at high sleep_avg to
* delay them losing their interactive status
*/
- run_time /= (CURRENT_BONUS(prev) ? : 1);
+ run_time /= (CURRENT_COST(prev) ? : 1);

spin_lock_irq(&rq->lock);

@@ -3047,7 +3161,7 @@
unlikely(signal_pending(prev))))
prev->state = TASK_RUNNING;
else {
- if (prev->state == TASK_UNINTERRUPTIBLE)
+ if (prev->state & TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
deactivate_task(prev, rq);
}
@@ -3134,6 +3248,14 @@
prev->sleep_avg = 0;
prev->timestamp = prev->last_ran = now;

+ /*
+ * Mark the beginning of a new slice in this mostly unused space.
+ */
+ if (unlikely(!next->first_time_slice)) {
+ next->first_time_slice = ~0U;
+ next->last_slice = jiffies;
+ }
+
sched_info_switch(prev, next);
if (likely(prev != next)) {
next->timestamp = now;
--- 2.6.16-rc1-mm3/kernel/sysctl.c.org 2006-01-27 11:28:20.000000000 +0100
+++ 2.6.16-rc1-mm3/kernel/sysctl.c 2006-01-27 11:58:30.000000000 +0100
@@ -69,6 +69,9 @@
extern int pid_max_min, pid_max_max;
extern int sysctl_drop_caches;
extern int percpu_pagelist_fraction;
+extern int throttle;
+extern int throttle_secs;
+extern int throttle_secs_max;

#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
int unknown_nmi_panic;
@@ -224,6 +227,12 @@
{ .ctl_name = 0 }
};

+/* Constants for minimum and maximum testing in vm_table and
+ * kern_table. We use these as one-element integer vectors. */
+static int zero;
+static int one = 1;
+static int one_hundred = 100;
+
static ctl_table kern_table[] = {
{
.ctl_name = KERN_OSTYPE,
@@ -666,15 +675,31 @@
.proc_handler = &proc_dointvec,
},
#endif
+ {
+ .ctl_name = KERN_SCHED_THROTTLE1,
+ .procname = "sched_throttle",
+ .data = &throttle,
+ .maxlen = sizeof (int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+ {
+ .ctl_name = KERN_SCHED_THROTTLE2,
+ .procname = "sched_throttle_secs",
+ .data = &throttle_secs,
+ .maxlen = sizeof (int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ .extra2 = &throttle_secs_max,
+ },
{ .ctl_name = 0 }
};

-/* Constants for minimum and maximum testing in vm_table.
- We use these as one-element integer vectors. */
-static int zero;
-static int one_hundred = 100;
-
-
static ctl_table vm_table[] = {
{
.ctl_name = VM_OVERCOMMIT_MEMORY,
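
For reference, a small userspace model of the task_slice_avg() arithmetic
above, worked in plain percent-idle units. HZ = 1000 and DEF_TIMESLICE = 100
jiffies are assumed values for illustration; the real function also scales to
and from NS_MAX_SLEEP_AVG.

#include <stdio.h>

#define HZ              1000    /* assumed                    */
#define DEF_TIMESLICE   100     /* assumed, in jiffies        */

/* slice_avg and return value: percent of wall time spent off the cpu. */
static int slice_avg_update(int slice_avg, int time_slice, int ticks)
{
        int w = HZ / DEF_TIMESLICE;             /* default weight: 10 */
        int idle;

        if (ticks < time_slice)
                ticks = time_slice;
        idle = 100 - (100 * time_slice / ticks);

        /* A task that is lowering its cpu usage is forgiven faster. */
        if (idle > slice_avg + 10) {
                w -= idle / 10;
                if (w < 1)
                        w = 1;
        }
        return (w * (slice_avg ? slice_avg : 1) + idle) / (w + 1);
}

int main(void)
{
        int avg = 100, i;       /* a fresh task starts out fully "idle" */

        /* A pure cpu hog burns five 100-tick slices back to back. */
        for (i = 0; i < 5; i++) {
                avg = slice_avg_update(avg, 100, 100);
                printf("hog, slice %d: slice_avg = %d%% idle\n", i + 1, avg);
        }
        /* Then it calms down: one 100-tick slice spread over 2000 jiffies. */
        printf("after calming down: slice_avg = %d%% idle\n",
               slice_avg_update(avg, 100, 2000));
        return 0;
}

Run back to back, the hog's slice_avg decays a step at a time toward its real
cpu usage; the final line shows the accelerated recovery path taken when the
task suddenly goes mostly idle.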


2006-01-27 23:19:19

by Con Kolivas

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Saturday 28 January 2006 07:06, Mike Galbraith wrote:
> What do you think of the below as an evaluation patch? It leaves the
> bits I'd really like to change (INTERACTIVE_SLEEP() for one), so it can
> be switched on and off for easy comparison and regression testing.
>
> I really didn't want to add more to the task struct, but after trying
> different things, a timeout was the most effective means of keeping the
> nice burst aspect of the interactivity logic but still making sure that a
> burst doesn't turn into starvation.
>
> The workings are dirt simple just as before. The goal is to keep
> sleep_avg and slice_avg balanced. When an imbalance starts, immediately
> cut off interactive bonus points. If the imbalance doesn't correct
> itself through normal sleep_avg usage, we'll soon hit the (1 dynamic
> prio) trigger point, which starts a countdown toward active
> intervention. The default setting is that a task can run at higher
> dynamic priority than its cpu usage can justify for 5 seconds. After
> that, we start trying to work off the deficit, and if we don't succeed
> within another second (ie it was a big deficit), we demote the offender
> to the rank its cpu usage indicates.
>
> The strategy works well enough to take the wind out of irman2's sails,
> and interactive tasks can still do a nice reasonable burst of activity
> without being evicted. Down side to starvation control is that X is
> sometimes a serious cpu user, and _can_ end up in the expired array (not
> nice under load). I personally don't think that's a show stopper
> though... all you have to do is tell the scheduler what it already
> noticed, that X is a piggy, but an OK piggy, by renicing it. It becomes
> immune from active throttling, and all is well. I know that's not going
> to be popular, but you just can't let X have free rein without leaving
> the barn door wide open. (maybe that switch should stay since the
> majority of boxen are workstations, and default to off?).

Sounds good but I have to disagree on the X renice thing. It's not that I have
a religious objection to renicing things. The problem is that our mainline
scheduler determines latency also by nice level. This means that if you
-renice a bursty cpu hog like X, then audio applications will fail unless
they too are reniced. Try it on your patch.

Con

2006-01-28 00:01:55

by Peter Williams

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

Con Kolivas wrote:
> On Saturday 28 January 2006 07:06, Mike Galbraith wrote:
>
>>What do you think of the below as an evaluation patch? It leaves the
>>bits I'd really like to change (INTERACTIVE_SLEEP() for one), so it can
>>be switched on and off for easy comparison and regression testing.
>>
>>I really didn't want to add more to the task struct, but after trying
>>different things, a timeout was the most effective means of keeping the
>>nice burst aspect of the interactivity logic but still making sure that a
>>burst doesn't turn into starvation.
>>
>>The workings are dirt simple just as before. The goal is to keep
>>sleep_avg and slice_avg balanced. When an imbalance starts, immediately
>>cut off interactive bonus points. If the imbalance doesn't correct
>>itself through normal sleep_avg usage, we'll soon hit the (1 dynamic
>>prio) trigger point, which starts a countdown toward active
>>intervention. The default setting is that a task can run at higher
>>dynamic priority than its cpu usage can justify for 5 seconds. After
>>that, we start trying to work off the deficit, and if we don't succeed
>>within another second (ie it was a big deficit), we demote the offender
>>to the rank its cpu usage indicates.
>>
>>The strategy works well enough to take the wind out of irman2's sails,
>>and interactive tasks can still do a nice reasonable burst of activity
>>without being evicted. Down side to starvation control is that X is
>>sometimes a serious cpu user, and _can_ end up in the expired array (not
>>nice under load). I personally don't think that's a show stopper
>>though... all you have to do is tell the scheduler what it already
>>noticed, that X is a piggy, but an OK piggy, by renicing it. It becomes
>>immune from active throttling, and all is well. I know that's not going
>>to be popular, but you just can't let X have free rein without leaving
>>the barn door wide open. (maybe that switch should stay since the
>>majority of boxen are workstations, and default to off?).
>
>
> Sounds good but I have to disagree on the X renice thing. It's not that I have
> a religious objection to renicing things. The problem is that our mainline
> scheduler determines latency also by nice level. This means that if you
> -renice a bursty cpu hog like X, then audio applications will fail unless
> they too are reniced. Try it on your patch.

On the other hand, if X were reniced it would be possible to make the
interactive bonus mechanism more strict on what it considers to be
interactive. As I see it, the main reason that non-interactive sleeps
get any consideration at all in determining whether a task is
interactive is to make sure that X is classified interactive.

One of the things that I've noticed about the X server (with
instrumented systems) is that its proportion of interruptible sleeps to
total sleeps goes down when it becomes CPU intensive. This is because
the high CPU usage is usually related to screen updates (e.g. try "ls -R
/" in an xterm or similar) and not keyboard or mouse input. This means
that during these periods the X server would lose most of its
interactive bonus and just rely on its niceness, allowing genuine
interactive tasks to pip it (provided that X's nice value is
sufficiently small, e.g. 5).

Of course, for this to work all of the instances where interruptible
sleeps aren't interactive would need to be labelled TASK_NONINTERACTIVE
which would be a big job.
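
For illustration, labelling such a wait site would look roughly like this
(the queue and condition names are hypothetical, and this is only a sketch of
the pattern, not a proposed change); the wakeup path in the patch above then
treats the sleep as SLEEP_NONINTERACTIVE instead of granting the full
interactive bonus:

#include <linux/sched.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(update_wait);    /* hypothetical queue     */
static int update_pending;                      /* hypothetical condition */

/* Interruptible sleep that should not earn an interactive bonus. */
static void wait_for_screen_update(void)
{
        DEFINE_WAIT(wait);

        prepare_to_wait(&update_wait, &wait,
                        TASK_INTERRUPTIBLE | TASK_NONINTERACTIVE);
        if (!update_pending)
                schedule();
        finish_wait(&update_wait, &wait);
}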

Anyway just my 2c worth.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-01-28 03:42:03

by Mike Galbraith

[permalink] [raw]
Subject: Re: [SCHED] wrong priority calc - SIMPLE test case

On Sat, 2006-01-28 at 10:18 +1100, Con Kolivas wrote:
> On Saturday 28 January 2006 07:06, Mike Galbraith wrote:
> >
> > The strategy works well enough to take the wind out of irman2's sails,
> > and interactive tasks can still do a nice reasonable burst of activity
> > without being evicted. Down side to starvation control is that X is
> > sometimes a serious cpu user, and _can_ end up in the expired array (not
> > nice under load). I personally don't think that's a show stopper
> > though... all you have to do is tell the scheduler what it already
> > noticed, that X is a piggy, but an OK piggy, by renicing it. It becomes
> > immune from active throttling, and all is well. I know that's not going
> > to be popular, but you just can't let X have free rein without leaving
> > the barn door wide open. (maybe that switch should stay since the
> > majority of boxen are workstations, and default to off?).
>
> Sounds good but I have to disagree on the X renice thing. It's not that I have
> a religious objection to renicing things. The problem is that our mainline
> scheduler determines latency also by nice level. This means that if you
> -renice a bursty cpu hog like X, then audio applications will fail unless
> they too are reniced. Try it on your patch.

True. If I renice, X can/will stomp all over xmms. However, I can use
up to 30% cpu sustained before I would need to renice, so in the general
case, this patch actually improves latency because it will pull X out of
prio 16 much much sooner than before. In this regard, it makes sense to
increase the scope, and throttle kernel threads as well. What really
_should_ happen from the latency perspective is that as soon as we
detect a prio boundary crossing, we should downgrade the task a level
immediately, and then wait to see how things develop. That would allow
the truly low latency tasks to preempt. When I compile a kernel for my
PIII/500 over nfs with my P4, knfsd eats about 40% of the PIII's cpu,
and all of it is there alongside xmms.

What would be nice for the X case is a scheduling class that only meant
"I'm always in the active array". Open multimedia hardware, and you stay
active, but can be throttled under load without turning the box into a
brick.

Oh, while I'm thinking of it, if this thing makes it past discussion and
evaluation, I think my definition of idle task boost should be used
whether the active measures are applied or not. The comments on that
code say idle tasks will only be promoted into the very bottom of the
interactive priority range via one long sleep. INTERACTIVE_SLEEP_AVG()
does that, and it helps. That's not enough however. What allows me to
defeat [1] irman2 isn't scrubbing to perfectly match how much it sleeps:
irman2 is designed to sleep long between burns and hand over the baton to
the next partner in crime, so if I measured slice_avg purely from issue
time, slice_avg would match sleep_avg, and it would still be deadly.
Detecting the beginning of a new slice and moving the stamp forward to
match the beginning of execution is what defeats it... the time he spends
trading the cpu with his evil twins doesn't count.
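
A minimal sketch of that kind of load (this is not the real irman2, just the
shape described: a ring of processes that each block on a pipe, burn the cpu
briefly, then pass the baton on; the process count and burn length are
arbitrary):

#include <stdlib.h>
#include <unistd.h>
#include <time.h>

#define NTASKS  4       /* arbitrary */
#define BURN_MS 50      /* arbitrary */

/* Spin for roughly ms milliseconds of wall time. */
static void burn(int ms)
{
        struct timespec start, now;

        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
                clock_gettime(CLOCK_MONOTONIC, &now);
        } while ((now.tv_sec - start.tv_sec) * 1000 +
                 (now.tv_nsec - start.tv_nsec) / 1000000 < ms);
}

int main(void)
{
        int pfd[NTASKS][2], i;
        char token = 'x';

        for (i = 0; i < NTASKS; i++)
                if (pipe(pfd[i]))
                        exit(1);

        for (i = 0; i < NTASKS; i++) {
                if (fork() == 0) {
                        int in = pfd[i][0];                     /* token arrives here */
                        int out = pfd[(i + 1) % NTASKS][1];     /* and leaves here    */

                        for (;;) {
                                read(in, &token, 1);    /* long "interactive" sleep */
                                burn(BURN_MS);          /* short cpu burst          */
                                write(out, &token, 1);  /* baton to the next hog    */
                        }
                }
        }
        write(pfd[0][1], &token, 1);    /* inject the token and let them run */
        pause();
        return 0;
}

Each process sleeps most of the time and so looks interactive on its own,
while the group as a whole keeps the cpu saturated.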


-Mike

1. Defeat is actually too strong a word; it is still one hell of an
unpleasant cpu hog in its total... as any threaded app may be.