2007-05-08 15:24:32

by Ingo Molnar

Subject: [patch] CFS scheduler, -v11


* Mike Galbraith <[email protected]> wrote:

> [...] Values 0x3 and 0xb are merely context switch happy.

thanks Mike - value 0x8 looks pretty good here and doesn't have the
artifacts you found. I've done a quick -v11 release with that fixed,
available at the usual place:

http://people.redhat.com/mingo/cfs-scheduler/

with no other changes.

Ingo


2007-05-09 15:47:10

by Kasper Sandberg

Subject: Re: [patch] CFS scheduler, -v11

Hello Ingo.

Sorry it has taken this long, but I've been quite busy with work.

Here are my results with v11 for smoothness.

Under slight load (spamassassin niced to 19), it's now doing okay in
smoothness, almost as good as SD. But under harder load, such as pressing
a link in a browser while running 3D (at nice 0), or spamassassin at nice 0,
it still goes stutterish instead of smooth. But overall it does seem
better.

On Tue, 2007-05-08 at 17:04 +0200, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:
>
> > [...] Values 0x3 and 0xb are merely context switch happy.
>
> thanks Mike - value 0x8 looks pretty good here and doesnt have the
> artifacts you found. I've done a quick -v11 release with that fixed,
> available at the usual place:
>
> http://people.redhat.com/mingo/cfs-scheduler/
>
> with no other changes.
>
> Ingo

2007-05-09 17:54:34

by Srivatsa Vaddagiri

Subject: Definition of fairness (was Re: [patch] CFS scheduler, -v11)

On Tue, May 08, 2007 at 05:04:31PM +0200, Ingo Molnar wrote:
> thanks Mike - value 0x8 looks pretty good here and doesnt have the
> artifacts you found. I've done a quick -v11 release with that fixed,
> available at the usual place:
>
> http://people.redhat.com/mingo/cfs-scheduler/
>
> with no other changes.

Ingo,
I had a question with respect to the definition of fairness used,
especially for tasks that are not 100% cpu hogs.

Ex: consider two equally important tasks T1 and T2 running on the same CPU and
whose execution nature is:

T1 = 100% cpu hog
T2 = 60% cpu hog (run for 600ms, sleep for 400ms)

Over an arbitrary observation period of 10 sec,

T1 was ready to run for all 10sec
T2 was ready to run for 6 sec

Over this observation period, how much execution time should T2 get,
under a "fair" scheduler?

I was expecting both T2 and T1 to get 5 sec (50:50 split). Is this a
wrong expectation of fairness?

Anyway, the results of this experiment (using the attached testcase) are below.
T2 gets way below its fair share IMO (under both CFS and SD).


2.6.21.1 (top output; columns: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND):

5444 vatsa 16 0 2468 460 388 R 59 0.0 0:19.76 3 T1
5443 vatsa 25 0 2468 460 388 R 40 0.0 0:15.36 3 T2


2.6.21.1 + cfs-v11:

5460 vatsa 31 0 2464 460 388 R 70 0.0 0:15.28 3 T1
5461 vatsa 29 0 2468 460 388 R 30 0.0 0:05.65 3 T2


2.6.21 + sd-0.48:

5459 vatsa 23 0 2468 460 388 R 70 0.0 0:17.02 3 T1
5460 vatsa 21 0 2464 460 388 R 30 0.0 0:06.21 3 T2


Note:

T1 is started as ./cpuhog 600 0 10 > /dev/null &
T2 is started as ./cpuhog 600 400 10 > /dev/null &

First arg = runtime in ms
Second arg = sleeptime in ms
Third arg = Observation period in seconds
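
[ Illustrative sketch (not the attached cpuhog.c) of the run/sleep/report
structure described above, in C; the calibration step is the loops_per_ms
issue that comes up later in the thread: ]

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/resource.h>

static volatile unsigned long sink;     /* keeps the busy loops from being optimized away */

static double now_ms(void)              /* wall-clock time in milliseconds */
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

static double cpu_s(void)               /* CPU time consumed so far, in seconds */
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6 +
           ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
    if (argc < 4) {
        fprintf(stderr, "usage: %s <run_ms> <sleep_ms> <epoch_s>\n", argv[0]);
        return 1;
    }
    long run_ms = atol(argv[1]), sleep_ms = atol(argv[2]), epoch_s = atol(argv[3]);

    /* calibrate: how many loop iterations fit into one millisecond? */
    double t0 = now_ms();
    for (unsigned long i = 0; i < 100000000UL; i++)
        sink += i;
    double loops_per_ms = 100000000.0 / (now_ms() - t0);
    unsigned long loops = (unsigned long)(loops_per_ms * run_ms);
    printf("loops_per_ms = %f (%lu loops per run phase)\n", loops_per_ms, loops);

    double cpu0 = cpu_s(), wall0 = now_ms();
    for (;;) {
        for (unsigned long i = 0; i < loops; i++)       /* "run" phase */
            sink += i;
        usleep(sleep_ms * 1000);                        /* "sleep" phase */
        if (now_ms() - wall0 >= epoch_s * 1000.0) {
            printf("Obtained %.2f seconds of execution time in %.0f elapsed seconds\n",
                   cpu_s() - cpu0, (now_ms() - wall0) / 1000.0);
            cpu0 = cpu_s();
            wall0 = now_ms();
        }
    }
}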


--
Regards,
vatsa


Attachments:
cpuhog.c (1.89 kB)

2007-05-09 19:24:31

by Dmitry Adamushko

Subject: Re: Definition of fairness (was Re: [patch] CFS scheduler, -v11)

On 09/05/07, Srivatsa Vaddagiri <[email protected]> wrote:
> On Tue, May 08, 2007 at 05:04:31PM +0200, Ingo Molnar wrote:
> > thanks Mike - value 0x8 looks pretty good here and doesnt have the
> > artifacts you found. I've done a quick -v11 release with that fixed,
> > available at the usual place:
> >
> > http://people.redhat.com/mingo/cfs-scheduler/
> >
> > with no other changes.
>
> Ingo,
> I had a question with respect to the definition of fairness used, esp
> for tasks that are not 100% cpu hogs.
>
> Ex: consider two equally important tasks T1 and T2 running on same CPU and
> whose execution nature is:
>
> T1 = 100% cpu hog
> T2 = 60% cpu hog (run for 600ms, sleep for 400ms)
>
> Over a arbitrary observation period of 10 sec,
>
> T1 was ready to run for all 10sec
> T2 was ready to run for 6 sec
>

One quick observation.

Isn't it important for both processes to have the same "loops_per_ms" value?

At least on my machine, it's not always the case. See how the results
differ in the following 2 tests:

(1) when the loops_per_ms values happen to be quite different;
(2) when they happen to be pretty much the same.

Both are for mainline 2.6.19 and both processes have equal characteristics,
namely:

./cpuhog 600 400 10

I ran them one by one from different shells.

(1)

dimm@dimm:/mnt/storage/kernel/sched-bench> ./cpuhog 600 400 10
loops_per_ms = 94517.958412 <---------- (*)

run time = 600 ms (56710775 loops), sleep time = 400 ms, epoch time = 10 s
Obtained 6.39 seconds of execution time in 10 elapsed seconds
Obtained 4.52 seconds of execution time in 11 elapsed seconds
Obtained 4.47 seconds of execution time in 10 elapsed seconds
Obtained 4.47 seconds of execution time in 10 elapsed seconds
Obtained 4.53 seconds of execution time in 10 elapsed seconds
Obtained 4.14 seconds of execution time in 10 elapsed seconds
...

dimm@dimm:/mnt/storage/kernel/sched-bench> ./cpuhog 600 400 10
loops_per_ms = 51599.587203 <------------- (**)

run time = 600 ms (30959752 loops), sleep time = 400 ms, epoch time = 10 s
Obtained 3.91 seconds of execution time in 10 elapsed seconds
Obtained 3.16 seconds of execution time in 10 elapsed seconds
Obtained 3.18 seconds of execution time in 10 elapsed seconds
Obtained 3.19 seconds of execution time in 10 elapsed seconds
Obtained 2.97 seconds of execution time in 10 elapsed seconds

Looking at the difference between (*) and (**), we can see that the reported
results are different, although one would expect both to be treated "equally"
by the current scheduler (whatever meaning one puts into that :)


(2) now another attempt.

dimm@dimm:/mnt/storage/kernel/sched-bench> ./cpuhog 600 400 10
loops_per_ms = 96711.798839

run time = 600 ms (58027079 loops), sleep time = 400 ms, epoch time = 10 s
Obtained 4.32 seconds of execution time in 10 elapsed seconds
Obtained 3.84 seconds of execution time in 11 elapsed seconds
Obtained 3.86 seconds of execution time in 11 elapsed seconds
Obtained 3.84 seconds of execution time in 11 elapsed seconds
Obtained 3.86 seconds of execution time in 11 elapsed seconds
...

dimm@dimm:/mnt/storage/kernel/sched-bench> ./cpuhog 600 400 10
loops_per_ms = 96899.224806

run time = 600 ms (58139534 loops), sleep time = 400 ms, epoch time = 10 s
Obtained 4.28 seconds of execution time in 10 elapsed seconds
Obtained 3.88 seconds of execution time in 11 elapsed seconds
Obtained 3.88 seconds of execution time in 10 elapsed seconds
Obtained 3.40 seconds of execution time in 10 elapsed seconds
Obtained 3.42 seconds of execution time in 10 elapsed seconds
Obtained 3.89 seconds of execution time in 11 elapsed seconds
...


In this case, "loops_per_ms" happened to be pretty equal so the
reported results look pretty similar. "top" showed very close numbers
in this case as well.

Maybe it's different in your case (and I don't claim that everything is
OK with SD and CFS), but I guess this could be one of the reasons.

At least, fork()-ing the second process (with different
run_time/sleep_time characteristics) from the first one would ensure
both have the same "loops_per_ms".
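
[ A minimal sketch of that suggestion, as an illustration only (the real
cpuhog.c is a standalone program started twice from the shell): calibrate
once in the parent, then fork, so both hogs inherit the same loops_per_ms
even though their run/sleep parameters differ. ]

#include <stdio.h>
#include <time.h>
#include <unistd.h>

static volatile unsigned long sink;

static double calibrate_loops_per_ms(void)
{
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (unsigned long i = 0; i < 50000000UL; i++)
        sink += i;
    clock_gettime(CLOCK_MONOTONIC, &b);
    double ms = (b.tv_sec - a.tv_sec) * 1000.0 + (b.tv_nsec - a.tv_nsec) / 1e6;
    return 50000000.0 / ms;
}

static void hog(double lpm, long run_ms, long sleep_ms)
{
    for (;;) {
        for (unsigned long i = 0; i < (unsigned long)(lpm * run_ms); i++)
            sink += i;                      /* run phase */
        if (sleep_ms)
            usleep(sleep_ms * 1000);        /* sleep phase */
    }
}

int main(void)
{
    double lpm = calibrate_loops_per_ms();  /* one calibration, shared by both */

    printf("loops_per_ms = %f\n", lpm);
    if (fork() == 0)
        hog(lpm, 600, 400);                 /* child: T2, the 60% hog */
    else
        hog(lpm, 600, 0);                   /* parent: T1, the 100% hog */
}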


--
Best regards,
Dmitry Adamushko

2007-05-09 20:24:08

by William Lee Irwin III

Subject: Re: Definition of fairness (was Re: [patch] CFS scheduler, -v11)

On Wed, May 09, 2007 at 11:32:05PM +0530, Srivatsa Vaddagiri wrote:
> I had a question with respect to the definition of fairness used, esp
> for tasks that are not 100% cpu hogs.
> Ex: consider two equally important tasks T1 and T2 running on same CPU and
> whose execution nature is:
> T1 = 100% cpu hog
> T2 = 60% cpu hog (run for 600ms, sleep for 400ms)
> Over a arbitrary observation period of 10 sec,
> T1 was ready to run for all 10sec
> T2 was ready to run for 6 sec
> Over this observation period, how much execution time should T2 get,
> under a "fair" scheduler?
> I was expecting both T2 and T1 to get 5 sec (50:50 split). Is this a
> wrong expectation of fairness?
> Anyway, results of this experiment (using testcase attached) is below.
> T2 gets way below its fair share IMO (under both cfs and sd).

It's not even feasible much of the time. Suppose your task ran for
100ms then slept for 900ms. It can't get more than 10% of the CPU in
any scheduler, work-conserving or not.

Second, anything that would credit tasks for sleep in such a manner
would flunk ringtest.c or analogous testcases arranged to pass
feasibility checks.

Fairness applies only to running tasks. The fair share of CPU must be
granted while the task is running. As for sleep, a task has to use its
fair share of CPU or lose it. The numerous pathologies that arise
from trying to credit tasks for sleep in this fashion are why
fairness in the orthodox sense I describe is now such a high priority.

However, it is possible to credit tasks for sleep in other ways. For
instance, EEVDF (which is very close to CFS) has a deadline parameter
expressing latency in addition to one expressing bandwidth that could,
in principle, be used for the purpose of crediting sleeping tasks. It's
not clear to me whether trying to use it for such purposes would be
sound, or, for that matter, whether tasks should receive latency
credits for sleep as opposed to other sorts of behaviors even if
utilizing the latency parameter for credits and bonuses for various
sorts of behavior is sound.


-- wli

2007-05-10 04:23:33

by David Schwartz

Subject: RE: Definition of fairness (was Re: [patch] CFS scheduler, -v11)


> Ex: consider two equally important tasks T1 and T2 running on
> same CPU and
> whose execution nature is:
>
> T1 = 100% cpu hog
> T2 = 60% cpu hog (run for 600ms, sleep for 400ms)
>
> Over a arbitrary observation period of 10 sec,
>
> T1 was ready to run for all 10sec
> T2 was ready to run for 6 sec
>
> Over this observation period, how much execution time should T2 get,
> under a "fair" scheduler?

Anywhere between 30% and 50% of the CPU is fair. There is nothing
particularly fair or unfair about how much credit you give an interactive
task. Less than 30% is unfair as the task is being punished for voluntarily
yielding. More than 50% is unfair as the task should not be entitled to more
than half the CPU if another task with equal static priority wants as much
CPU as possible.
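
[ To put numbers on that band using the testcase above: per 1000 ms cycle,
with no credit for sleep T2 only competes during the 600 ms it is runnable,
so it gets 600/2 = 300 ms and T1 gets the remaining 700 ms - the 70:30 split
seen in the CFS/SD top output. With full credit for the 400 ms of sleep the
split would be 500:500, still within T2's 600 ms demand - hence 30% to 50%. ]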

Slightly less than 30% or slightly more than 50% can also be reasonable due
to rounding or the scheduling interval beating against T2's internal timing.

From a practical standpoint, we'd like such a task to get a bit more than
30%. How much more may reasonably depend on exactly what the sleep/run
interval is for the task. It would be unreasonable to expect a task to be
credited for sleeping for an entire day, but equally unreasonable for a task
to get no credit at all for voluntarily yielding the CPU when it is later
ready-to-run.

Typical UNIX schedulers bump a task's dynamic priority when it voluntarily
yields. As a result, it can preempt a CPU-burner when it later becomes
ready-to-run. This is intended more to reduce latency than to increase CPU
share. However, it should also have the effect of giving a polite task more
CPU than a pure CPU-burner when it needs it. (That's only fair.)

DS


2007-05-10 08:52:18

by Mike Galbraith

Subject: Re: Definition of fairness (was Re: [patch] CFS scheduler, -v11)

On Wed, 2007-05-09 at 23:32 +0530, Srivatsa Vaddagiri wrote:

> Ingo,
> I had a question with respect to the definition of fairness used, esp
> for tasks that are not 100% cpu hogs.
>
> Ex: consider two equally important tasks T1 and T2 running on same CPU and
> whose execution nature is:
>
> T1 = 100% cpu hog
> T2 = 60% cpu hog (run for 600ms, sleep for 400ms)
>
> Over a arbitrary observation period of 10 sec,
>
> T1 was ready to run for all 10sec
> T2 was ready to run for 6 sec
>
> Over this observation period, how much execution time should T2 get,
> under a "fair" scheduler?
>
> I was expecting both T2 and T1 to get 5 sec (50:50 split). Is this a
> wrong expectation of fairness?

Depends on how long your fairness yardstick is I suppose.

> Anyway, results of this experiment (using testcase attached) is below.
> T2 gets way below its fair share IMO (under both cfs and sd).
>
>
> 2.6.21.1:
>
> 5444 vatsa 16 0 2468 460 388 R 59 0.0 0:19.76 3 T1
> 5443 vatsa 25 0 2468 460 388 R 40 0.0 0:15.36 3 T2
>
>
> 2.6.21.1 + cfs-v11:
>
> 5460 vatsa 31 0 2464 460 388 R 70 0.0 0:15.28 3 T1
> 5461 vatsa 29 0 2468 460 388 R 30 0.0 0:05.65 3 T2
>
>
> 2.6.21 + sd-0.48:
>
> 5459 vatsa 23 0 2468 460 388 R 70 0.0 0:17.02 3 T1
> 5460 vatsa 21 0 2464 460 388 R 30 0.0 0:06.21 3 T2
>
>
> Note:
>
> T1 is started as ./cpuhog 600 0 10 > /dev/null &

6524 root 20 0 1432 396 336 S 51 0.0 2:31.83 1 cpuhog
6525 root 20 0 1436 356 296 R 48 0.0 2:25.76 1 chew

> T2 is started as ./cpuhog 600 400 10 > /dev/null &

That's cpuhog in the above, and chew is the 100% hog. Below are the two
cpuhogs started just as you started them.

6565 root 20 0 1436 396 336 R 51 0.0 1:49.30 1 cpuhog
6566 root 20 0 1432 396 336 S 49 0.0 1:59.16 1 cpuhog

FWIW, I can squeeze some better long term fairness out of it by doing
the below.

If a task isn't going to sleep, it's always in competition with someone
even if it happens to be the invisible man right now, so runners will
drift to the right. Also, these two tasks aren't necessarily the same
at dequeue time wrt the effect they have (or rather _will_ have) on the
system, but are being treated as if they're identical twins.

-Mike

Experimental gefingerpoken und mittengrabben.

--- kernel/sched_fair.c.org 2007-05-08 07:51:48.000000000 +0200
+++ kernel/sched_fair.c 2007-05-10 09:43:17.000000000 +0200
@@ -20,7 +20,7 @@ unsigned int sysctl_sched_granularity __

unsigned int sysctl_sched_sleep_history_max __read_mostly = 2000000000;

-unsigned int sysctl_sched_load_smoothing = 1 | 2 | 4 | 0;
+unsigned int sysctl_sched_load_smoothing = 0 | 0 | 0 | 0 | 16;

/*
* Wake-up granularity.
@@ -140,6 +140,8 @@ static void limit_wait_runtime(struct ta
{
s64 limit = sysctl_sched_granularity * 16;

+ if (sysctl_sched_load_smoothing & 16)
+ limit = sysctl_sched_sleep_history_max >> 1;
if (p->wait_runtime > limit)
p->wait_runtime = limit;
if (p->wait_runtime < -limit)
@@ -150,10 +152,11 @@ static void limit_wait_runtime(struct ta
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
*/
-static inline void update_curr(struct rq *rq, u64 now)
+static inline void update_curr(struct rq *rq, u64 now, int load_weight)
{
u64 delta_exec, delta_fair, delta_mine;
struct task_struct *curr = rq->curr;
+ unsigned long load = get_rq_load(rq);

if (curr->sched_class != &fair_sched_class || curr == rq->idle
|| !curr->on_rq)
@@ -167,8 +170,6 @@ static inline void update_curr(struct rq
curr->exec_max = delta_exec;

if (sysctl_sched_load_smoothing & 1) {
- unsigned long load = get_rq_load(rq);
-
if (sysctl_sched_load_smoothing & 2) {
delta_fair = delta_exec << NICE_0_SHIFT;
do_div(delta_fair, load);
@@ -178,13 +179,13 @@ static inline void update_curr(struct rq
}

delta_mine = delta_exec * curr->load_weight;
- do_div(delta_mine, load);
+ do_div(delta_mine, load + load_weight);
} else {
delta_fair = delta_exec << NICE_0_SHIFT;
- do_div(delta_fair, rq->raw_weighted_load);
+ do_div(delta_fair, load);

delta_mine = delta_exec * curr->load_weight;
- do_div(delta_mine, rq->raw_weighted_load);
+ do_div(delta_mine, load + load_weight);
}

curr->sum_exec_runtime += delta_exec;
@@ -300,7 +301,7 @@ update_stats_wait_end(struct rq *rq, str
static inline void
update_stats_dequeue(struct rq *rq, struct task_struct *p, u64 now)
{
- update_curr(rq, now);
+ update_curr(rq, now, !p->sleep_start_fair ? NICE_0_LOAD : 0);
/*
* Mark the end of the wait period if dequeueing a
* waiting task:
@@ -327,7 +328,7 @@ update_stats_curr_start(struct rq *rq, s
static inline void
update_stats_curr_end(struct rq *rq, struct task_struct *p, u64 now)
{
- update_curr(rq, now);
+ update_curr(rq, now, !p->sleep_start_fair ? NICE_0_LOAD : 0);

p->exec_start = 0;
}
@@ -378,7 +379,7 @@ enqueue_task_fair(struct rq *rq, struct
/*
* Update the fair clock.
*/
- update_curr(rq, now);
+ update_curr(rq, now, 0);

if (wakeup) {
if (p->sleep_start) {



2007-05-10 16:34:12

by Srivatsa Vaddagiri

Subject: Re: Definition of fairness (was Re: [patch] CFS scheduler, -v11)

On Wed, May 09, 2007 at 09:24:17PM +0200, Dmitry Adamushko wrote:
> One quick observation.
>
> Isn't it important for both processes to have the same "loops_per_ms" value?

Good catch.

I have modified the testcase based on this observation (using
setitimer).
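
[ The modified testcase is attached rather than inlined; the sketch below
is only an illustration of the setitimer idea, not the attachment itself:
bound the busy phase by consumed CPU time via ITIMER_VIRTUAL instead of a
calibrated loop count, so that "run for 600ms" means 600ms of CPU time on
any machine. The per-epoch reporting is omitted for brevity. ]

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t phase_done;
static volatile unsigned long sink;

static void on_vtalrm(int sig)
{
    (void)sig;
    phase_done = 1;
}

static void run_for_cpu_ms(long ms)
{
    struct itimerval it = { .it_value = { ms / 1000, (ms % 1000) * 1000 } };

    phase_done = 0;
    setitimer(ITIMER_VIRTUAL, &it, NULL);   /* counts down only while we run */
    while (!phase_done)
        sink++;                             /* burn CPU until SIGVTALRM fires */
}

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <run_ms> <sleep_ms>\n", argv[0]);
        return 1;
    }
    long run_ms = atol(argv[1]), sleep_ms = atol(argv[2]);

    signal(SIGVTALRM, on_vtalrm);
    for (;;) {
        run_for_cpu_ms(run_ms);             /* run_ms of CPU time, not wall time */
        if (sleep_ms)
            usleep(sleep_ms * 1000);
    }
}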

--
Regards,
vatsa


Attachments:
cpuhog.c (1.72 kB)

2007-05-10 17:05:44

by Srivatsa Vaddagiri

Subject: Re: Definition of fairness (was Re: [patch] CFS scheduler, -v11)

On Wed, May 09, 2007 at 01:24:18PM -0700, William Lee Irwin III wrote:
> It's not even feasible much of the time. Suppose your task ran for
> 100ms then slept for 900ms. It can't get more than 10% of the CPU in
> any scheduler, work-conserving or not.

Sure. The question of fairness arises when such a task has to "fight" for
space/cputime in those 100ms, resulting in less than 10% bandwidth.
Having some statistics shown on CPU demand vs. obtained bandwidth may be
good?

> Fairness applies only to running tasks. The fair share of CPU must be
> granted while the task is running. As for sleep, it has to use its
> fair share of CPU or lose it. The numerous of pathologies that arise
> from trying to credit tasks for sleep in this fashion this are why
> fairness in the orthodox sense I describe is now such a high priority.
>
> However, it is possible to credit tasks for sleep in other ways. For
> instance, EEVDF (which is very close to CFS) has a deadline parameter
> expressing latency in addition to one expressing bandwidth that could,
> in principle, be used for the purpose of crediting sleeping tasks. It's
> not clear to me whether trying to use it for such purposes would be
> sound, or, for that matter, whether tasks should receive latency
> credits for sleep as opposed to other sorts of behaviors even if
> utilizing the latency parameter for credits and bonuses for various
> sorts of behavior is sound.

I guess the fairness interval also matters, as Mike pointed out. Over
shorter intervals, it may appear more fair to such sleeping tasks than over
longer intervals. Btw, CFS seems to leave this fairness interval undefined (it's
dependent on N, the number of tasks, and sysctl_sched_granularity?).

I suspect container/database/webserver workloads would like to receive
such latency/bandwidth credits for sleep. Will check with our
database/workload management folks on this.

--
Regards,
vatsa

2007-05-10 17:05:57

by Christian

Subject: Re: [patch] CFS scheduler, -v11

Hello lkml, hello Ingo!

I've been using CFS-v10 for a few days and I must say that I'm verry
impressed ;-)

Desktop performance without any manual renicing is excellent, even with
make -j20. Gaming performance is at least on par with SD now! I've tried to
change the sched_load_smoothing config to "8" but there is no visible
difference when it's set to "7".

Both schedulers are verrry good! I can't really tell which one is better.
I noticed that while compiling a kernel (with -j4) my CPU temperature is two
to three degrees hotter than with mainline. I have not done any timing tests,
but I suspect that it's a little faster while preserving excellent desktop
usability. Great work!! :-)

-Christian

2007-05-10 17:11:28

by Kasper Sandberg

Subject: Re: [patch] CFS scheduler, -v11

On Thu, 2007-05-10 at 18:59 +0200, Christian wrote:
> Hello lkml, hello Ingo!
>
> I've been using CFS-v10 for a few days and I must say that I'm verry
> impressed ;-)
>
> Desktop performance without any manual renicing is excellent, even with
> make -j20. Gaming performance is at least on par with SD now! I've tried to
Which games are you trying? And have you tried other workloads than make
-j20?

Try having a window with some 3D game open and a browser beside it, and
press a link. I can't seem to get smooth results with CFS.

Perhaps I could also conduct tests on my hardware with the games you are
trying.

> change the sched_load_smoothing config to "8" but there is no visible
> difference when it's set to "7".
>
> Both schedulers are verrry good! I can't really tell which one is better.
> I noticed that while compiling a kernel (with -j4) my CPU temperature is two
> to three degrees hotter than with mainline. I have not done any timing tests,
> but I suspect that it's a little faster while preserving excellent desktop
> usability. Great work!! :-)
>
> -Christian

2007-05-10 18:52:14

by Christian

Subject: Re: [patch] CFS scheduler, -v11

On Thursday 10 May 2007 19:10:44 Kasper Sandberg wrote:
> On Thu, 2007-05-10 at 18:59 +0200, Christian wrote:
> > Hello lkml, hello Ingo!
> >
> > I've been using CFS-v10 for a few days and I must say that I'm verry
> > impressed ;-)
> >
> > Desktop performance without any manual renicing is excellent, even with
> > make -j20. Gaming performance is at least on par with SD now! I've tried
> > to
>
> Which games are you trying? and have you tried other workloads then make
> -j20?
>
> try have a window with some 3d game open, and a browser besides it, and
> press a link. i cant seem to get smooth results with CFS.
>
> Perhaps i could also conduct tests with the games you are trying on my
> hardware.
>
> > change the sched_load_smoothing config to "8" but there is no visible
> > difference when it's set to "7".
> >
> > Both schedulers are verrry good! I can't really tell which one is better.
> > I noticed that while compiling a kernel (with -j4) my CPU temperature is
> > two to three degrees hotter than with mainline. I have not done any
> > timing tests, but I suspect that it's a little faster while preserving
> > excellent desktop usability. Great work!! :-)
> >
> > -Christian

I've tried many different workloads: kernel compile (normal -j4), extreme
kernel compile (-j20), and browsing/OpenOffice. GLXGears, Briquolo and
Enemy Territory work really well under these loads.

I just tried another test with "nice make -j20" and I see latency blips while
gaming. SD did not have this. Latencies with nice are _worse_ than latencies
without nice on my system. Playing with sched_load_smoothing does not change
anything. (I've tested values in the range 1-10)

-Christian

2007-05-10 18:55:27

by William Lee Irwin III

Subject: Re: Definition of fairness (was Re: [patch] CFS scheduler, -v11)

On Wed, May 09, 2007 at 01:24:18PM -0700, William Lee Irwin III wrote:
>> It's not even feasible much of the time. Suppose your task ran for
>> 100ms then slept for 900ms. It can't get more than 10% of the CPU in
>> any scheduler, work-conserving or not.

On Thu, May 10, 2007 at 10:43:12PM +0530, Srivatsa Vaddagiri wrote:
> sure. The question of fairnes arises when such a task has to "fight" for
> space/cputime in those 100ms, resulting in lesser than 10% bandwidth.
> Having some statistics shown on cpu demand vs obtained bandwidth may be
> good?

Davide Libenzi has a testcase which can essentially do this sort of
test involving sleep/wakeup behavior for you.


On Wed, May 09, 2007 at 01:24:18PM -0700, William Lee Irwin III wrote:
>> Fairness applies only to running tasks. The fair share of CPU must be
>> granted while the task is running. As for sleep, it has to use its
>> fair share of CPU or lose it. The numerous of pathologies that arise
>> from trying to credit tasks for sleep in this fashion this are why
>> fairness in the orthodox sense I describe is now such a high priority.
>> However, it is possible to credit tasks for sleep in other ways. For
>> instance, EEVDF (which is very close to CFS) has a deadline parameter
>> expressing latency in addition to one expressing bandwidth that could,
>> in principle, be used for the purpose of crediting sleeping tasks. It's
>> not clear to me whether trying to use it for such purposes would be
>> sound, or, for that matter, whether tasks should receive latency
>> credits for sleep as opposed to other sorts of behaviors even if
>> utilizing the latency parameter for credits and bonuses for various
>> sorts of behavior is sound.

On Thu, May 10, 2007 at 10:43:12PM +0530, Srivatsa Vaddagiri wrote:
> I guess fairness interval also matters, as Mike pointed out. Over
> shorter intervals, it may appear more fair to such sleeping tasks
> than over longer intervals. Btw CFS seems to leave this fairness
> interval undefined (its dependent on N, number of tasks and
> sysctl_sched_granularity?).

Usually things appear fairer over longer than shorter intervals. One
could regard the rate of convergence as a sort of performance metric
for the fairness of a scheduler, provided it converges to fairness at
all.


On Thu, May 10, 2007 at 10:43:12PM +0530, Srivatsa Vaddagiri wrote:
> I suspect container/database/webserver workloads would like to receive
> such latency/bandwidth credits for sleep. Will check with our
> database/workload management folks on this.

Bear in mind that bandwidth credits have soundness issues. Crediting
bandwidth in exchange for sleep will certainly flunk ringtest.c

Crediting latency in exchange for sleep at least has different problems
if it has any at all. I actually expect it to be sound to credit
latency (the l_i in EEVDF) in exchange for sleep, but the experts have
yet to weigh in on the subject.


-- wli

2007-05-10 19:46:17

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v11


* Christian <[email protected]> wrote:

> [...] Latencies with nice are _worse_ than latencies without nice on
> my system. [...]

I'll check that - this must be a CFS buglet.

Ingo

2007-05-10 19:55:29

by Ting Yang

Subject: Re: Definition of fairness (was Re: [patch] CFS scheduler, -v11)

Hi, Srivatsa

I would say that in this case, your specification of these tasks does not
actually give a baseline for evaluating fairness. When you want
to check whether the generated schedule is _fair_ or not, you first have to set
up a baseline which indicates the behavior the tasks should have, and then
compare the generated schedule against that "should" case for fairness.
In fact, the results you get (below) all make perfect sense,
based on the "should" defined by the different model adopted by each
scheduler.

Now I will try to explain all this in my creepy English:
> Ex: consider two equally important tasks T1 and T2 running on same CPU and
> whose execution nature is:
>
> T1 = 100% cpu hog
> T2 = 60% cpu hog (run for 600ms, sleep for 400ms)
>
First, this specification does not give any idea of the amount of real
work that should be done by each task during any time period. Now suppose you
have 4 such T1 tasks running: what do you expect each task to consume
in 10 sec? 2.5 sec. Hmmm, what if one of them is processing something I
need to be 3 times faster than the others? Then your expectation becomes 5,
5/3, 5/3, 5/3. Now what if the 3x faster process is not actually more
"important" than the others?
The same thing happens with T2, which is a little more complicated. When
T2 is active and at the same time there are other tasks running, what is
the amount of work it should get done during its active period? And then,
when T2 is sleeping, how does that affect the other tasks still running,
especially when T2 becomes active again? Should the scheduler remember
the history to be long-term "fair", or should it forget about the history?

All these questions require a complete model (different approaches use
different models), and there are 3 critical things needed for each task
in the system, which are currently more or less entangled together in all
existing schedulers.

For each task you need:
_B_ (CPU bandwidth), _T_ (response window), _P_ (priority
or importance)

Here, B is the amount of bandwidth needed, and T is the time period over which
B must be ensured. For example, if a task needs 20ms to process a
request, and the request has to be processed within 50ms, then B is 40%
and T is 50ms. B captures the throughput need of the task, while T
captures its response requirement. For the above task, if you give it (50%,
100ms), then some requests of that task may miss their deadlines, even
though throughput is guaranteed. The combination (B, T) describes the
requirement for a task to run at least "well" enough; however, it is not
necessarily related to the task's importance, therefore P is needed, so
that the scheduler can make choices when the CPU is overloaded, specifically
when the sum of the Bs goes beyond the CPU capacity.

In the real-time world, we have pretty good knowledge about a task's
requirement (B, T), therefore usually strict admission control or
bandwidth control is applied to ensure that admitted tasks are always
satisfied. When CPU overload does happen, either B or T serves as a static
priority, as in rate-monotonic scheduling, or dynamic priorities are
calculated, as in EDF (earliest deadline first). Various strategies
based on capacity reservation and bandwidth reservation were proposed (in the
late 80's and mid 90's) to enforce protection between tasks when the
workload in the system fluctuates.

However, on a general purpose machine, everything is dynamic. We do not
have any knowledge about (B, T) for any task. Tens of daemons and kernel
threads wake up at arbitrary intervals with arbitrary amounts of work;
each may or may not have a specific response deadline, and deadlines may
be soft or hard, etc. The required pre-knowledge and the scalability
issues make capacity/bandwidth reservation based approaches unsuitable
for a general purpose scheduler.

The solution used here is to dynamically figure out the (B, T) based on
the workload in the system (so that sum(Bi) never goes beyond the
CPU capacity), which leads to either _Fair Share_ scheduling or
_Proportional Share_ scheduling (the distinction between them is vague,
though), which use different models of fairness, as I will explain later.

In Linux, we have only one parameter for each task, the _NICE_
value, which has to serve the purpose of giving (B, T, P). Usually a small
nice value indicates higher priority and higher bandwidth (larger time
quanta), and may or may not say anything about T. Now let's see how the
_Fair Share_ model shapes the vanilla scheduler, and how the _Proportional
Share_ model shapes the CFS and RSDL/SD schedulers. (Note: the specification
you give above only indicates when a task is runnable, and does not give
information about (B, T).)

> 2.6.21.1:
>
> 5444 vatsa 16 0 2468 460 388 R 59 0.0 0:19.76 3 T1
> 5443 vatsa 25 0 2468 460 388 R 40 0.0 0:15.36 3 T2
>
>
The _Fair Share_ (FS) model, which the vanilla scheduler adopts, says: given a
long enough time window, the amount of work done by each task should be
proportional to its weight (which comes from the nice value). Therefore the
B_i of each task is given by w_i/sum(w), in your case 50%:50% for T1
and T2, but it says _nothing_ about T_i for each task.

Therefore, the Fair Share model is inherently problematic! Imagine T1 is a
100% cpu hog, and T2 works like this: work 10ms, sleep 90ms, work 10ms,
sleep 90ms, ..., work 10ms, sleep 90ms, then work constantly (100% cpu hog).
When T2 enters its last stage and becomes a 100% cpu hog, it will starve
T1 for an arbitrary amount of time in the FS model (depending on how long it
executed before). The problem here is that FS remembers arbitrarily long past
history and tries to compensate in the future, and this leads to
starvation, long latencies, and weird interactivity problems.

The vanilla scheduler approximates the FS model by calculating dynamic
priorities based on execution time and sleep time _heuristically_. It
also tries to solve the above problem by (1) dividing long quanta into
smaller timeslices and (2) only remembering a limited amount of history; that is
why it has an upper bound on the sleep time tracked for each task, which is used
for calculating goodness.

Overall, this kind of scheduler is heuristics-oriented, targets long-term
fairness with low accuracy, and provides almost no guarantee of
throughput or response time. Therefore people are constantly looking
for better ones.

> 2.6.21.1 + cfs-v11:
>
> 5460 vatsa 31 0 2464 460 388 R 70 0.0 0:15.28 3 T1
> 5461 vatsa 29 0 2468 460 388 R 30 0.0 0:05.65 3 T2
>
>
> 2.6.21 + sd-0.48:
>
> 5459 vatsa 23 0 2468 460 388 R 70 0.0 0:17.02 3 T1
> 5460 vatsa 21 0 2464 460 388 R 30 0.0 0:06.21 3 T2
Both CFS and RSDL/SD use Proportional Share (PS). It was originally
designed for packet transmission, where response time is relatively
important, and is based on an ideal fluid-flow model called GPS. The GPS
model says that any _active_ task should receive CPU bandwidth
proportional to its weight. The only difference between PS and the FS above
is that when deciding B for a task, PS only takes _active_ tasks into
account, while FS takes all tasks. In other words, PS forgets about
history, and sleep time (different from the time a task stays in the run
queue :-)) does not affect future scheduling. In this way, it
implicitly provides an upper bound on T for each active task (the exact
bound depends on the algorithm and implementation), which is usually
measured as the difference between the actual work done and the work that
should have been done in the ideal GPS model. For example, here CFS and SD give
a delay bound of O(n), where n is the number of active tasks.

In your example: when T2 is active, it gets B_2 = w_2/sum(w), a 50% share,
and nothing while it sleeps. Therefore, per 1000ms cycle, the share over time is:
T1: 400ms * 100% (share) + 600ms * 50% (share) = 700ms (70%)
T2: 400ms * 0% (share) + 600ms * 50% (share) = 300ms (30%)
which is exactly the CPU split shown in your results above for both CFS
and SD!

Overall, proportional share schedulers such as CFS and RSDL (other
similar ones worth mentioning: WFQ, WF2Q, SCFQ, SFQ, SFS, EEVDF, all
based on fair queuing approximating GPS) are more accurate, provide
short-term fairness, and give controlled bounds on throughput and
latency. However, all this comes at a cost: these schedulers are more
difficult to implement, need more complicated data structures, may take O(n) or
O(log n) time for each operation, need more frequent context switches,
etc.

Given the fluctuation of workloads, the variety of application behaviors, the
uncertainty of user intentions, etc., it is almost impossible to set up a
model that captures all of them. That is why building a good general
purpose scheduler is so difficult, and why many people like Ingo, Con, ...
are working so hard on it :-)

Sorry for another long email.

Ting

2007-05-10 20:03:28

by Ingo Molnar

Subject: Re: Definition of fairness (was Re: [patch] CFS scheduler, -v11)


* Mike Galbraith <[email protected]> wrote:

> On Wed, 2007-05-09 at 23:32 +0530, Srivatsa Vaddagiri wrote:
>
> > Ingo,
> > I had a question with respect to the definition of fairness used, esp
> > for tasks that are not 100% cpu hogs.
> >
> > Ex: consider two equally important tasks T1 and T2 running on same CPU and
> > whose execution nature is:
> >
> > T1 = 100% cpu hog
> > T2 = 60% cpu hog (run for 600ms, sleep for 400ms)
> >
> > Over a arbitrary observation period of 10 sec,
> >
> > T1 was ready to run for all 10sec
> > T2 was ready to run for 6 sec
> >
> > Over this observation period, how much execution time should T2 get,
> > under a "fair" scheduler?
> >
> > I was expecting both T2 and T1 to get 5 sec (50:50 split). Is this a
> > wrong expectation of fairness?
>
> Depends on how long your fairness yardstick is I suppose.

Yeah, I'd agree with that. I think a 400 msecs sleep period is still
within the range that we should define as being within the scope, but
it's probably borderline. The ultimate threshold is the reaction time of
humans - somewhere between 30 msecs and 1 second. Sleep periods beyond
that are typically not expected to be 'smoothly and fairly scheduled'
the same way as, say, CPU hogs are scheduled - because you can already
'see' the effects of the sleep - so 'smoothness' is not possible
anymore.

Ingo

2007-05-10 20:05:22

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v11


* Christian <[email protected]> wrote:

> > > Desktop performance without any manual renicing is excellent, even
> > > with make -j20. Gaming performance is at least on par with SD now!
> > > I've tried to
> >
> > Which games are you trying? and have you tried other workloads then
> > make -j20?
> >
> > try have a window with some 3d game open, and a browser besides it,
> > and press a link. i cant seem to get smooth results with CFS.
[...]
>
> I've tried many different workloads, kernel compile (normal -j4),
> extreme kernel compile (-j20) and Browsing/Open Office. GLXGears,
> Briquolo and enemy-territory work relly well under these loads.

Do you have CONFIG_PREEMPT enabled - while perhaps Kasper has it
disabled? Kasper, if so, could you try a quick test with CONFIG_PREEMPT
enabled - does that make any difference to your gaming-smoothness
results? [It's not a real solution, but this would help us narrow down
the smoothness problem you are seeing.]

Ingo

2007-05-11 11:09:18

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v11


[ re-mailing this on-list too, with some more explanations; I suspect
others might be affected by this 3D performance problem as well. ]

* Kasper Sandberg <[email protected]> wrote:

> [...] but under harder load such as pressing a link in a browser while
> 3d(at nice 0), or spamasassin at nice 0 still makes it go stutterish
> instead of smooth. But overall it does seem better.

OK, I think I have finally managed to track this one down.

Certain 3D drivers grew a subtle performance dependency on a
sys_sched_yield() implementation/behavioral bug/misbehavior of the
upstream kernel, which SD implements too, but which CFS fixes by
always yielding efficiently. The result of this bug/dependency is
extremely low FPS during any CPU-intense workload.

You are using an Nvidia 6600 card so I don't know for sure whether you
are affected by this problem (Radeon cards are affected and I can now
reproduce that) - but the symptoms I've reproduced seem to match
your system's symptoms.

I've added a workaround for this to CFS - do you have some time to try
it? I've attached sched-cfs-v12-rc4.patch (a delta patch on top of a
CFS -v11 tree), and once you have booted it you can activate the
workaround via:

echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

Does this make any difference to the drastic 3D smoothness problems you
are experiencing?

Ingo

---
Makefile | 2 +-
drivers/char/drm/radeon_cp.c | 5 +++++
include/linux/sched.h | 2 +-
kernel/sched_fair.c | 23 +++++++++++++++++++----
kernel/sysctl.c | 12 ++++++------
5 files changed, 32 insertions(+), 12 deletions(-)

Index: linux/Makefile
===================================================================
--- linux.orig/Makefile
+++ linux/Makefile
@@ -1,7 +1,7 @@
VERSION = 2
PATCHLEVEL = 6
SUBLEVEL = 21
-EXTRAVERSION = -cfs-v11
+EXTRAVERSION = -cfs-v12
NAME = Nocturnal Monster Puppy

# *DOCUMENTATION*
Index: linux/drivers/char/drm/radeon_cp.c
===================================================================
--- linux.orig/drivers/char/drm/radeon_cp.c
+++ linux/drivers/char/drm/radeon_cp.c
@@ -2210,6 +2210,11 @@ int radeon_driver_load(struct drm_device

DRM_DEBUG("%s card detected\n",
((dev_priv->flags & RADEON_IS_AGP) ? "AGP" : (((dev_priv->flags & RADEON_IS_PCIE) ? "PCIE" : "PCI"))));
+ if (sysctl_sched_yield_bug_workaround == -1) {
+ sysctl_sched_yield_bug_workaround = 1;
+ printk(KERN_WARNING "quirk installed: turning on "
+ "sys_sched_yield() workaround for Radeon DRM.\n");
+ }
return ret;
}

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1253,9 +1253,9 @@ extern char * sched_print_task_state(str

extern unsigned int sysctl_sched_granularity;
extern unsigned int sysctl_sched_wakeup_granularity;
-extern unsigned int sysctl_sched_sleep_history_max;
extern unsigned int sysctl_sched_child_runs_first;
extern unsigned int sysctl_sched_load_smoothing;
+extern int sysctl_sched_yield_bug_workaround;

#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -18,10 +18,6 @@
*/
unsigned int sysctl_sched_granularity __read_mostly = 2000000;

-unsigned int sysctl_sched_sleep_history_max __read_mostly = 2000000000;
-
-unsigned int sysctl_sched_load_smoothing = 0 | 0 | 0 | 8;
-
/*
* Wake-up granularity.
* (default: 1 msec, units: nanoseconds)
@@ -32,6 +28,19 @@ unsigned int sysctl_sched_load_smoothing
*/
unsigned int sysctl_sched_wakeup_granularity __read_mostly = 0;

+unsigned int sysctl_sched_load_smoothing __read_mostly = 0 | 0 | 0 | 8;
+
+/*
+ * sys_sched_yield unfairness bug workaround switch.
+ * (default: -1:auto-detect+disabled. Other values: 0:disabled, 1:enabled)
+ *
+ * This option switches the unfair yield implementation of the
+ * old scheduler back on. Needed for good performance of certain
+ * apps like 3D games on Radeon cards.
+ */
+int sysctl_sched_yield_bug_workaround __read_mostly = -1;
+
+EXPORT_SYMBOL_GPL(sysctl_sched_yield_bug_workaround);

extern struct sched_class fair_sched_class;

@@ -462,6 +471,12 @@ yield_task_fair(struct rq *rq, struct ta
u64 now;

/*
+ * Bug workaround for 3D apps running on the radeon 3D driver:
+ */
+ if (unlikely(sysctl_sched_yield_bug_workaround > 0))
+ return;
+
+ /*
* yield-to support: if we are on the same runqueue then
* give half of our wait_runtime (if it's positive) to the other task:
*/
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -223,24 +223,24 @@ static ctl_table kern_table[] = {
},
{
.ctl_name = CTL_UNNUMBERED,
- .procname = "sched_sleep_history_max_ns",
- .data = &sysctl_sched_sleep_history_max,
+ .procname = "sched_child_runs_first",
+ .data = &sysctl_sched_child_runs_first,
.maxlen = sizeof(unsigned int),
.mode = 0644,
.proc_handler = &proc_dointvec,
},
{
.ctl_name = CTL_UNNUMBERED,
- .procname = "sched_child_runs_first",
- .data = &sysctl_sched_child_runs_first,
+ .procname = "sched_load_smoothing",
+ .data = &sysctl_sched_load_smoothing,
.maxlen = sizeof(unsigned int),
.mode = 0644,
.proc_handler = &proc_dointvec,
},
{
.ctl_name = CTL_UNNUMBERED,
- .procname = "sched_load_smoothing",
- .data = &sysctl_sched_load_smoothing,
+ .procname = "sched_yield_bug_workaround",
+ .data = &sysctl_sched_yield_bug_workaround,
.maxlen = sizeof(unsigned int),
.mode = 0644,
.proc_handler = &proc_dointvec,

2007-05-11 12:03:00

by Christian

Subject: Re: [patch] CFS scheduler, -v11

On Thursday 10 May 2007 22:05:00 Ingo Molnar wrote:
> * Christian <[email protected]> wrote:
> > > > Desktop performance without any manual renicing is excellent, even
> > > > with make -j20. Gaming performance is at least on par with SD now!
> > > > I've tried to
> > >
> > > Which games are you trying? and have you tried other workloads then
> > > make -j20?
> > >
> > > try have a window with some 3d game open, and a browser besides it,
> > > and press a link. i cant seem to get smooth results with CFS.
>
> [...]
>
> > I've tried many different workloads, kernel compile (normal -j4),
> > extreme kernel compile (-j20) and Browsing/Open Office. GLXGears,
> > Briquolo and enemy-territory work relly well under these loads.
>
> do you have CONFIG_PREEMPT enabled - while perhaps Kasper has it
> disabled? Kasper, if so, could you try a quick test with CONFIG_PREEMPT
> enabled, does that make any difference to your gaming-smoothness
> results? [it's not a real solution, but this would help us narrow down
> the smoothness problem you are seeing.]

Yes, I have enabled full preemption and 1000 HZ.

-Christian