2007-05-01 21:23:06

by Ingo Molnar

[permalink] [raw]
Subject: [patch] CFS scheduler, -v8


i'm pleased to announce release -v8 of the CFS scheduler patchset. (The
main goal of CFS is to implement "desktop scheduling" with as high
quality as technically possible.)

The CFS patch against v2.6.21.1 (or against v2.6.20.10) can be
downloaded from the usual place:

http://people.redhat.com/mingo/cfs-scheduler/

-v7 resolved a couple of important regressions while not introducing new
regressions, so i felt it was time to step forward: -v8 tries to address
one of the last (known) frontiers: 3D/OpenGL games^H^H^H applications
'smoothness'.

To achieve more scheduling smoothness, -v8 introduces a new 'precise
load calculation and smoothing' feature. (A variant of this was
suggested by Peter Williams in earlier CFS discussions - thanks Peter!)

i was able to reuse the rq->cpu_load[] load average calculation from
Peter Williams's smpnice code, and it's thus now utilized on UP too. As
a nice side-effect of CFS using a smoothed load metric, apps should also
start up faster under load. CFS now utilizes the full range of smpnice
metrics.

No other fundamental portion of CFS was touched, so the rate of change
is moderate:

7 files changed, 140 insertions(+), 79 deletions(-)

Changes since -v7:

- powerpc debug output and build warning fixes (Balbir Singh)

- documentation fixes (Zach Carter)

- interactivity: precise load calculation and load smoothing

As usual, any sort of feedback, bugreport, fix and suggestion is more
than welcome,

Ingo


2007-05-02 03:07:08

by Ting Yang

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8


Hi, Ingo

My name is Ting Yang, a graduate student from UMASS. I am currently
studying the Linux scheduler and virtual memory manager to solve some
page swapping problems. I am very excited about the new CFS scheduler.
After I read through your code, I think you might be interested in
reading this paper:

"A Proportional Share Resource Allocation Algorithm for Real-Time,
Time-Shared Systems", by Ion Stoica. You can find the paper here:
http://citeseer.ist.psu.edu/37752.html

The authors of this paper propose a scheduler: Earliest Eligible
Virtual Deadline First (EEVDF). EEVDF uses exactly the same method as
CFS to track the execution of each running task. The only difference
between EEVDF and CFS is that EEVDF tries to be _deadline_ fair while
CFS is _start-time_ fair. Scheduling based on deadlines gives a better
response time bound and seems to be more fair.

In the rest of this email, I will try to explain the similarities and
differences between EEVDF and CFS. Hopefully this provides some useful
information w.r.t. your current work on CFS.

Similarities:
1. Both EEVDF and CFS use virtual time to track the system's progress.

CFS maintains rq->fair_clock for each cpu, which is advanced every
tick by:
SCALE/sum(p_i->loadweight)
where p_i->loadweight is the weight of each task, mapped from its nice
value via prio_to_load_shift[], given that the default weight is SCALE (1024).

EEVDF maintains a virtual time, which is advanced every tick by:
1/sum(w_i)
where w_i is the weight of each task, given that the default weight is 1.

Therefore, EEVDF and CFS monitor system progress in the same way,
except normalized to different scales.

2. EEVDF and CFS track each task's progress in the same way.

CFS maintains a p->fair_key which indicates the amount of work done
by this task. When p executes for a period t, p->fair_key is
incremented by:
t * SCALE/p->loadweight //the default weight is SCALE
(based on solving equations in your code :-))

EEVDF does the same thing under the assumption that the default weight
is 1; it uses the same equation to calculate deadlines for running
tasks.
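
To make this concrete, here is a minimal userspace sketch of the two
updates (EEVDF is conceptually the same once SCALE is set to 1). All
names are made up for illustration; this is not the actual kernel code,
which also uses finer-grained fixed-point arithmetic:

  #define SCALE 1024UL

  struct fake_rq   { unsigned long fair_clock, total_weight; };
  struct fake_task { unsigned long loadweight, fair_key; };

  /* each tick, the global virtual clock advances by SCALE/sum(weights) */
  static void clock_tick(struct fake_rq *rq)
  {
          rq->fair_clock += SCALE / rq->total_weight;
  }

  /* after p has run for t ticks, its key advances by t*SCALE/weight */
  static void charge_task(struct fake_task *p, unsigned long t)
  {
          p->fair_key += t * SCALE / p->loadweight;
  }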

Differences:
The main difference between CFS and EEVDF lies in the scheduling
policy, although they follow the same model for tracking tasks.

*** CFS: When a task starts, it gets p->fair_key from the current
virtual time rq->fair_clock. It will not be scheduled for execution
until all other running tasks' fair_keys go beyond p->fair_key by a
certain number of virtual ticks (5 in the current setting of CFS).

(I wish I could show examples with graphs instead of my not-so-good
English, but I am not sure whether it is appropriate to attach figures
on this mailing list.)

EXAMPLE: assume the system runs at 1000 ticks/second, i.e. 1ms a tick,
and the preemption granularity of CFS is 5 virtual ticks (the current
setting). If, at time t=0, we start 2 tasks: p1 and p2, both with nice
value 0 (weight 1024), and rq->fair_clock is initialized to 0, then we
have:
p1->fair_key = p2->fair_key = rq->fair_clock = 0.
CFS breaks the tie arbitrarily, say it executes p1. After 1 system tick
(1ms later) t=1, we have:
rq->fair_clock = 1/2, p1->fair_key = 1, p2->fair_key = 0.
Suppose a new task p3 starts with nice value -10 at this moment; that
is, p3->fair_key = 1/2. In this case, CFS will not schedule p3 for
execution until the fair_keys of p1 and p2 go beyond 5+1/2 (which
translates to about 10ms later in this setting), _regardless of_ the
priority (weight) of p3. Further, imagine that we start n tasks at time
t=0 and then start a task p_(n+1) at time t=1: the delay of task
p_(n+1) is actually proportional to the number of running tasks n.

Formally speaking, CFS can have an *O(n)* lag w.r.t. the ideal
proportional-share fluid-flow system, which can have arbitrarily fine
granularity. The cause of this problem is that p->fair_key only
reflects a fair point at which a task should be started; it does not
carry any information about how urgent the task is (i.e. the priority
or weight of the task).

*** In EEVDF, each task p_i is specified by 2 parameters: weight w_i
and timeslice length l_i. EEVDF tries to schedule tasks based on the
virtual deadline vd_i by which a timeslice l_i should be finished.
EEVDF keeps a virtual start (ve_i) and virtual deadline (vd_i) for
each task. When a task starts, its ve_i is initialized to the current
virtual time, and its virtual deadline is calculated as:
vd_i = ve_i + l_i/w_i //the same method CFS uses to update fair_key.
When l_i amount of work is done, the just-finished vd_i becomes the new
ve_i: the virtual start time of the next l_i amount of work is the
deadline of the previously finished timeslice. The virtual deadline
vd_i is then updated using the above equation.

EEVDF scheduling policy: always execute the task with the _earliest_
virtual deadline.
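
As an illustration only (the names are made up; this is neither the
paper's pseudo-code nor kernel code), the bookkeeping and the pick rule
could be sketched as:

  struct eevdf_task {
          double w;       /* weight w_i */
          double l;       /* timeslice length l_i, in ticks */
          double ve, vd;  /* virtual start time and virtual deadline */
  };

  /* on start: ve_i = current virtual time, vd_i = ve_i + l_i/w_i */
  static void eevdf_start(struct eevdf_task *p, double vtime)
  {
          p->ve = vtime;
          p->vd = p->ve + p->l / p->w;
  }

  /* once l_i work is done, the old deadline becomes the new start */
  static void eevdf_refresh(struct eevdf_task *p)
  {
          p->ve = p->vd;
          p->vd = p->ve + p->l / p->w;
  }

  /* policy: run the runnable task with the earliest virtual deadline */
  static struct eevdf_task *eevdf_pick(struct eevdf_task **run, int n)
  {
          struct eevdf_task *best = run[0];

          for (int i = 1; i < n; i++)
                  if (run[i]->vd < best->vd)
                          best = run[i];
          return best;
  }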

EXAMPLE: Assume the system has 1000 ticks per second. At time t = 0,
we start 2 tasks: p1 and p2, such that w_1 = w_2 = 1 and l_1 = l_2 = 5
ticks, i.e. 5ms (mirroring the similar setting in the CFS case).
Furthermore, the system virtual time vt is initialized to 0. Now at
time t = 0, we have
vt = 0,
ve_1 = 0, vd_1 = ve_1 + l_1/w_1 = 5
ve_2 = 0, vd_2 = ve_2 + l_2/w_2 = 5
As p1 and p2 have the same virtual deadline, EEVDF breaks the tie
arbitrarily, say it executes p1. After 1 system tick (1ms later), we have:
vt = 0 + 1 / (w_1 + w_2) = 1/2 //just as CFS updates rq->fair_clock
ve_1 = 0, vd_1 = 5 //not changed
ve_2 = 0, vd_2 = 5 //not changed
Suppose we start a new task p3 at this moment, with weight w_3 = 2 and
timeslice l_3 = 5 ticks (5ms). Then
ve_3 = vt = 1/2
vd_3 = ve_3 + l_3/w_3 = 1/2 + 5/2 = 3
The scheduler will now execute task p3 first, since it has the
earliest deadline. The deadline implicitly contains some very important
information that the CFS fair_key does not have: how urgently this
amount of work has to be done, which comes from the weight of the task.

Formally speaking, EEVDF has a constant O(1) lag w.r.t. the ideal
proportional-share fluid-flow system. (Please look at the paper for
detailed proofs.) Under normal circumstances, for a task p_i, EEVDF
ensures that it does not fall behind the ideal system by more than l_i
time. Occasionally, the delay can be max(l_i), the maximum timeslice
used by all active tasks, due to tasks dynamically entering and
leaving. (Again, the paper gives a more detailed explanation of this.)

We can see that the timeslice l_i used by a task p_i actually controls
the accuracy of scheduling p_i: the smaller l_i is, the closer p_i
stays to the ideal system during its execution. For example, in the
above case, suppose p3 has w_3 = 1 (the same as p1 and p2) and l_3 = 3
(3ms). Since vd_3 = 1/2 + l_3/w_3 = 7/2, the scheduler still executes
p3 first, even though p1, p2 and p3 have the same weight. A smaller l_i
leads the scheduler to handle p_i at a finer granularity: intuitively,
it makes the scheduler check the deadline of p_i more frequently and
ensures each small piece of work is done on time, while a larger l_i
does the reverse.

The decoupling of weight w_i and timeslice l_i is important. Generally
speaking, weight determines throughput and timeslice determines the
responsiveness of a task. In normal situations, high-priority tasks
usually need more CPU capacity within a short period of time (bursty
work, such as keyboard input, mouse moves, X updates, daemons, etc.),
and need to be processed as quickly as possible (responsiveness and
interactivity). Following the analysis above, for higher-priority tasks
we should give a _higher weight_ to ensure CPU throughput, and at the
same time a _smaller timeslice_ to ensure better responsiveness. This
is a bit counter-intuitive compared to the current Linux
implementation, where a smaller nice value leads to both higher weight
and a larger timeslice.
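
For instance, a hypothetical nice -> (weight, timeslice) mapping in
this direction (the numbers are entirely made up; this is not CFS's
actual prio_to_load_shift[] table) might look like:

  /* higher priority (lower nice) => larger weight but a SMALLER
   * timeslice; the reverse of the classic larger-timeslice rule */
  static void nice_to_wl(int nice, unsigned long *w, unsigned long *l)
  {
          /* weight shrinks as nice grows: 1024 at nice -20, 8 at 19 */
          *w = 1024UL >> ((nice + 20) / 5);
          /* timeslice (ms) grows with nice: 1ms at -20, ~10ms at 19 */
          *l = 1 + (nice + 20) / 4;
  }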

Now let's see what happens in the real-time world. Usually a real-time
application is specified with (C_i, T_i), where C_i is the amount of
work and T_i is the period of time within which the work has to be
finished. For example, (20ms, 50ms) says the scheduler has to do 20ms
of work within a window of 50ms for this task. A smaller T_i gives
tighter response constraints. Note that B_i = C_i/T_i is really the CPU
proportion for the task during its execution. In our proportional-share
world, the weight w_i decides the CPU proportion, which maps to B_i,
and the timeslice l_i gives the amount of work that should be done each
time, which maps to C_i. Then, using w_i and l_i, we can derive a
window size, which maps to T_i. Therefore a smaller l_i leads to a
smaller T_i, which means tighter response constraints. It seems to me
that all of this makes a lot of sense in both the proportional-share
and real-time worlds.
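
As a tiny worked example of this correspondence (my own illustration,
not taken from the paper):

  #include <stdio.h>

  /* map an RT-style reservation (C, T) onto proportional-share terms:
   * B = C/T is the CPU proportion (the role of w_i), l = C is the work
   * per window, and the implied window l/B recovers T */
  int main(void)
  {
          double C = 20.0, T = 50.0;  /* 20ms of work every 50ms */
          double B = C / T;           /* CPU share: 0.4 */
          double l = C;               /* per-window timeslice */

          printf("B = %.2f, window = l/B = %.1f ms\n", B, l / B);
          return 0;
  }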

Based on my understanding, adopting something like EEVDF in CFS should
not be very difficult given their similarities, although I do not have
any idea how this impacts load balancing for SMP. Is this worth a try?

Sorry for such a long email :-)

Ting

2007-05-02 05:11:10

by Willy Tarreau

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

Hi Ting,

On Tue, May 01, 2007 at 10:57:14PM -0400, Ting Yang wrote:
>
> Hi, Ingo
>
> My name is Ting Yang, a graduate student from UMASS. I am currently
> studying the Linux scheduler and virtual memory manager to solve some
> page swapping problems. I am very excited about the new CFS scheduler.
> After I read through your code, I think you might be interested in
> reading this paper:
>
> "A Proportional Share Resource Allocation Algorithm for Real-Time,
> Time-Shared Systems", by Ion Stoica. You can find the paper here:
> http://citeseer.ist.psu.edu/37752.html
>
> The authors of this paper propose a scheduler: Earliest Eligible
> Virtual Deadline First (EEVDF). EEVDF uses exactly the same method as
> CFS to track the execution of each running task. The only difference
> between EEVDF and CFS is that EEVDF tries to be _deadline_ fair while
> CFS is _start-time_ fair. Scheduling based on deadlines gives a better
> response time bound and seems to be more fair.
>
> In the rest of this email, I will try to explain the similarities and
> differences between EEVDF and CFS. Hopefully this provides some useful
> information w.r.t. your current work on CFS.

(...)
Thanks very much for this very clear explanation. Now I realize that
some of the principles I've had in mind for a long time already exist
and are documented! That's what I called sorting by job completion
time in the past, which might not have been clear to everyone. Now that
you have put words to all those concepts, it's much clearer ;-)

> The decoupling of weight w_i and timeslice l_i is important. Generally
> speaking, weight determines throughput and timeslice determines the
> responsiveness of a task.

I 100% agree. That's the problem we have with nice today. Some people
want to use nice to assign more CPU to tasks (as it has always been for
years) and others want to use nice to get better interactivity (meaning
nice as when you're in a queue and let the old woman go before you).

IMHO, the two concepts are opposed. Either you're a CPU hog OR you get
quick responsiveness.

> In normal situations, high-priority tasks usually need more CPU
> capacity within a short period of time (bursty work, such as keyboard
> input, mouse moves, X updates, daemons, etc.), and need to be
> processed as quickly as possible (responsiveness and interactivity).
> Following the analysis above, for higher-priority tasks we should give
> a _higher weight_ to ensure CPU throughput, and at the same time a
> _smaller timeslice_ to ensure better responsiveness. This is a bit
> counter-intuitive compared to the current Linux implementation, where
> a smaller nice value leads to both higher weight and a larger
> timeslice.

We have an additional problem in Linux, and not the least: it already
exists and is deployed everywhere, so we cannot break existing setups.
More specifically, we don't want to play with the nice values of
processes such as X.

That's why I think that monitoring the amount of the time-slice (l_i)
consumed by the task is important. I proposed to conserve the unused
part of l_i as a credit (and conversely, the credit can be negative if
the time-slice has been over-used). This credit would serve two
purposes:

- reassign the unused part of l_i on the next time-slices to get the
most fair share of CPU between tasks

- use it as an interactivity key to sort the tasks. Basically, if
we denote by u_i the unused CPU cycles, you can sort based on
(d_i - u_i) instead of just d_i, and the less hungry tasks will
reach the CPU faster than others.
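
A toy sketch of that second point (all names here are hypothetical,
not from any existing patch):

  /* sort runnable tasks by deadline minus unused-slice credit: a task
   * that under-used its slice (u_i > 0) sorts earlier, while an
   * over-user (u_i < 0) sorts later */
  struct credit_task {
          double d;   /* deadline d_i */
          double u;   /* credit: unused part of the last time-slice */
  };

  static double interactivity_key(const struct credit_task *p)
  {
          return p->d - p->u;
  }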

(...)

> Based on my understanding, adopting something like EEVDF in CFS should
> not be very difficult given their similarities, although I do not have
> any idea how this impacts load balancing for SMP. Is this worth a
> try?

I think that if you have time to spend on this, everyone would like to
see the difference. All the work on the scheduler is more or less
experimental and several people are exchanging ideas right now, so it
should be the right moment. You seem to understand both approaches very
well, and it's likely that it will not take you too much time :-)


> Sorry for such a long email :-)

It was worth it, thanks !

Willy

2007-05-02 05:29:34

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Tue, May 01, 2007 at 10:57:14PM -0400, Ting Yang wrote:
> The authors of this paper propose a scheduler: Earliest Eligible
> Virtual Deadline First (EEVDF). EEVDF uses exactly the same method as
> CFS to track the execution of each running task. The only difference
> between EEVDF and CFS is that EEVDF tries to be _deadline_ fair while
> CFS is _start-time_ fair. Scheduling based on deadlines gives a better
> response time bound and seems to be more fair.
> In the rest of this email, I will try to explain the similarities and
> differences between EEVDF and CFS. Hopefully this provides some useful
> information w.r.t. your current work on CFS.

Any chance you could write a patch to convert CFS to EEVDF? People may
have an easier time understanding code than theoretical explanations.
(I guess I could do it if sufficiently pressed.)


-- wli

2007-05-02 06:37:23

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Tue, 2007-05-01 at 23:22 +0200, Ingo Molnar wrote:
> i'm pleased to announce release -v8 of the CFS scheduler patchset. (The
> main goal of CFS is to implement "desktop scheduling" with as high
> quality as technically possible.)
...

> As usual, any sort of feedback, bugreport, fix and suggestion is more
> than welcome,

Greetings,

I noticed a (harmless) bounds warning triggered by the reduction in size
of array->bitmap. Patchlet below.

-Mike

CC kernel/sched.o
kernel/sched_rt.c: In function 'load_balance_start_rt':
include/asm-generic/bitops/sched.h:30: warning: array subscript is above array bounds
kernel/sched_rt.c: In function 'pick_next_task_rt':
include/asm-generic/bitops/sched.h:30: warning: array subscript is above array bounds

--- linux-2.6.21-cfs.v8/include/asm-generic/bitops/sched.h.org 2007-05-02 07:16:47.000000000 +0200
+++ linux-2.6.21-cfs.v8/include/asm-generic/bitops/sched.h 2007-05-02 07:20:45.000000000 +0200
@@ -6,28 +6,23 @@

/*
* Every architecture must define this function. It's the fastest
- * way of searching a 140-bit bitmap where the first 100 bits are
- * unlikely to be set. It's guaranteed that at least one of the 140
- * bits is cleared.
+ * way of searching a 100-bit bitmap. It's guaranteed that at least
+ * one of the 100 bits is cleared.
*/
static inline int sched_find_first_bit(const unsigned long *b)
{
#if BITS_PER_LONG == 64
- if (unlikely(b[0]))
+ if (b[0])
return __ffs(b[0]);
- if (likely(b[1]))
- return __ffs(b[1]) + 64;
- return __ffs(b[2]) + 128;
+ return __ffs(b[1]) + 64;
#elif BITS_PER_LONG == 32
- if (unlikely(b[0]))
+ if (b[0])
return __ffs(b[0]);
- if (unlikely(b[1]))
+ if (b[1])
return __ffs(b[1]) + 32;
- if (unlikely(b[2]))
+ if (b[2])
return __ffs(b[2]) + 64;
- if (b[3])
- return __ffs(b[3]) + 96;
- return __ffs(b[4]) + 128;
+ return __ffs(b[3]) + 96;
#else
#error BITS_PER_LONG not defined
#endif


2007-05-02 06:46:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8


* Mike Galbraith <[email protected]> wrote:

> > As usual, any sort of feedback, bugreport, fix and suggestion is
> > more than welcome,
>
> Greetings,
>
> I noticed a (harmless) bounds warning triggered by the reduction in
> size of array->bitmap. Patchlet below.

thanks, applied! Your patch should also speed up task selection of RT
tasks a bit (the patch removes ~40 bytes of code). And on 64-bit we now
fit into a 2x 64-bit bitmap and are thus down to two Find-First-Set
instructions. Nice :)

Ingo

2007-05-02 08:00:06

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Tue, 2007-05-01 at 23:22 +0200, Ingo Molnar wrote:
> - interactivity: precise load calculation and load smoothing

This seems to help quite a bit.

(5 second top sample)

2636 root 15 -5 19148 15m 5324 R 73 1.5 1:42.29 0 amarok_libvisua
5440 root 20 0 320m 36m 8388 S 18 3.6 3:28.55 1 Xorg
4621 root 20 0 22776 18m 4168 R 12 1.8 0:00.63 1 cc1
4616 root 20 0 19456 13m 2200 R 9 1.3 0:00.43 0 cc1

I no longer have to renice both X and Gforce to achieve a perfect
display when they are sharing my box with a make -j2. X is displaying
everything it's being fed beautifully with no help. I still have to
renice Gforce (amarok_libvisual), but given its very heavy CPU usage,
that seems perfectly fine.

No regressions noticed so far. Box is _very_ responsive under load,
seemingly even more so than with previous releases. That is purely
subjective, but the first impression was very distinct.

-Mike

2007-05-02 08:03:47

by Gene Heskett

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Wednesday 02 May 2007, Mike Galbraith wrote:
>On Tue, 2007-05-01 at 23:22 +0200, Ingo Molnar wrote:
>> i'm pleased to announce release -v8 of the CFS scheduler patchset. (The
>> main goal of CFS is to implement "desktop scheduling" with as high
>> quality as technically possible.)
>
>...
>
>> As usual, any sort of feedback, bugreport, fix and suggestion is more
>> than welcome,
>
>Greetings,
>
>I noticed a (harmless) bounds warning triggered by the reduction in size
>of array->bitmap. Patchlet below.
>
> -Mike

I just checked my logs, and it appears my workload didn't trigger this
one, Mike. And so far, v8 is working great here. And that "great" is in
my best "Tony the Tiger" voice, stolen shamelessly from the breakfast
cereal TV commercial of 30+ years ago. :)

Ingo asked for a 0-100 rating, where 0 is mainline as I recall it, and
100 is the best of the breed. I'll give this one a 100 till something
better shows up.

> [Mike's warning and patch snipped]



--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
I met my latest girl friend in a department store. She was looking at
clothes, and I was putting Slinkys on the escalators.
-- Steven Wright

2007-05-02 08:11:37

by Gene Heskett

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Wednesday 02 May 2007, Mike Galbraith wrote:
>On Tue, 2007-05-01 at 23:22 +0200, Ingo Molnar wrote:
>> - interactivity: precise load calculation and load smoothing
>
>This seems to help quite a bit.
>
>(5 second top sample)
>
> 2636 root 15 -5 19148 15m 5324 R 73 1.5 1:42.29 0 amarok_libvisua
> 5440 root 20 0 320m 36m 8388 S 18 3.6 3:28.55 1 Xorg
> 4621 root 20 0 22776 18m 4168 R 12 1.8 0:00.63 1 cc1
> 4616 root 20 0 19456 13m 2200 R 9 1.3 0:00.43 0 cc1
>
>I no longer have to renice both X and Gforce to achieve a perfect
>display when they are sharing my box with a make -j2. X is displaying
>everything it's being fed beautifully with no help. I still have to
>renice Gforce (amarok_libvisual), but given its very heavy CPU usage,
>that seems perfectly fine.
>
>No regressions noticed so far. Box is _very_ responsive under load,
>seemingly even more so than with previous releases. That is purely
>subjective, but the first impression was very distinct.
>
> -Mike

And I have a make -j4 running as I apply your patch, Mike, and I can't
tell from here, no lag.

This is another keeper folks.

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
It takes less time to do a thing right than it does to explain why you
did it wrong.
-- H.W. Longfellow

2007-05-02 08:12:26

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Wed, 2007-05-02 at 04:03 -0400, Gene Heskett wrote:

> I just checked my logs, and it appears my workload didn't trigger this one
> Mike.

It's just a build time compiler warning.

> Ingo asked for a 0-100 rating, where 0 is mainline as I recall it, and 100 is
> the best of the breed. I'll give this one a 100 till something better shows
> up.

Ditto. (so far... ya never know;)

-Mike

2007-05-02 08:15:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8


* Gene Heskett <[email protected]> wrote:

> > I noticed a (harmless) bounds warning triggered by the reduction in
> > size of array->bitmap. Patchlet below.
>
> I just checked my logs, and it appears my workload didn't trigger this
> one Mike. [...]

yeah: this is a build-time warning and it needs a newer/smarter gcc to
notice that provably redundant piece of code. It's a harmless thing -
but nevertheless Mike's fix is a nice little micro-optimization as well:
it always bothered me a bit that at 140 priority levels we were _just_
past the 128 bits boundary by 12 bits. Now on 64-bit boxes it's just two
64-bit words to cover all 100 priority levels of RT tasks.

> [...] And so far, v8 is working great here. And that great is in my
> best "Tony the Tiger" voice, stolen shamelessly from the breakfast
> cereal tv commercial of 30+ years ago. :)

heh :-)

> Ingo asked for a 0-100 rating, where 0 is mainline as I recall it, and
> 100 is the best of the breed. I'll give this one a 100 till something
> better shows up.

nice - and you aren't even using any OpenGL games ;)

The 0-100 rating is really useful to me so that i can see the impact of
regressions (if any), and it's also one single number representing the
subjective impression - that way it's easier to keep tabs on things.

btw., do you still renice kmail slightly, or does it now work out of the
box with the default nice 0?

Ingo

2007-05-02 08:48:54

by Gene Heskett

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Wednesday 02 May 2007, Mike Galbraith wrote:
>On Wed, 2007-05-02 at 04:03 -0400, Gene Heskett wrote:
>> I just checked my logs, and it appears my workload didn't trigger this one
>> Mike.
>
>It's just a build time compiler warning.

Duh. I have a couple of pages of "may be used uninitialized" warnings,
including one in serial.c for the raw channel. And I have a bit of
trouble there too. Related? Dunno.

>> Ingo asked for a 0-100 rating, where 0 is mainline as I recall it, and 100
>> is the best of the breed. I'll give this one a 100 till something better
>> shows up.
>
>Ditto. (so far... ya never know;)
>
> -Mike

Yup, this is sweet so far. :-)

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
The vulcan-death-grip ping has been applied.

2007-05-02 08:51:33

by Gene Heskett

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Wednesday 02 May 2007, Ingo Molnar wrote:
>* Gene Heskett <[email protected]> wrote:
>> > I noticed a (harmless) bounds warning triggered by the reduction in
>> > size of array->bitmap. Patchlet below.
>>
>> I just checked my logs, and it appears my workload didn't trigger this
>> one Mike. [...]
>
>yeah: this is a build-time warning and it needs a newer/smarter gcc to
>notice that provably redundant piece of code. It's a harmless thing -
>but nevertheless Mike's fix is a nice little micro-optimization as well:
>it always bothered me a bit that at 140 priority levels we were _just_
>past the 128 bits boundary by 12 bits. Now on 64-bit boxes it's just two
>64-bit words to cover all 100 priority levels of RT tasks.
>
>> [...] And so far, v8 is working great here. And that great is in my
>> best "Tony the Tiger" voice, stolen shamelessly from the breakfast
>> cereal tv commercial of 30+ years ago. :)
>
>heh :-)
>
>> Ingo asked for a 0-100 rating, where 0 is mainline as I recall it, and
>> 100 is the best of the breed. I'll give this one a 100 till something
>> better shows up.
>
>nice - and you aren't even using any OpenGL games ;)
>
>The 0-100 rating is really useful to me so that i can see the impact of
>regressions (if any), and it's also one single number representing the
>subjective impression - that way it's easier to keep tabs on things.
>
>btw., do you still renice kmail slightly, or does it now work out of the
>box with the default nice 0?
>
> Ingo

For these last couple of boots, it's "right out of the box" and isn't
getting under my skin. A make -j4 didn't bother it either.

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
The goys have proven the following theorem...
-- Physicist John von Neumann, at the start of a classroom
lecture.

2007-05-02 09:08:21

by Balbir Singh

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

Ingo Molnar wrote:
> Changes since -v7:
>
> - powerpc debug output and build warning fixes (Balbir Singh)
>
> - documentation fixes (Zach Carter)
>
> - interactivity: precise load calculation and load smoothing
>
> As usual, any sort of feedback, bugreport, fix and suggestion is more
> than welcome,
>
> Ingo

Hi, Ingo,

I would like to report what I think is a regression with -v8.

With -v7 I would run the n/n+1 test: basically, on a system with
n cpus, I would run n+1 tasks and see how their load is distributed.
I usually find that the last two tasks get stuck on one CPU
of the system and get half the CPU time of their peers.
I think this issue has been around for a long time, even before CFS.
But while I was investigating that, I found that with -v8, all the n+1
tasks are stuck on the same CPU.

Output of /proc/sched_debug

# cat /proc/sched_debug
Sched Debug Version: v0.02
now at 1507287574145 nsecs

cpu: 0
.nr_running : 0
.raw_weighted_load : 0
.nr_switches : 111130
.nr_load_updates : 376821
.nr_uninterruptible : 18446744073709551550
.next_balance : 4295269119
.curr->pid : 0
.clock : 7431167968202137
.prev_clock_raw : 7431167968202136
.clock_warps : 0
.clock_unstable_events : 0
.clock_max_delta : 0
.fair_clock : 26969582038
.prev_fair_clock : 26969539422
.exec_clock : 9881536864
.prev_exec_clock : 9881494248
.wait_runtime : 116431647
.cpu_load[0] : 0
.cpu_load[1] : 0
.cpu_load[2] : 0
.cpu_load[3] : 0
.cpu_load[4] : 0

runnable tasks:
task PID tree-key delta waiting switches
prio wstart-fair sum-exec sum-wait
----------------------------------------------------------------------------------------------------------------------------

cpu: 1
.nr_running : 0
.raw_weighted_load : 0
.nr_switches : 56374
.nr_load_updates : 376767
.nr_uninterruptible : 156
.next_balance : 4295269118
.curr->pid : 0
.clock : 7431167857161633
.prev_clock_raw : 7431167857161632
.clock_warps : 0
.clock_unstable_events : 0
.clock_max_delta : 0
.fair_clock : 34038615236
.prev_fair_clock : 34038615236
.exec_clock : 18764126904
.prev_exec_clock : 18764126904
.wait_runtime : 132146856
.cpu_load[0] : 0
.cpu_load[1] : 0
.cpu_load[2] : 0
.cpu_load[3] : 0
.cpu_load[4] : 0

runnable tasks:
task PID tree-key delta waiting switches
prio wstart-fair sum-exec sum-wait
----------------------------------------------------------------------------------------------------------------------------

cpu: 2
.nr_running : 5
.raw_weighted_load : 5120
.nr_switches : 140351
.nr_load_updates : 376767
.nr_uninterruptible : 18446744073709551559
.next_balance : 4295269128
.curr->pid : 6462
.clock : 7431167968695481
.prev_clock_raw : 7431167968695480
.clock_warps : 0
.clock_unstable_events : 0
.clock_max_delta : 0
.fair_clock : 178895812434
.prev_fair_clock : 178895727748
.exec_clock : 858569069824
.prev_exec_clock : 858568528616
.wait_runtime : 2643237421
.cpu_load[0] : 0
.cpu_load[1] : 0
.cpu_load[2] : 0
.cpu_load[3] : 0
.cpu_load[4] : 0

runnable tasks:
task PID tree-key delta waiting switches
prio wstart-fair sum-exec sum-wait
----------------------------------------------------------------------------------------------------------------------------
R bash 6462 178897659138 1846704 -1846958 19646
120 -178895812434 169799117688 135410790136
bash 6461 178897934427 2121993 -7673376 19538
120 -5551118 169989747968 135499300276
bash 6460 178898353788 2541354 -6492732 19608
120 -3951111 170136703840 135648219117
bash 6459 178899921997 4109563 -6460948 19747
120 -2351093 170559324432 135812802778
bash 6458 178901052918 5240484 -5991881 19756
120 -751111 171257975848 135805570391

cpu: 3
.nr_running : 1
.raw_weighted_load : 1024
.nr_switches : 43253
.nr_load_updates : 376767
.nr_uninterruptible : 18446744073709551583
.next_balance : 4295269180
.curr->pid : 7524
.clock : 7431167970150081
.prev_clock_raw : 7431167970150080
.clock_warps : 0
.clock_unstable_events : 0
.clock_max_delta : 0
.fair_clock : 24318712701
.prev_fair_clock : 24318712701
.exec_clock : 20098322728
.prev_exec_clock : 20098322728
.wait_runtime : 178370619
.cpu_load[0] : 0
.cpu_load[1] : 0
.cpu_load[2] : 0
.cpu_load[3] : 0
.cpu_load[4] : 0

runnable tasks:
task PID tree-key delta waiting switches
prio wstart-fair sum-exec sum-wait
----------------------------------------------------------------------------------------------------------------------------
R cat 7524 24318779730 67029 -67029 3
120 -24318712701 1661560 2277


Output of top

6459 root 20 0 4912 792 252 R 20 0.0 8:29.33 bash
6458 root 20 0 4912 792 252 R 20 0.0 8:29.90 bash
6460 root 20 0 4912 792 252 R 20 0.0 8:28.94 bash
6461 root 20 0 4912 792 252 R 20 0.0 8:28.88 bash
6462 root 20 0 4912 792 252 R 20 0.0 8:28.54 bash





--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2007-05-02 10:05:18

by Bill Huey

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Tue, May 01, 2007 at 10:57:14PM -0400, Ting Yang wrote:
> Based on my understanding, adopting something like EEVDF in CFS should
> not be very difficult given their similarities, although I do not have
> any idea how this impacts load balancing for SMP. Is this worth a
> try?
>
> Sorry for such a long email :-)

An excellent long email. Thanks.

Have you looked at Con's SD? What do you think about it analytically,
and do you think these ideas could be incorporated into that
scheduler?

bill

2007-05-02 10:06:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8


* Balbir Singh <[email protected]> wrote:

> With -v7 I would run the n/n+1 test: basically, on a system with n
> cpus, I would run n+1 tasks and see how their load is distributed. I
> usually find that the last two tasks get stuck on one CPU of the
> system and get half the CPU time of their peers. I think this issue
> has been around for a long time, even before CFS. But while I was
> investigating that, I found that with -v8, all the n+1 tasks are
> stuck on the same CPU.

i believe this problem is specific to powerpc - load is distributed fine
on i686/x86_64 and your sched_debug shows a cpu_load[0] == 0 on CPU#2
which is 'impossible'. (I sent a few suggestions off-Cc about how to
debug this.)

Ingo

2007-05-02 10:27:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8


* Ting Yang <[email protected]> wrote:

> My name is Ting Yang, a graduate student from UMASS. I am currently
> studying the Linux scheduler and virtual memory manager to solve some
> page swapping problems. I am very excited about the new CFS scheduler.
> After I read through your code, I think you might be interested in
> reading this paper:

thanks for your detailed analysis - it was very interesting!

> Based on my understanding, adopting something like EEVDF in CFS
> should not be very difficult given their similarities, although I do
> not have any idea how this impacts load balancing for SMP. Is this
> worth a try?

It would definitely be interesting to try! I don't think it should
negatively impact load balancing on SMP. The current fork-time behavior
of CFS is really just a first-approximation thing, and what you propose
seems to make more sense to me too because it preserves the fluidity of
fairness. (I'd probably apply your patch even if there were no directly
measurable impact on workloads, because the more natural approaches tend
to be more maintainable in the long run.)

So by all means, please feel free to do a patch for this.

> Sorry for such a long email :-)

it made a lot of sense and was very useful :-)

Ingo

2007-05-02 10:41:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8


* Mike Galbraith <[email protected]> wrote:

> On Tue, 2007-05-01 at 23:22 +0200, Ingo Molnar wrote:
> > - interactivity: precise load calculation and load smoothing
>
> This seems to help quite a bit.

great :)

> (5 second top sample)
>
> 2636 root 15 -5 19148 15m 5324 R 73 1.5 1:42.29 0 amarok_libvisua
> 5440 root 20 0 320m 36m 8388 S 18 3.6 3:28.55 1 Xorg
> 4621 root 20 0 22776 18m 4168 R 12 1.8 0:00.63 1 cc1
> 4616 root 20 0 19456 13m 2200 R 9 1.3 0:00.43 0 cc1
>
> I no longer have to renice both X and Gforce to achieve a perfect
> display when they are sharing my box with a make -j2. X is displaying
> everything it's being fed beautifully with no help. I still have to
> renice Gforce (amarok_libvisual), but given its very heavy CPU usage,
> that seems perfectly fine.

ah! Besides OpenGL behavior and app startup performance i didn't
originally have Xorg in mind with this change, but thinking about it,
precise load calculation and load smoothing do have a positive effect
on 'coupled' workloads where, under the previous variant of CFS's load
calculation, one task component of the workload could become 'invisible'
to another task and hence cause macro-scale scheduling artifacts not
expected by humans. With smoothing these are dealt with more
consistently. Xorg can be a quite strongly coupled workload.

> No regressions noticed so far. Box is _very_ responsive under load,
> seemingly even more so than with previous releases. That is purely
> subjective, but the first impression was very distinct.

yeah, make -jN workloads (and any mixture of non-identical scheduling
patterns, which most real workloads consist of) should be handled more
consistently too by -v8.

so your workload (and Gene's workload) were in fact the ones that i had
hoped not to _hurt_ with -v8 (neither of you being the hardcore gamer
type ;), and in reality -v8 ended up helping them too. These sorts of
side-effects are always welcome ;-)

Ingo

2007-05-02 11:00:07

by Balbir Singh

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

Ingo Molnar wrote:
> * Balbir Singh <[email protected]> wrote:
>
>> With -v7 I would run the n/n+1 test: basically, on a system with n
>> cpus, I would run n+1 tasks and see how their load is distributed. I
>> usually find that the last two tasks get stuck on one CPU of the
>> system and get half the CPU time of their peers. I think this issue
>> has been around for a long time, even before CFS. But while I was
>> investigating that, I found that with -v8, all the n+1 tasks are
>> stuck on the same CPU.
>
> i believe this problem is specific to powerpc - load is distributed fine
> on i686/x86_64 and your sched_debug shows a cpu_load[0] == 0 on CPU#2
> which is 'impossible'. (I sent a few suggestions off-Cc about how to
> debug this.)
>
> Ingo

Hi, Ingo

The suggestions helped, here is a fix tested on PowerPC only.

Patch and Description
=====================


Load balancing on PowerPC is broken. Running 5 tasks on a 4 cpu system
results in all 5 tasks running on the same CPU. Based on Ingo's feedback,
I instrumented and debugged update_load_fair().

The problem is with comparing s64 values with (s64)ULONG_MAX, which
evaluates to -1. We check whether exec_delta64 and fair_delta64 are
greater than (s64)ULONG_MAX (i.e. -1), and if so we assign
(s64)ULONG_MAX to the respective values.

The fix is to compare these values against (s64)LONG_MAX and assign
(s64)LONG_MAX to exec_delta64 and fair_delta64 if they are greater than
(s64)LONG_MAX.
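
A minimal userspace illustration of the comparison bug, assuming a
64-bit box where unsigned long is 64 bits wide:

  #include <limits.h>
  #include <stdio.h>

  typedef long long s64;

  int main(void)
  {
          s64 fair_delta64 = 123456;  /* a perfectly sane delta */

          printf("(s64)ULONG_MAX = %lld\n", (s64)ULONG_MAX);  /* -1 */

          /* ">" against -1 holds for any non-negative delta, so the
           * intended clamp fires every time and corrupts the value */
          if (fair_delta64 > (s64)ULONG_MAX)
                  fair_delta64 = (s64)ULONG_MAX;

          printf("fair_delta64 = %lld\n", fair_delta64);      /* -1 */
          return 0;
  }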

Tested on PowerPC, the regression is gone, tasks are load balanced as they
were in v7.

Output of top

5614 root 20 0 4912 784 252 R 52 0.0 3:27.49 3 bash
5620 root 20 0 4912 784 252 R 47 0.0 3:07.38 2 bash
5617 root 20 0 4912 784 252 R 47 0.0 3:08.18 0 bash
5624 root 20 0 4912 784 252 R 26 0.0 1:42.97 1 bash
5621 root 20 0 4912 784 252 R 26 0.0 1:43.14 1 bash


Tasks 5624 and 5621 getting half of their peer values is a separate issue
altogether.

Signed-off-by: Balbir Singh <[email protected]>
---

kernel/sched.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff -puN kernel/sched.c~cfs-fix-load-balancing-arith kernel/sched.c
--- linux-2.6.21/kernel/sched.c~cfs-fix-load-balancing-arith 2007-05-02 16:16:20.000000000 +0530
+++ linux-2.6.21-balbir/kernel/sched.c 2007-05-02 16:16:47.000000000 +0530
@@ -1533,19 +1533,19 @@ static void update_load_fair(struct rq *
this_rq->prev_exec_clock = this_rq->exec_clock;
WARN_ON_ONCE(exec_delta64 <= 0);

- if (fair_delta64 > (s64)ULONG_MAX)
- fair_delta64 = (s64)ULONG_MAX;
+ if (fair_delta64 > (s64)LONG_MAX)
+ fair_delta64 = (s64)LONG_MAX;
fair_delta = (unsigned long)fair_delta64;

- if (exec_delta64 > (s64)ULONG_MAX)
- exec_delta64 = (s64)ULONG_MAX;
+ if (exec_delta64 > (s64)LONG_MAX)
+ exec_delta64 = (s64)LONG_MAX;
exec_delta = (unsigned long)exec_delta64;
if (exec_delta > TICK_NSEC)
exec_delta = TICK_NSEC;

idle_delta = TICK_NSEC - exec_delta;

- tmp = SCHED_LOAD_SCALE * exec_delta / fair_delta;
+ tmp = (SCHED_LOAD_SCALE * exec_delta) / fair_delta;
tmp64 = (u64)tmp * (u64)exec_delta;
do_div(tmp64, TICK_NSEC);
this_load = (unsigned long)tmp64;
_


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2007-05-02 11:18:31

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8


* Balbir Singh <[email protected]> wrote:

> The problem is with comparing s64 values with (s64)ULONG_MAX, which
> evaluates to -1. We check whether exec_delta64 and fair_delta64 are
> greater than (s64)ULONG_MAX (i.e. -1), and if so we assign
> (s64)ULONG_MAX to the respective values.

ah, indeed ...

> The fix is to compare these values against (s64)LONG_MAX and assign
> (s64)LONG_MAX to exec_delta64 and fair_delta64 if they are greater
> than (s64)LONG_MAX.
>
> Tested on PowerPC, the regression is gone, tasks are load balanced as
> they were in v7.

thanks, applied!

Ingo

2007-05-02 12:58:34

by Mark Lord

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

Ingo Molnar wrote:
> i'm pleased to announce release -v8 of the CFS scheduler patchset. (The
> main goal of CFS is to implement "desktop scheduling" with as high
> quality as technically possible.)
>
> The CFS patch against v2.6.21.1 (or against v2.6.20.10) can be
> downloaded from the usual place:
>
> http://people.redhat.com/mingo/cfs-scheduler/

Excellent. I've switched machines here, so my "old" single-core 2GHz notebook
is no longer being used as much. The new machine is a more or less identical
notebook, but with a 2.1GHz Core2Duo CPU.

And.. man, the mainline scheduler really sucks eggs on the dual-core!
On single core it was nearly always "nice" enough for me,
but wow.. what a degradation with the extra CPU.

So, CFS is now a permanent add-on for kernels I use here on either machine,
and I also can say that it works GGGGGGGRRRRRRRRRRREEEEEEEEEAAAAAAAAAATTTTTTTT!!!
(please excuse the North American excuse for a "cultural" reference). ;)

2007-05-02 13:07:46

by Vegard Nossum

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Tue, May 1, 2007 11:22 pm, Ingo Molnar wrote:
> As usual, any sort of feedback, bugreport, fix and suggestion is more
> than welcome,

Hi,

The sys_sched_yield_to() is not callable from userspace on i386 because it
is not part of the syscall table (arch/i386/kernel/syscall_table.S). This
causes sysenter_entry (arch/i386/kernel/entry.S) to use the wrong count
for nr_syscalls (320 instead of 321) and return with -ENOSYS.

Vegard




2007-05-02 16:41:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8


* Vegard Nossum <[email protected]> wrote:

> The sys_sched_yield_to() is not callable from userspace on i386
> because it is not part of the syscall table
> (arch/i386/kernel/syscall_table.S). This causes sysenter_entry
> (arch/i386/kernel/entry.S) to use the wrong count for nr_syscalls (320
> instead of 321) and return with -ENOSYS.

oops, indeed - the patch below should fix this. (x86 should really adopt
the nice x86_64 technique of building the syscall table out of the
unistd.h enumeration definitions.)

Ingo

Index: linux/arch/i386/kernel/syscall_table.S
===================================================================
--- linux.orig/arch/i386/kernel/syscall_table.S
+++ linux/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+ .long sys_sched_yield_to /* 320 */

2007-05-02 17:28:34

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Tue, May 01, 2007 at 10:57:14PM -0400, Ting Yang wrote:
> "A Proportional Share REsource Allocation Algorithm for Real-Time,
> Time-Shared Systems", by Ion Stoica. You can find the paper here:
> http://citeseer.ist.psu.edu/37752.html

Good paper ... thanks for the pointer.

I briefly went through the paper and my impression is that it expects
each task to specify the length of each new request it initiates. Is
that correct?

If we have to apply EEVDF to SCHED_NORMAL task scheduling under CFS, how
would we calculate that "length of each new request" (which is required
before we calculate its virtual deadline)?

> EXAMPLE: assume the system runs at 1000 ticks/second, i.e. 1ms a tick,
> and the preemption granularity of CFS is 5 virtual ticks (the current
> setting). If, at time t=0, we start 2 tasks: p1 and p2, both with nice
> value 0 (weight 1024), and rq->fair_clock is initialized to 0, then we
> have:
> p1->fair_key = p2->fair_key = rq->fair_clock = 0.
> CFS breaks the tie arbitrarily, say it executes p1. After 1 system tick
> (1ms later) t=1, we have:
> rq->fair_clock = 1/2, p1->fair_key = 1, p2->fair_key = 0.
> Suppose a new task p3 starts with nice value -10 at this moment; that
> is, p3->fair_key = 1/2. In this case, CFS will not schedule p3 for
> execution until the fair_keys of p1 and p2 go beyond 5+1/2 (which
> translates to about 10ms later in this setting), _regardless of_ the
> priority (weight) of p3.

There is also p->wait_runtime, which is taken into account when
calculating p->fair_key. So if p3 had been waiting in the runqueue for
a long time before, it could get to run sooner than 10ms later.

--
Regards,
vatsa

2007-05-02 17:48:10

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Tue, May 01, 2007 at 10:57:14PM -0400, Ting Yang wrote:
>> "A Proportional Share REsource Allocation Algorithm for Real-Time,
>> Time-Shared Systems", by Ion Stoica. You can find the paper here:
>> http://citeseer.ist.psu.edu/37752.html

On Wed, May 02, 2007 at 11:06:34PM +0530, Srivatsa Vaddagiri wrote:
> Good paper ... thanks for the pointer.
> I briefly went through the paper and my impression is that it expects
> each task to specify the length of each new request it initiates. Is
> that correct?
> If we have to apply EEVDF to SCHED_NORMAL task scheduling under CFS,
> how would we calculate that "length of each new request" (which is
> required before we calculate its virtual deadline)?

l_i and w_i are both functions of the priority. You essentially arrange
l_i to express QoS wrt. latency, and w_i to express QoS wrt. bandwidth.


On Tue, May 01, 2007 at 10:57:14PM -0400, Ting Yang wrote:
>> EXAMPLE: assume the system runs at 1000 ticks/second, i.e. 1ms a tick,
>> and the preemption granularity of CFS is 5 virtual ticks (the current
>> setting). If, at time t=0, we start 2 tasks: p1 and p2, both with nice
>> value 0 (weight 1024), and rq->fair_clock is initialized to 0, then we
>> have:
>> p1->fair_key = p2->fair_key = rq->fair_clock = 0.
>> CFS breaks the tie arbitrarily, say it executes p1. After 1 system tick
>> (1ms later) t=1, we have:
>> rq->fair_clock = 1/2, p1->fair_key = 1, p2->fair_key = 0.
>> Suppose a new task p3 starts with nice value -10 at this moment; that
>> is, p3->fair_key = 1/2. In this case, CFS will not schedule p3 for
>> execution until the fair_keys of p1 and p2 go beyond 5+1/2 (which
>> translates to about 10ms later in this setting), _regardless of_ the
>> priority (weight) of p3.

On Wed, May 02, 2007 at 11:06:34PM +0530, Srivatsa Vaddagiri wrote:
> There is also p->wait_runtime, which is taken into account when
> calculating p->fair_key. So if p3 had been waiting in the runqueue
> for a long time before, it could get to run sooner than 10ms later.

Virtual time is time from the task's point of view, which it has spent
executing. ->wait_runtime is a device to subtract out time spent on the
runqueue but not running from what would otherwise be virtual time to
express lag, whether deliberately or coincidentally. ->wait_runtime
would not be useful for EEVDF AFAICT, though it may be interesting to
report.


-- wli

2007-05-02 18:15:40

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8


* William Lee Irwin III <[email protected]> wrote:

> > There is also p->wait_runtime, which is taken into account when
> > calculating p->fair_key. So if p3 had been waiting in the runqueue
> > for a long time before, it could get to run sooner than 10ms later.
>
> Virtual time is time from the task's point of view, which it has spent
> executing. ->wait_runtime is a device to subtract out time spent on
> the runqueue but not running from what would otherwise be virtual time
> to express lag, whether deliberately or coincidentally. [...]

CFS is in fact _built around_ the ->wait_runtime metric (which, as its
name suggests already, expresses the precise lag a task observes
relative to 'ideal' fair execution), so what exactly makes you suspect
that this property of the ->wait_runtime metric might be 'coincidental'?
;-)

Ingo

2007-05-02 18:42:23

by Tong Li

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

> Based on my understanding, adopting something like EEVDF in CFS should
> not be very difficult given their similarities, although I do not have
> any idea how this impacts load balancing for SMP. Is this worth a
> try?
>
> Sorry for such a long email :-)

Thanks for the excellent explanation. I think EEVDF and many similar
algorithms assume a global ordering of all tasks in the system (based
on virtual time), whereas CFS does so locally on each processor and
relies on load balancing to achieve fairness across processors. It'd
achieve strong fairness locally, but I'm not sure about its global
fairness properties in an MP environment. If, ideally, the total load
weight on each processor were always the same, then local fairness
would imply global fairness, but this is a bin packing problem and is
intractable ...

tong

2007-05-02 18:55:42

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

* William Lee Irwin III <[email protected]> wrote:
>> Virtual time is time from the task's point of view, which it has spent
>> executing. ->wait_runtime is a device to subtract out time spent on
>> the runqueue but not running from what would otherwise be virtual time
>> to express lag, whether deliberately or coincidentally. [...]

On Wed, May 02, 2007 at 08:15:33PM +0200, Ingo Molnar wrote:
> CFS is in fact _built around_ the ->wait_runtime metric (which, as its
> name suggests already, expresses the precise lag a task observes
> relative to 'ideal' fair execution), so what exactly makes you suspect
> that this property of the ->wait_runtime metric might be 'coincidental'?
> ;-)

The coincidental aspect would be that at the time it was written, the
formal notion of lag was not being used particularly with respect to
priorities and load weights. The documentation doesn't describe it in
those terms, and I can't read minds, so I refrained from guessing.

Things are moving in good directions on all this as far as I'm
concerned. Moving according to Ting Yang's analysis should wrap up the
soundness concerns about intra-queue policy I've had. OTOH load
balancing I know much less about (not that I was ever any sort of an
expert on single queue affairs). That remains a very deep concern, as
load balancing is where most of the enterprise performance improvements
and degradations occur.


-- wli

2007-05-02 19:10:13

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

At some point in the past, Ting Yang wrote:
>> Based on my understanding, adopting something like EEVDF in CFS should
>> not be very difficult given their similarities, although I do not have
>> any idea how this impacts load balancing for SMP. Is this worth a
>> try?
>> Sorry for such a long email :-)

On Wed, May 02, 2007 at 11:42:20AM -0700, Li, Tong N wrote:
> Thanks for the excellent explanation. I think EEVDF and many similar
> algorithms assume a global ordering of all tasks in the system (based
> on virtual time), whereas CFS does so locally on each processor and
> relies on load balancing to achieve fairness across processors. It'd
> achieve strong fairness locally, but I'm not sure about its global
> fairness properties in an MP environment. If, ideally, the total load
> weight on each processor were always the same, then local fairness
> would imply global fairness, but this is a bin packing problem and is
> intractable ...

It's sort of obvious how to approximate it, but not entirely obvious
whether a given approximation actually suffices. More help with the
theoretical aspects is needed.


- wli

2007-05-02 19:12:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8


* William Lee Irwin III <[email protected]> wrote:

> The coincidental aspect would be that at the time it was written, the
> formal notion of lag was not being used particularly with respect to
> priorities and load weights. [...]

(nice levels for SCHED_OTHER are 'just' an add-on concept to the core of
CFS; in fact i had two wildly different approaches that both did the
trick for users, so i fail to see the relevance of priorities to the
core concept of measuring how much a task is waiting to get on the
runqueue via the 'fair clock' ... but let's move on.)

> Things are moving in good directions on all this as far as I'm
> concerned. Moving according to Ting Yang's analysis should wrap up the
> soundness concerns about intra-queue policy I've had. OTOH load
> balancing I know much less about (not that I was ever any sort of an
> expert on single queue affairs). [...]

the whole move to ->load_weight based calculations was to make CFS
integrate better with load-balancing and to bring the smpnice
infrastructure even more into the scheduler mainstream. [ There's a
small smpnice related buglet i fixed in -v9-to-be (based on Balbir
Singh's feedback), but otherwise it behaves quite well on SMP and that's
not a big surprise: i left the load-balancer largely intact. ]

Ingo

2007-05-02 19:43:11

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

* William Lee Irwin III <[email protected]> wrote:
>> Things are moving in good directions on all this as far as I'm
>> concerned. Moving according to Ting Yang's analysis should wrap up the
>> soundness concerns about intra-queue policy I've had. OTOH load
>> balancing I know much less about (not that I was ever any sort of an
>> expert on single queue affairs). [...]

On Wed, May 02, 2007 at 09:12:35PM +0200, Ingo Molnar wrote:
> the whole move to ->load_weight based calculations was to make CFS
> integrate better with load-balancing and to bring the smpnice
> infrastructure even more into the scheduler mainstream. [ There's a
> small smpnice related buglet i fixed in -v9-to-be (based on Balbir
> Singh's feedback), but otherwise it behaves quite well on SMP and that's
> not a big surprise: i left the load-balancer largely intact. ]

Despite the original motive, the ->load_weight -based calculations
largely resolved my concerns about intra-queue prioritization. They
were the change to the ->fair_key computation I wanted to see, which
was still not enough because of the O(rq->nr_running) lag issue due to
the differences from EEVDF, but I wasn't entirely aware of that, only
suspicious that some issue remained.

Load balancing has non-negligible impacts on fairness but I'm almost
entirely ignorant of the aspects of the theory relating to it. More
knowledgeable people, e.g. Tong Li and Ting Yang, need to take over
reviewing here much as they did on the intra-queue front, where I only
knew enough to drop hints and not to make more useful suggestions.

O(1) vs. O(lg(n)) priority queues are not what matters here (well,
obviously O(n) priority queues would break); what matters is O(1) vs.
O(n) lag.


-- wli

2007-05-02 23:42:00

by Ting Yang

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

Srivatsa Vaddagiri wrote:
> I briefly went through the paper and my impression is that it expects
> each task to specify the length of each new request it initiates. Is
> that correct?
>
No, the timeslice l_i here serves as a granularity control w.r.t
responsiveness (or latency, depending on how you interpret it). As wli
said, it can be expressed as a function of the priority, as we do for the
weight now. It is not related to the length of each new request. A
request may be 1 second long, but the scheduler may still process it
using a 10ms timeslice. A smaller timeslice leads to more accuracy, i.e.
closer to the ideal case.
    However, the maximum timeslice l_i used by all active tasks
determines the total responsiveness of the system, which I will explain
in detail later.
> There is also p->wait_runtime which is taken into account when
> calculating p->fair_key. So if p3 had been waiting in the runqueue for
> long before, it can get to run quicker than 10ms later.
Consider the case where p3 is a newly started or woken-up task and
carries no p->wait_runtime.

Ting

2007-05-03 02:48:26

by Ting Yang

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

Hi,

As encouraged by some of you, I have started implementing EEVDF.
However, I am quite new to this area and may not be experienced enough
to get it done quickly. The main problem I am facing now is how to treat
the semantics of yield() and yield_to(). I will probably throw out a lot
of questions along the way.

Also, I found that my previous email was not clear enough in describing
the properties of CFS and EEVDF, which caused some confusion, and there
were some mistakes too. In this email, I will try to make up for that.

*** Let's start from CFS:
For simplicity, let's assume that CFS preempts the current task p1 in
favour of another task p2 when p1->key - p2->key > 1, and that the
virtual time rq->fair_clock is initialized to 0. Suppose that at time
t = 0 we start n+1 tasks that run long enough. Task p1 has weight n and
all other tasks have weight 1. It is clear that, at t = 0,
p_1->key = p_2->key = ... = p_(n+1)->key = rq->fair_clock = 0

Since all tasks have the same key, CFS breaks the tie arbitrarily, which
leads to many possibilities. Let's consider 2 of them:
_Case One:_ p1, which has weight n, executes first:
t = 1: rq->fair_clock = 1/2n, p1->key = 1/n    // others are not changed
t = 2: rq->fair_clock = 2/2n, p1->key = 2/n
...
t = n: rq->fair_clock = n/2n, p1->key = n/n = 1
Only after p1 has executed for n ticks will the scheduler pick another
task for execution. Over the interval [0, n) the amount of actual work
done by p1 is n, while the amount of work that should have been done in
the ideal fluid-flow system is n * n/2n = n/2. Therefore the lag is
n/2 - n = -n/2; negative means p1 runs ahead of the ideal case. As we
can see, this lag is O(n).
_Case Two:_ the scheduler executes the tasks in the order p2, p3, ...,
p_(n+1), p1:
t = 1: rq->fair_clock = 1/2n, p2->key = 1    // others are not changed
t = 2: rq->fair_clock = 2/2n, p3->key = 1
...
t = n: rq->fair_clock = n/2n, p_(n+1)->key = 1
Then the scheduler picks p1 (weight n) for execution. Over the interval
[0, n) the amount of actual work done by p1 is 0, while the ideal amount
is n/2. Therefore the lag is n/2 - 0 = n/2; positive means p1 falls
behind the ideal case. The lag here for p1 is also O(n).
    As I said in the previous email, p->fair_key only carries information
about the past execution of a task and reflects a fair start point. It
does not carry information about weight.

*** Now, let's look at EEVDF.
I have to say that I missed a very important concept in EEVDF, which led
to confusion here. EEVDF stands for _Eligible_ Earliest Virtual Deadline
First, and I did not explain what _eligible_ means.

EEVDF maintains a virtual start time ve_i and a virtual deadline vd_i for
each task p_i, as well as a virtual time vt. A newly started or woken
task has its ve_i initialized to the current virtual time. Once a
timeslice l_i amount of work is done, the new virtual start time is set
to the previous virtual deadline, and then the virtual deadline is
recalculated as vd_i = ve_i + l_i/w_i.
    A task is eligible if and only if ve_i <= current virtual time vt.
    At every tick, EEVDF picks the eligible task with the earliest
virtual deadline for execution.
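
To make this rule concrete, here is a minimal C sketch of the pick and
advance steps (the struct and function names are mine, purely
illustrative; this is not code from any real implementation):

struct eevdf_task {
	double ve;	/* virtual start (eligible) time */
	double vd;	/* virtual deadline */
	double w;	/* weight */
	double l;	/* timeslice */
};

/* A task is eligible iff ve <= vt; among the eligible tasks, pick
 * the one with the earliest virtual deadline.  Assumes at least one
 * task is eligible (true when the lags sum to zero). */
struct eevdf_task *eevdf_pick(struct eevdf_task *t, int n, double vt)
{
	struct eevdf_task *best = NULL;
	int i;

	for (i = 0; i < n; i++) {
		if (t[i].ve > vt)		/* not yet eligible */
			continue;
		if (!best || t[i].vd < best->vd)
			best = &t[i];
	}
	return best;
}

/* After the picked task has consumed one timeslice l:
 * ve <- previous vd, vd <- ve + l/w. */
void eevdf_advance(struct eevdf_task *t)
{
	t->ve = t->vd;
	t->vd = t->ve + t->l / t->w;
}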

Let's see how it works, using an example similar to the CFS one above.
Suppose that at time t = 0 we start n+1 tasks: p1 has weight n, and all
others have weight 1. For simplicity, we assume all tasks use timeslice
l_i = 1, and the virtual time vt is initialized to 0.
- at time t = 0, we have:
vt = 0
ve_1 = 0, vd_1 = ve_1 + l_1/w_1 = 1/n
ve_2 = 0, vd_2 = ve_2 + l_2/w_2 = 1
...
ve_(n+1) = 0, vd_(n+1) = ve_(n+1) + l_(n+1)/w_(n+1) = 1
Since p1 is eligible and has the earliest deadline 1/n, the scheduler
executes it first. (Here the weight encoded in the deadline plays an
important role, allowing higher-weight tasks to be executed first.)
- at time t = 1:
vt = 1/2n
ve_1 = 1/n (the previous vd_1), vd_1 = ve_1 + 1/n = 2/n
Since ve_1 > vt, p1 is _not_ eligible. EEVDF picks another task by
breaking the tie; say it executes p2.
- at time t = 2:
vt = 2/2n = 1/n, ve_1 = 1/n, vd_1 = 2/n
ve_2 = 1, vd_2 = ve_2 + 1/1 = 2    // this makes p2 not eligible
Since vt = ve_1, p1 becomes eligible again and has the earliest deadline
2/n, so it is scheduled for execution. As EEVDF repeats, this gives a
schedule like p1, p2, p1, p3, p1, p4, p1, ... (presented tick by tick).
As you can see, p1 now never falls behind or runs ahead of the ideal
case by more than 1.

Now, let's check how the timeslice l_i impacts the system. Suppose we
change the timeslice of p1 from 1 to 2, and keep the others unchanged.
EEVDF gives a schedule like:
p1, p1, p2, p3, p1, p1, p4, p5, p1, p1, ... (presented tick by tick)
Similarly, if the timeslice of p1 is set to 3, EEVDF gives:
p1, p1, p1, p2, p3, p4, p1, p1, p1, ... (presented tick by tick)
As the timeslice of p1 increases, the system checks for reschedules less
frequently, which makes the lag of p1 larger, but the lag will not grow
beyond the maximum timeslice used, as long as that is a fixed constant.
On the other hand, increasing the timeslices of the other tasks has no
effect on when p1 is scheduled. (You can try running the algorithm
yourself :-))
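
(To play with it, the sketch above can be driven by a tick loop like the
following - again purely illustrative, and it assumes the setup used in
these examples: virtual time advancing by 1/sum(w_i) per real tick, and
integer timeslices measured in ticks.)

#include <stdio.h>

void simulate(struct eevdf_task *t, int ntasks, int nticks)
{
	struct eevdf_task *cur = NULL;
	double vt = 0.0, wsum = 0.0;
	int i, tick, left = 0;

	for (i = 0; i < ntasks; i++)
		wsum += t[i].w;

	for (tick = 0; tick < nticks; tick++) {
		if (left == 0) {		/* previous timeslice done */
			cur = eevdf_pick(t, ntasks, vt);
			left = (int)cur->l;
		}
		printf("tick %d: p%ld\n", tick, (long)(cur - t) + 1);
		if (--left == 0)
			eevdf_advance(cur);	/* ve <- vd, vd <- ve + l/w */
		vt += 1.0 / wsum;	/* virtual time per real tick */
	}
}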

In CFS, a task has to increase p->fair_key by a certain amount before
the scheduler considers preempting it. A higher weight leads to slower
progress of p->fair_key, and thus effectively to a larger timeslice.
Suppose the preemption granularity is 5 virtual ticks; then a task of
weight 1 needs to run 5 ticks, weight 2 needs 10 ticks, and weight 10
needs 50 ticks. The effect is that the timeslice increases linearly with
weight, which causes the O(n) lag. In fact, if we use timeslice n for p1
in the EEVDF example above, it behaves exactly like CFS.

Now, I have to fix an incorrect statement about EEVDF's lag bound in my
previous email:
Under EEVDF, a task's lag (the difference between the amount of work
that should be done in the ideal fluid-flow system and the actual amount
of work done) is bounded by the maximum timeslice used (not just
occasionally, as I said in my previous email). This actually means that
the maximum timeslice used controls the total responsiveness of the
system. Also, the combination of a high weight and a smaller timeslice
will give a better response guarantee for bursty, response-time-sensitive
tasks.

Sorry for the late reply. I had an eye exam today and got my eyes
dilated, which forced me to stay away from my computer for a while :-)

Ting


2007-05-03 03:08:10

by Ting Yang

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8



Li, Tong N wrote:
> Thanks for the excellent explanation. I think EEVDF and many algs alike
> assume global ordering of all tasks in the system (based on virtual
> time), whereas CFS does so locally on each processor and relies on load
> balancing to achieve fairness across processors. It'd achieve strong
> fairness locally, but I'm not sure about its global fairness properties
> in an MP environment. If ideally the total load weight on each processor
> is always the same, then local fairness would imply global fairness, but
> this is a bin packing problem and is intractable ...
First, I am not assuming a global ordering of all tasks. As in the
current implementation, EEVDF would maintain virtual time locally for
each CPU. EEVDF is a proportional-share scheduler, so the relative
weight and actual cpu share of each task vary as tasks join and leave.
There is no bin-packing problem for such systems.

I understand that the bin-packing problem does exist in the real-time
world. Suppose a system has 2 cpus and there are 3 tasks, all of which
need to finish 30ms of work within a window of 50ms. Any 2 of them
together exceed the bandwidth of one cpu. There is a bin-packing
problem, unless the system is clever enough to break one of them down
into 2 requests of 15ms/25ms and execute them on different cpus at
different times without overlap, which is quite difficult :-)

In the proportional world, weights and cpu shares are scaled to fit the
bandwidth of a cpu. Therefore putting 2 of them on one cpu is fine, and
the fairness on each cpu is preserved. On the other hand, moving one
task back and forth between the 2 cpus does give better throughput and
better global fairness. I have not dug into the SMP load balancing
algorithms yet, so I leave that aside for now; first things first :-)

Thanks !

Ting

2007-05-03 03:18:58

by Ting Yang

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8


> On Wed, May 02, 2007 at 11:06:34PM +0530, Srivatsa Vaddagiri wrote:
>
>> There is also p->wait_runtime which is taken into account when
>> calculating p->fair_key. So if p3 had been waiting in the runqueue for
>> long before, it can get to run quicker than 10ms later.
>>
>
> Virtual time is time from the task's point of view, which it has spent
> executing. ->wait_runtime is a device to subtract out time spent on the
> runqueue but not running from what would otherwise be virtual time to
> express lag, whether deliberately or coincidentally. ->wait_runtime
> would not be useful for EEVDF AFAICT, though it may be interesting to
> report.
I just want to point out that ->wait_runtime, in fact, stores the lag of
each task in CFS, except that it is also used by other things and
occasionally tweaked (heuristically?). Under normal circumstances the
sum of the lags of all active tasks in such a system should be a
constant 0. The lag information is equally important to EEVDF, when some
tasks leave the system (become inactive) carrying a certain amount of
lag. The key point here is that we have to spread that lag (whether
negative or positive) over all remaining tasks, so that the fairness of
the system is preserved. I think the CFS implementation does not handle
this properly.
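
(A hypothetical sketch of what such handling could look like - the field
and function names are made up here, only to illustrate the invariant:
when a task leaves, its remaining lag is redistributed over the other
tasks in proportion to their weights, so that the sum of all lags stays
0.)

struct rq_task {
	double lag;	/* ideal service minus actual service */
	double weight;
};

/* Assumes at least one remaining task with nonzero total weight. */
static void spread_lag_on_exit(struct rq_task *leaving,
			       struct rq_task **remain, int n)
{
	double wsum = 0.0;
	int i;

	for (i = 0; i < n; i++)
		wsum += remain[i]->weight;

	for (i = 0; i < n; i++)
		remain[i]->lag += leaving->lag * remain[i]->weight / wsum;

	leaving->lag = 0.0;	/* the task departs with zero lag */
}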

I am running out of time today :-( I will write an email about CFS -v8
tomorrow, describing 2 issues in CFS I found related to this.

Ting

2007-05-03 08:20:42

by Zoltan Boszormenyi

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

Hi!

> * Balbir Singh <[email protected]> wrote:
>
> > The problem is with comparing s64 values with (s64)ULONG_MAX, which
> > evaluates to -1. Then we check if exec_delta64 and fair_delta64 are
> > greater than (s64)ULONG_MAX (-1), if so we assign (s64)ULONG_MAX to
> > the respective values.
>
> ah, indeed ...
>
> > The fix is to compare these values against (s64)LONG_MAX and assign
> > (s64)LONG_MAX to exec_delta64 and fair_delta64 if they are greater
> > than (s64)LONG_MAX.
> >
> > Tested on PowerPC, the regression is gone, tasks are load balanced as
> > they were in v7.
>
> thanks, applied!
>
> Ingo

I started up 12 glxgears to see the effect of CFS v8
on my Athlon64 X2 4200.

Without this patch all the GL load was handled by the second core,
the system only stressed the first core if other processes were also
started, i.e. a kernel compilation. With this patch the load is evenly
balanced across the two cores all the time. And while doing make -j4
on the kernel, the 12 gears are still spinning about 185+ FPS and
there are only slightly visible hiccups. Switching between workspaces,
i.e. refreshing the large windows of Thunderbird and Firefox, is done
very quickly; the whole system is exceptionally responsive.

Thanks for this great work!

Best regards,
Zoltán Böszörményi

2007-05-03 08:50:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8


* Ting Yang <[email protected]> wrote:

> Authors of this paper proposed a scheduler: Earlist Eligible Virtual
> Deadline First (EEVDF). EEVDF uses exactly the same method as CFS to
> track the execution of each running task. The only difference between
> EEVDF and CFS is that EEVDF tries to _deadline_ fair while CFS is
> _start-time_ fair. [...]

Well, this is a difference but note that it's far from being the 'only
difference' between CFS and EEVDF:

- in CFS you have to "earn" your right to be on the CPU, while EEVDF
gives out timeslices (quanta)

- EEVDF concentrates on real-time (SCHED_RR-alike) workloads where they
know the length of work units - while CFS does not need any knowledge
about 'future work', it measures 'past behavior' and makes its
decisions based on that. So CFS is purely 'history-based'.

- thus in CFS there's no concept of "deadline" either (the 'D' from
EEVDF).

- EEVDF seems to be calculating timeslices in units of milliseconds,
while CFS follows a very strict 'precise' accounting scheme on the
nanoseconds scale.

- the EEVDF paper is also silent on SMP issues.

- it seems EEVDF never existed as a kernel scheduler, it was a
user-space prototype under FreeBSD with simulated workloads. (have
they released that code at all?).

The main common ground seems to be that both CFS and EEVDF share the
view that the central metric is 'virtual time' proportional to the load
of the CPU (called the 'fair clock' in CFS) - but even for this the
details of the actual mechanism differ: EEVDF uses 1/N while CFS (since
-v8) uses a precise, smoothed and weighted load average that is close to
(and reuses portions of) Peter Williams's load metric used in smp-nice.

The EEVDF mechanism could perhaps be more appropriate for real-time
systems (the main target of their paper), while the CFS one i believe is
more appropriate for general purpose workloads.

So i'd say there's more in common between SD and CFS than between EEVDF
and CFS.

So ... it would certainly be interesting to try the EEVDF paper on top
of CFS (or whatever way you'd like to try it) and turn it into a real
kernel scheduler - the two mechanisms are quite dissimilar and
they could behave wildly differently on various workloads. Depending on
test results we could use bits of EEVDF's approaches in CFS too, if it
manages to out-schedule CFS :-)

(your observation about CFS's fork handling is correct nevertheless!)

Ingo

2007-05-03 10:19:31

by Bill Huey

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Wed, May 02, 2007 at 11:18:45PM -0400, Ting Yang wrote:
> I just want to point out that ->wait_runtime, in fact, stores the lag of
> each task in CFS, except that it is also used by other things and
> occasionally tweaked (heuristically?). Under normal circumstances the
> sum of the lags of all active tasks in such a system should be a
> constant 0. The lag information is equally important to EEVDF, when some
> tasks leave the system (become inactive) carrying a certain amount of
> lag. The key point here is that we have to spread that lag (whether
> negative or positive) over all remaining tasks, so that the fairness of
> the system is preserved. I think the CFS implementation does not handle
> this properly.
>
> I am running out of time today :-( I will write an email about CFS -v8
> tomorrow, describing 2 issues in CFS I found related to this.

Interesting. I haven't looked at the code carefully, but it wouldn't
surprise me if this were the case and it led to odd corner cases.

I'm eagerly awaiting your analysis and explanation.

bill

2007-05-03 13:02:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8


* Zoltan Boszormenyi <[email protected]> wrote:

> I started up 12 glxgears to see the effect of CFS v8 on my Athlon64 X2
> 4200.
>
> Without this patch all the GL load was handled by the second core, the
> system only stressed the first core if other processes were also
> started, i.e. a kernel compilation. With this patch the load is evenly
> balanced across the two cores all the time.

ok, i didnt realize that it would affect x86_64 too. I'll do a -v9
release with this fix included.

> [...] And while doing make -j4 on the kernel, the 12 gears are still
> spinning about 185+ FPS and there are only slightly visible hiccups.
> Switching between workspaces, i.e. refreshing the large windows of
> Thunderbird and Firefox, is done very quickly; the whole system is
> exceptionally responsive.

great! So it seems -v8 does improve OpenGL handling too :-)

> Thanks for this great work!

you are welcome :)

Ingo

2007-05-03 13:55:36

by Damien Wyart

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

Hello,

* Ingo Molnar <[email protected]> [2007-05-03 15:02]:
> great! So it seems -v8 does improve OpenGL handling too :-)

What are your thoughts/plans concerning merging CFS into mainline ? Is
it a bit premature to get it into 2.6.22 ? I remember Linus was ok to
change the default scheduler rapidly (the discussion was about sd at
that time) to get more testing than in -mm.

--
Damien Wyart

2007-05-03 14:23:23

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Thu, May 03, 2007 at 10:50:15AM +0200, Ingo Molnar wrote:
> - EEVDF concentrates on real-time (SCHED_RR-alike) workloads where they
> know the length of work units

This is what I was thinking when I wrote earlier that EEVDF expects each
task to specify the "length of each new request"
(http://lkml.org/lkml/2007/5/2/339).

The other observation that I have of EEVDF is that it tries to be fair
on the virtual time scale (each client will get 'wi' real units in 1
virtual unit), whereas sometimes fairness on the real-time scale also
matters. For ex: a multi-media app would call the scheduler fair to it
if it receives at least 1 ms of cpu time every 10 *real* milliseconds
(frame-time). A rogue user (or workload) that does a fork bomb may skew
this fairness for that multi-media app on the real-time scale under EEVDF?

--
Regards,
vatsa

2007-05-03 14:45:20

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Thu, May 03, 2007 at 03:29:32PM +0200, Damien Wyart wrote:
> What are your thoughts/plans concerning merging CFS into mainline ? Is
> it a bit premature to get it into 2.6.22 ? I remember Linus was ok to
> change the default scheduler rapidly (the discussion was about sd at
> that time) to get more testing than in -mm.

And what about group scheduling extensions? Do you have plans to work on
it? I was beginning to work on a prototype to do group scheduling based
on CFS, basically along the lines of what you and Linus had outlined
earlier:

http://lkml.org/lkml/2007/4/18/271
http://lkml.org/lkml/2007/4/18/244

--
Regards,
vatsa

2007-05-03 15:02:44

by Ting Yang

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8



Ingo Molnar wrote:
> * Ting Yang <[email protected]> wrote:
>
>
>> Authors of this paper proposed a scheduler: Earlist Eligible Virtual
>> Deadline First (EEVDF). EEVDF uses exactly the same method as CFS to
>> track the execution of each running task. The only difference between
>> EEVDF and CFS is that EEVDF tries to _deadline_ fair while CFS is
>> _start-time_ fair. [...]
>>
>
> Well, this is a difference but note that it's far from being the 'only
> difference' between CFS and EEVDF:
>
>
I re-ordered comments a little bit to make my discussion easier:
> - in CFS you have to "earn" your right to be on the CPU, while EEVDF
> gives out timeslices (quanta)
>
> - thus in CFS there's no concept of "deadline" either (the 'D' from
> EEVDF).
>
These two are needed to implement the deadline-fair policy, and are
therefore covered by the difference I mentioned. Because the authors use
different terms, the paper may lead to some confusion; three important
concepts are needed here:
    - the scheduling quantum *q* in the paper: the minimum unit in which
the scheduler can schedule tasks. With 1000 system ticks per second, q
is 1ms; with 100 ticks per second (old linux), q is 10ms.
    - the length of a request submitted by a task: the amount of work
needed to process the request. EEVDF does _not_ need this information
for scheduling at all.
    - the timeslice used to process requests, denoted *r* in the paper:
the unit the scheduler uses to process a request. If the length of a
request is larger than the timeslice r, the request is effectively
divided into pieces of length r.

The scheduler can use a suitable timeslice at will to handle a process.
Using the timeslice, it can estimate a deadline based on the weight of a
task, which tells how urgent the task is, and thereby remove the
potential O(n) lag (but this comes at a cost: more context switches).
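
(A quick worked example of that estimate, taken directly from the
vd = ve + r/w formula above: with a timeslice r = 10ms, a weight-2 task
gets the virtual deadline ve + 5ms while a weight-1 task gets ve + 10ms,
so the heavier task always comes up for execution sooner - without the
scheduler knowing anything about the total length of either task's
requests.)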
> - EEVDF concentrates on real-time (SCHED_RR-alike) workloads where they
> know the length of work units
That is _not_ really the case, although the authors of this paper used a
quite aggressive title and claimed so.
Look carefully at Theorem 3 and its corollary on pg 16; it says:
    If no request of client k is larger than a time quantum, then at any
time t the lag is bounded by
    -q < lag_k(t) < q
This gives three facts:
    1. For the system to be hard real-time, the length of any request
(note: not the quantum q or the timeslice r) of any task at any time
must be less than the scheduling quantum, which in modern Linux is 1ms.
This makes EEVDF effectively useless for hard real-time systems, where
applications cannot miss deadlines.
    2. For the system to be soft real-time, EEVDF first needs to fix the
bandwidth of a task, so that it does not vary as tasks join and leave.
Then EEVDF can give a soft real-time guarantee with a bounded delay of
max(timeslice r). The authors mentioned this, but unfortunately this
feature was never actually implemented.
    3. EEVDF does give a proportional time-share system in which no task
running on it is delayed by more than max(timeslice r) w.r.t. the ideal
fluid-flow system.

The authors of this paper only developed a framework that could possibly
be used by real-time systems (either hard or soft); their main attempt
was to glue real-time and proportional time-sharing together. The
real-time part of this paper does not really ring the bell for me.
However, isn't fact 3 listed above exactly what we need?
> - while CFS does not need any knowledge
> about 'future work', it measures 'past behavior' and makes its
> decisions based on that. So CFS is purely 'history-based'.
>
As I explained above, EEVDF does not need any future knowledge; instead
it uses a unit (the timeslice) to process requests from tasks,
predicting the deadline by which this unit has to be finished.

There is a firm theoretical basis behind this "guess", which does not
appear in the paper: as long as the total bandwidth needed by all active
tasks does not exceed the bandwidth of the CPU, then if the predicted
virtual deadline is mapped back to a real time t in the system, that t
is exactly the deadline for the chosen unit in the ideal system. Since
EEVDF is proportional-share, the bandwidth of each task is scaled
according to its weight, and the sum of those bandwidths does not exceed
the cpu capacity. Therefore the "guess" of EEVDF is always right :-)
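
(To illustrate the mapping in the notation used so far: virtual time
advances at rate 1/sum(w_i), so a virtual interval of length r/w_k
corresponds to a real interval of (r/w_k) * sum(w_i) - which is exactly
how long the ideal fluid-flow system, giving task k the share
w_k/sum(w_i) of the CPU, takes to complete r units of its work.)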

Predicting the future with weight taken into account makes it choose
better tasks to execute. A smaller timeslice makes it track the ideal
system more closely, at the cost of more context switches.
> - EEVDF seems to be calculating timeslices in units of milliseconds,
> while CFS follows a very strict 'precise' accounting scheme on the
> nanoseconds scale.
>
This is just an implementation issue, right? It is possible to implement
EEVDF at a finer granularity.
> - the EEVDF paper is also silent on SMP issues.
>
Yes, you are right. I do not have much knowledge about load balancing.
> - it seems EEVDF never existed as a kernel scheduler, it was a
> user-space prototype under FreeBSD with simulated workloads. (have
> they released that code at all?).
>
They never released the code (quite typical behavior of systems
researchers :-)), and that is why I asked whether it was worth a try.
> The main common ground seems to be that both CFS and EEVDF share the
> view that the central metric is 'virtual time' proportional to the load
> of the CPU (called the 'fair clock' in CFS) - but even for this the
> details of the actual mechanism differ: EEVDF uses 1/N while CFS (since
> -v8) uses a precise, smoothed and weighted load average that is close to
> (and reuses portions of) Peter Williams's load metric used in smp-nice.
>
>
Thanks a lot for giving your opinions; it is very nice discussing this with you.

Ting

2007-05-03 15:19:13

by Ting Yang

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8


Srivatsa Vaddagiri wrote:
> On Thu, May 03, 2007 at 10:50:15AM +0200, Ingo Molnar wrote:
>
>> - EEVDF concentrates on real-time (SCHED_RR-alike) workloads where they
>> know the length of work units
>>
>
> This is what I was thinking when I wrote earlier that EEVDF expects each
> task to specify the "length of each new request"
> (http://lkml.org/lkml/2007/5/2/339).
>
That is not quite true, based on my understanding of EEVDF; please look
at the email I just sent to Ingo for an explanation.
> The other observation that I have of EEVDF is that it tries to be fair
> on the virtual time scale (each client will get 'wi' real units in 1
> virtual unit), whereas sometimes fairness on the real-time scale also
> matters. For ex: a multi-media app would call the scheduler fair to it
> if it receives at least 1 ms of cpu time every 10 *real* milliseconds
> (frame-time). A rogue user (or workload) that does a fork bomb may skew
> this fairness for that multi-media app on the real-time scale under
> EEVDF?
>
First of all, CFS does not seem to address this issue either. This is a
typical real-time or soft real-time question: not only does the
bandwidth of a task have to be fixed, e.g. 10% of cpu bandwidth (which
proportional-share systems like CFS and EEVDF do not do), but the work
also needs to meet a deadline.
In both CFS and EEVDF, the scheduler would have to keep tweaking weights
to give an application a fixed bandwidth. The authors of EEVDF claim
this can be done, but it was never implemented :-(

Ting

2007-05-03 15:53:26

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Thu, May 03, 2007 at 03:29:32PM +0200, Damien Wyart wrote:
>> What are your thoughts/plans concerning merging CFS into mainline ? Is
>> it a bit premature to get it into 2.6.22 ? I remember Linus was ok to
>> change the default scheduler rapidly (the discussion was about sd at
>> that time) to get more testing than in -mm.

On Thu, May 03, 2007 at 08:23:18PM +0530, Srivatsa Vaddagiri wrote:
> And what about group scheduling extensions? Do you have plans to work on
> it? I was beginning to work on a prototype to do group scheduling based
> on CFS, basically along the lines of what you and Linus had outlined
> earlier:
> http://lkml.org/lkml/2007/4/18/271
> http://lkml.org/lkml/2007/4/18/244

Tong Li's Trio scheduler does a bit of this, though it doesn't seem to
have the mindshare cfs seems to have acquired.

The hyperlink seems to have broken, though:
http://www.cs.duke.edu/~tongli/linux/linux-2.6.19.2-trio.patch


-- wli

2007-05-03 18:44:42

by Tong Li

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Thu, 2007-05-03 at 08:53 -0700, William Lee Irwin III wrote:
> On Thu, May 03, 2007 at 03:29:32PM +0200, Damien Wyart wrote:
> >> What are your thoughts/plans concerning merging CFS into mainline ? Is
> >> it a bit premature to get it into 2.6.22 ? I remember Linus was ok to
> >> change the default scheduler rapidly (the discussion was about sd at
> >> that time) to get more testing than in -mm.
>
> On Thu, May 03, 2007 at 08:23:18PM +0530, Srivatsa Vaddagiri wrote:
> > And what about group scheduling extensions? Do you have plans to work on
> > it? I was beginning to work on a prototype to do group scheduling based
> > on CFS, basically along the lines of what you and Linus had outlined
> > earlier:
> > http://lkml.org/lkml/2007/4/18/271
> > http://lkml.org/lkml/2007/4/18/244
>
> Tong Li's Trio scheduler does a bit of this, though it doesn't seem to
> have the mindshare cfs seems to have acquired.
>
> The hyperlink seems to have broken, though:
> http://www.cs.duke.edu/~tongli/linux/linux-2.6.19.2-trio.patch

Yeah, I just fixed the link. The code was written based on the 2.6.19.2
scheduler. I could forward-port it to the latest kernel if there's
interest and I can find time. :) Here's a description of the design:

http://lkml.org/lkml/2007/4/20/286

Thanks,

tong

2007-05-03 19:51:40

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Thu, 2007-05-03 at 08:53 -0700, William Lee Irwin III wrote:
>> Tong Li's Trio scheduler does a bit of this, though it doesn't seem to
>> have the mindshare cfs seems to have acquired.
>> The hyperlink seems to have broken, though:
>> http://www.cs.duke.edu/~tongli/linux/linux-2.6.19.2-trio.patch

On Thu, May 03, 2007 at 11:44:27AM -0700, Li, Tong N wrote:
> Yeah, I just fixed the link. The code was written based on the 2.6.19.2
> scheduler. I could forward-port it to the latest kernel if there's
> interest and I can find time. :) Here's a description of the design:
> http://lkml.org/lkml/2007/4/20/286

I have the interest. I don't see performance issues with any of the
schedulers, but am rather interested in the feature.


-- wli

2007-05-05 08:31:44

by Esben Nielsen

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8



On Wed, 2 May 2007, Ingo Molnar wrote:

>
> * Balbir Singh <[email protected]> wrote:
>
>> The problem is with comparing s64 values with (s64)ULONG_MAX, which
>> evaluates to -1. Then we check if exec_delta64 and fair_delta64 are
>> greater than (s64)ULONG_MAX (-1), if so we assign (s64)ULONG_MAX to
>> the respective values.
>
> ah, indeed ...
>
>> The fix is to compare these values against (s64)LONG_MAX and assign
>> (s64)LONG_MAX to exec_delta64 and fair_delta64 if they are greater
>> than (s64)LONG_MAX.
>>
>> Tested on PowerPC, the regression is gone, tasks are load balanced as
>> they were in v7.
>
> thanks, applied!
>
> Ingo

I have been wondering why you use unsigned for timers anyway. It is also
like that in hrtimers. Why not use signed and avoid (almost) all worries
about wrap-around issues? The trick is that when every
a < b
is replaced by
a - b < 0
the code will work on all two's-complement machines even if the (signed!)
integers a and b wrap around.

In both the hrtimer and CFS patches, 32 bit timers could be used
internally on 32 bit architectures to avoid expensive 64 bit integer
calculations. The only catch is that timeouts cannot be bigger than 2^31
in the chosen units.

I have successfully coded a (much more primitive) hrtimer system for
another OS on both ARM and PPC using the above trick in my former job.
On both I used the machine's internal clock as the internal
representation of time and I only scaled to a struct timespec in the user
interface.

Esben

2007-05-05 17:45:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8



On Sat, 5 May 2007, Esben Nielsen wrote:
>
> I have been wondering why you use unsigned for timers anyway. It is also
> like that in hrtimers. Why not use signed and avoid (almost) all worries
> about wrap-around issues? The trick is that when every
> a < b
> is replaced by
> a - b < 0
> the code will work on all two's-complement machines even if the (signed!)
> integers a and b wrap around.

No. BOTH of the above are buggy.

The C language definition doesn't allow signed integers to wrap (ie it's
undefined behaviour), so "a-b < 0" can be rewritten by the compiler as a
simple signed "a < b".

And the unsigned (or signed) "a < b" is just broken wrt any kind of
wrap-around (whether wrapping around zero or the sign bit).

So the _only_ valid way to handle timers is to
- either not allow wrapping at all (in which case "unsigned" is better,
since it is bigger)
- or use wrapping explicitly, and use unsigned arithmetic (which is
well-defined in C) and do something like "(long)(a-b) > 0".

Notice? The signed variant is basically _never_ correct.
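
(A minimal sketch of that second, wrap-safe form - the helper name here
is made up; the kernel's time_after() macros in include/linux/jiffies.h
are built on the same construct:)

/* Wrap-safe "a is after b" for a free-running unsigned counter.
 * The unsigned subtraction is well-defined modulo 2^N in C; casting
 * the difference to the signed type then tests its sign bit.  Valid
 * as long as a and b are less than half the counter range apart. */
static inline int counter_after(unsigned long a, unsigned long b)
{
	return (long)(a - b) > 0;
}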

Linus

2007-05-06 08:30:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8


* Linus Torvalds <[email protected]> wrote:

> So the _only_ valid way to handle timers is to
> - either not allow wrapping at all (in which case "unsigned" is better,
> since it is bigger)
> - or use wrapping explicitly, and use unsigned arithmetic (which is
> well-defined in C) and do something like "(long)(a-b) > 0".

hm, there is a corner-case in CFS where a fix like this is necessary.

CFS uses 64-bit values for almost everything, and the majority of values
are of 'relative' nature with no danger of overflow. (They are signed
because they are relative values that center around zero and can be
negative or positive.)

there is one value that is of 'absolute' nature though: the 'fair clock'
(which is the same as the normal system clock when load is 1.0, and it
slows down in a load-proportional way as load increases), which has
units of nanoseconds - and the 'keys' in the rbtree are derived from
such clock values. That particular clock could produce an s64 overflow
at around 292 years after bootup (a bit later if the system is also used
for something in that timeframe). While i dont think that's realistic to
worry about, the fix below should solve that theoretical problem too.

Ingo

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -60,7 +60,7 @@ static inline void __enqueue_task_fair(s
* We dont care about collisions. Nodes with
* the same key stay together.
*/
- if (key < entry->fair_key) {
+ if ((s64)(entry->fair_key - key) > 0) {
link = &parent->rb_left;
} else {
link = &parent->rb_right;

2007-05-06 08:37:00

by Willy Tarreau

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

Hi Ingo,

On Sun, May 06, 2007 at 10:29:11AM +0200, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> > So the _only_ valid way to handle timers is to
> > - either not allow wrapping at all (in which case "unsigned" is better,
> > since it is bigger)
> > - or use wrapping explicitly, and use unsigned arithmetic (which is
> > well-defined in C) and do something like "(long)(a-b) > 0".
>
> hm, there is a corner-case in CFS where a fix like this is necessary.
>
> CFS uses 64-bit values for almost everything, and the majority of values
> are of 'relative' nature with no danger of overflow. (They are signed
> because they are relative values that center around zero and can be
> negative or positive.)

(...)

> - if (key < entry->fair_key) {
> + if ((s64)(entry->fair_key - key) > 0) {

Just a hint: while your code here is correct, it is good practice to
check against < 0 instead, so that if for any reason you sometimes forget
to cast to signed, the compiler will emit a warning stating that the
condition is always false. This would simply become:

- if (key < entry->fair_key) {
+ if ((s64)(key - entry->fair_key) < 0) {

Just my .02 euros :-)
Willy

2007-05-06 08:52:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8


* Willy Tarreau <[email protected]> wrote:

> Just a hint: while your code here is correct, it is good practice to
> check against < 0 instead, so that if for any reason you sometimes
> forget to cast to signed, the compiler will emit a warning stating
> that the condition is always false. This would simply become:
>
> - if (key < entry->fair_key) {
> + if ((s64)(key - entry->fair_key) < 0) {

done :-) (Btw., the key is still s64 so the cast is technically
unnecessary - but now the code would be correct with the key being u64
too.)

Ingo

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -60,7 +60,7 @@ static inline void __enqueue_task_fair(s
* We dont care about collisions. Nodes with
* the same key stay together.
*/
- if (key < entry->fair_key) {
+ if ((s64)(key - entry->fair_key) < 0) {
link = &parent->rb_left;
} else {
link = &parent->rb_right;

2007-05-06 17:46:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8



On Sun, 6 May 2007, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> > So the _only_ valid way to handle timers is to
> > - either not allow wrapping at all (in which case "unsigned" is better,
> > since it is bigger)
> > - or use wrapping explicitly, and use unsigned arithmetic (which is
> > well-defined in C) and do something like "(long)(a-b) > 0".
>
> hm, there is a corner-case in CFS where a fix like this is necessary.
>
> CFS uses 64-bit values for almost everything, and the majority of values
> are of 'relative' nature with no danger of overflow. (They are signed
> because they are relative values that center around zero and can be
> negative or positive.)

Well, I'd like to just worry about that for a while.

You say there is "no danger of overflow", and I mostly agree that once
we're talking about 64-bit values, the overflow issue simply doesn't
exist, and furthermore the difference between 63 and 64 bits is not really
relevant, so there's no major reason to actively avoid signed entries.

So in that sense, it all sounds perfectly sane. And I'm definitely not
sure your "292 years after bootup" worry is really worth even considering.

When we're really so well off that we expect the hardware and software
stack to be stable over a hundred years, I'd start to think about issues
like that, in the meantime, to me worrying about those kinds of issues
just means that you're worrying about the wrong things.

BUT.

There's a fundamental reason relative timestamps are difficult and almost
always have overflow issues: the "long long in the future" case as an
approximation of "infinite timeout" is almost always relevant.

So rather than worry about the system staying up 292 years, I'd worry
about whether people pass in big numbers (like some MAX_S64 approximation)
as an approximation for "infinite", and once you have things like that,
the "64 bits never overflows" argument is totally bogus.

There's a damn good reason for using only *absolute* time. The whole
"signed values of relative time" may _sound_ good, but it really sucks in
subtle and horrible ways!

Linus

2007-05-07 00:05:17

by Bill Davidsen

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

Damien Wyart wrote:
> Hello,
>
> * Ingo Molnar <[email protected]> [2007-05-03 15:02]:
>> great! So it seems -v8 does improve OpenGL handling too :-)
>
> What are your thoughts/plans concerning merging CFS into mainline ? Is
> it a bit premature to get it into 2.6.22 ? I remember Linus was ok to
> change the default scheduler rapidly (the discussion was about sd at
> that time) to get more testing than in -mm.
>
It would seem that a number of us are testing the variants and reporting
results (I now have to pull all the new versions and redo stuff, but I
have a new test I'm going to run as well).

Given that, and the fact that there is another idea from Tong Li which
can be tested, I think that decision can be postponed, Linus willing.
There appear to be enough testers to be driving evolution now, authors
may disagree, of course.

Several folks added to cc list...

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

2007-05-07 11:09:58

by Esben Nielsen

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8



On Sat, 5 May 2007, Linus Torvalds wrote:

>
>
> On Sat, 5 May 2007, Esben Nielsen wrote:
>>
>> I have been wondering why you use unsigned for timers anyway. It is also
>> like that in hrtimers. Why not use signed and avoid (almost) all worries
>> about wrap-around issues? The trick is that when every
>> a < b
>> is replaced by
>> a - b < 0
>> the code will work on all two's-complement machines even if the (signed!)
>> integers a and b wrap around.
>
> No. BOTH of the above are buggy.
>
> The C language definition doesn't allow signed integers to wrap (ie it's
> undefined behaviour), so "a-b < 0" can be rewritten by the compiler as a
> simple signed "a < b".
>
> And the unsigned (or signed) "a < b" is just broken wrt any kind of
> wrap-around (whether wrapping around zero or the sign bit).
>
> So the _only_ valid way to handle timers is to
> - either not allow wrapping at all (in which case "unsigned" is better,
> since it is bigger)
> - or use wrapping explicitly, and use unsigned arithmetic (which is
> well-defined in C) and do something like "(long)(a-b) > 0".
>
> Notice? The signed variant is basically _never_ correct.
>

What is (long)(a-b)? I have tried to look it up in the C99 standard but
I can't find it. Maybe it is in the referenced LIA-1 standard, which I
can't find with google.

I think the best would be to use "a-b > ULONG_MAX/2" when you mean "a<b"
as that should be completely portable.

According to C99 Appendix H2.2
(http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf) an
implementation can choose to do modulo signed integers, as is
mandatory for unsigned integers. If an implementation has chosen
to do that, it must be a bug to do the "a-b < 0" -> "a<b" optimization.

I have never experienced a compiler/architecture combination _not_ doing
wrapped signed integers.

Esben


> Linus
>

2007-05-07 11:30:56

by Esben Nielsen

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8



On Sun, 6 May 2007, Linus Torvalds wrote:

>
>
> On Sun, 6 May 2007, Ingo Molnar wrote:
>>
>> * Linus Torvalds <[email protected]> wrote:
>>
>>> So the _only_ valid way to handle timers is to
>>> - either not allow wrapping at all (in which case "unsigned" is better,
>>> since it is bigger)
>>> - or use wrapping explicitly, and use unsigned arithmetic (which is
>>> well-defined in C) and do something like "(long)(a-b) > 0".
>>
>> hm, there is a corner-case in CFS where a fix like this is necessary.
>>
>> CFS uses 64-bit values for almost everything, and the majority of values
>> are of 'relative' nature with no danger of overflow. (They are signed
>> because they are relative values that center around zero and can be
>> negative or positive.)
>
> Well, I'd like to just worry about that for a while.
>
> You say there is "no danger of overflow", and I mostly agree that once
> we're talking about 64-bit values, the overflow issue simply doesn't
> exist, and furthermore the difference between 63 and 64 bits is not really
> relevant, so there's no major reason to actively avoid signed entries.
>
> So in that sense, it all sounds perfectly sane. And I'm definitely not
> sure your "292 years after bootup" worry is really worth even considering.
>

I would hate to tell mission control for Mankind's first mission to another
star to reboot every 200 years because "there is no need to worry about it."

As a matter of principle an OS should never need a reboot (with exception
for upgrading). If you say you have to reboot every 200 years, why not
every 100? Every 50? .... Every 45 days (you know what I am referring
to :-) ?

> When we're really so well off that we expect the hardware and software
> stack to be stable over a hundred years, I'd start to think about issues
> like that, in the meantime, to me worrying about those kinds of issues
> just means that you're worrying about the wrong things.
>
> BUT.
>
> There's a fundamental reason relative timestamps are difficult and almost
> always have overflow issues: the "long long in the future" case as an
> approximation of "infinite timeout" is almost always relevant.
>
> So rather than worry about the system staying up 292 years, I'd worry
> about whether people pass in big numbers (like some MAX_S64 approximation)
> as an approximation for "infinite", and once you have things like that,
> the "64 bits never overflows" argument is totally bogus.
>
> There's a damn good reason for using only *absolute* time. The whole
> "signed values of relative time" may _sound_ good, but it really sucks in
> subtle and horrible ways!
>

I think you are wrong here. The only place you need absolute time is
for the clock (CLOCK_REALTIME). You waste CPU using a 64 bit
representation when you could have used a 32 bit one. With a 32 bit
implementation you are forced to handle the corner cases of wrap-around
and too-big arguments up front. With a 64 bit one you hide those
problems.

I think CFS would be best off using a 32 bit timer counting in
microseconds. That would wrap around in 72 minutes. But as the timers are
relative, you will never be able to specify a timer more than 36 minutes
in the future. And 36 minutes is ridiculously long for a scheduler, so a
simple test limiting time values to that range would not break anything.

Esben

> Linus
>

2007-05-07 14:15:26

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Thu, May 03, 2007 at 08:53:47AM -0700, William Lee Irwin III wrote:
> On Thu, May 03, 2007 at 08:23:18PM +0530, Srivatsa Vaddagiri wrote:
> > And what about group scheduling extensions? Do you have plans to work on
> > it? I was beginning to work on a prototype to do group scheduling based
> > on CFS, basically along the lines of what you and Linus had outlined
> > earlier:
> > http://lkml.org/lkml/2007/4/18/271
> > http://lkml.org/lkml/2007/4/18/244
>
> Tong Li's Trio scheduler does a bit of this, though it doesn't seem to
> have the mindshare cfs seems to have acquired.
>
> The hyperlink seems to have broken, though:
> http://www.cs.duke.edu/~tongli/linux/linux-2.6.19.2-trio.patch

The big question I have is, how well does DWRR fit into the "currently hot"
scheduling frameworks like CFS? For ex: if the goal is to do
fair (group) scheduling of SCHED_NORMAL tasks, can CFS and DWRR co-exist?
Both seem to be radically different algorithms and my initial impressions
of them co-existing is "No", but would be glad to be corrected if I am
wrong. If they can't co-exist, then we need a different way of doing
group scheduling on top of CFS, as that is gaining more popularity on
account of better handling of interactivity.

Tong,
I understand a central hallmark of DWRR is SMP fairness.
Have you considered how good or bad the other alternative for achieving
SMP fairness in vogue today is: pressure/weight based balancing (ex:
smpnice and the CKRM CPU scheduler -
ckrm.sourceforge.net/downloads/ckrm-ols03-slides.pdf)?

--
Regards,
vatsa

2007-05-07 16:00:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8


* Esben Nielsen <[email protected]> wrote:

> I would hate to tell mission control for Mankind's first mission to
> another star to reboot every 200 years because "there is no need to
> worry about it."

ok, i need no other argument - fixed :-)

(btw., if anyone from the planning committee of Mankind's first mission
to another star has any other particular wish about CFS's design, i'm
all ears and it's going to get the highest possible priority!)

i really mean it! (And i guess you discovered my weak spot. Darn :)
Hopefully they wont ask for STREAMS support, that would put us into a
_really_ difficult position.)

Ingo

2007-05-07 16:12:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8



On Mon, 7 May 2007, Esben Nielsen wrote:
>
> I would hate to tell mission control for Mankind's first mission to another
> star to reboot every 200 years because "there is no need to worry about it."

I'd like to put it another way:

- if your mission to another star *depends* on every single piece of
complex equipment staying up with zero reboots for 200+ years, you have
some serious technology problems.

In other words: your argument is populist, and totally silly.

Trust me, if you are going to another star, you'd better have the
capabilities to handle bugs. You'd better have multiple fail-over etc.

A notion of "robustness" cannot and must not hinge on "no bugs". That's
unrealistic.

> As a matter of principle an OS should never need a reboot (with exception for
> upgrading). If you say you have to reboot every 200 years, why not every 100?
> Every 50? .... Every 45 days (you know what I am referring to :-) ?

There's something of a difference between 45 days and 292 years.

I'm not saying we can't get there, but worrying about it in the current
state is just stupid. It's not just looking at the trees and not seeing
the forest, it's looking at one *leaf*, and missing the forest.

And quite frankly, if you work for NASA and are aiming for the stars,
you'd better not be that person who looks at one leaf and cannot see the
forest around you. That's the kind of thing that makes you miss the
difference between miles and kilometers.

Linus

2007-05-07 16:30:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8



On Mon, 7 May 2007, Esben Nielsen wrote:
>
> What is (long)(a-b)? I have tried to look it up in the C99 standard but I
> can't find it. Maybe it is in the referenced LIA-1 standard, which I can't
> find with google.

I don't worry about non-2's-complement machines (they don't exist, and
likely won't exist in the future either).

So I worry about compilers rewriting my code.

So "(long)(a-b) < 0" (with "a" and "b" being unsigned long) is basically a
portable way of testing the high bit of the result.

> I think the best would be to use "a-b > ULONG_MAX/2" when you mean "a<b" as
> that should be completely portable.

That certainly works too, but the difference is irrelevant, since Linux is
unlikely to work on insane machines anyway (ie we do make a lot of other
assumptions about the architecture, being two's-complement is the least of
those).

So you basically shouldn't worry about hardware: everybody is pretty much
the same. You should worry about *compilers* - that's where the
differences show up.

So "(long)(a-b)" may be "implementation defined" (but since
implementations are all 2's complement, we don't care), but a signed
"(a-b)" that over/overflows is *undefined*, and that is much worse because
it means that the compiler can do some funky stuff, and _that_ is a real
practical worry.

And no, I also don't worry about porting Linux to 18-bit machines, or to
ternary CPU's.

Linus

2007-05-07 18:39:38

by Johannes Stezenbach

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Mon, May 07, 2007, Linus Torvalds wrote:
> On Mon, 7 May 2007, Esben Nielsen wrote:
> >
> > What is (long)(a-b)? I have tried to look it up in the C99 standard but I
> > can't find it. Maybe it is in the referenced LIA-1 standard, which I can't
> > find with google.

C99 defines unsigned overflow semantics, but it doesn't say anything
about signed overflow, thus it's undefined -- which is why you have a
hard time finding it.

However, I have no clue *why* it's undefined and not
implementation defined. Does someone know?

> I don't worry about non-2's-complement machines (they don't exist, and
> likely won't exist in the future either).

I think DSPs can do saturated arithmetic (clamp to min/max
values instead of wrapping around). Not that it matters for Linux...

> So I worry about compilers rewriting my code.

gcc has -fwrapv and -ftrapv to change signed integer overflow
behaviour.

One baffling example where gcc rewrites code is when
conditionals depend on signed integer overflow:

$ cat xx.c
#include <assert.h>

int foo(int a)
{
	/* signed overflow is undefined, so gcc may treat this
	 * check as always true and drop it entirely */
	assert(a + 100 > a);
	return a;
}

int bar(int a)
{
	/* same assumption: the guard folds to "always true" */
	if (a + 100 > a)
		a += 100;
	return a;
}
$ gcc -Wall -Wextra -fomit-frame-pointer -c xx.c
$ objdump -dr xx.o

xx.o: file format elf32-i386

Disassembly of section .text:

00000000 <foo>:
0: 8b 44 24 04 mov 0x4(%esp),%eax
4: c3 ret

00000005 <bar>:
5: 83 44 24 04 64 addl $0x64,0x4(%esp)
a: 8b 44 24 04 mov 0x4(%esp),%eax
e: c3 ret


The assert and the condition were just dropped
by gcc -- without any warning.

gcc-4.2 will add -fstrict-overflow and -Wstrict-overflow.
http://gcc.gnu.org/gcc-4.2/changes.html


Johannes

2007-05-07 18:57:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8



On Mon, 7 May 2007, Johannes Stezenbach wrote:
>
> One baffling example where gcc rewrites code is when
> conditionals depend on signed integer overflow:

Yes. This is one of my favourite beefs with gcc. Some of the optimization
decisions seem to make no sense.

Your example is a good one, but my private beef has been in alias
handling. Alias analysis is an important part of optimization, and there's
two kinds: the static (and exact, aka "safe") kind that you can do
regardless of any language definitions, because you *know* that you aren't
actually changing behaviour, and the additional type-based heuristics that
the C language allows.

So which ones would you expect a compiler to consider more important?

And which one do you think gcc will use?

Right. You can have static analysis that *guarantees* that two objects
alias, but if gcc determins that they have different types and thus might
not alias, it decides to use the heuristic instead of the firm knowledge,
and generate code that doesn't work.

"Because the language definition allows it".

Oh well.

Linus

2007-05-07 20:54:15

by Tong Li

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Mon, 2007-05-07 at 19:52 +0530, Srivatsa Vaddagiri wrote:
> On Thu, May 03, 2007 at 08:53:47AM -0700, William Lee Irwin III wrote:
> > On Thu, May 03, 2007 at 08:23:18PM +0530, Srivatsa Vaddagiri wrote:
> > > And what about group scheduling extensions? Do you have plans to work on
> > > it? I was beginning to work on a prototype to do group scheduling based
> > > on CFS, basically along the lines of what you and Linus had outlined
> > > earlier:
> > > http://lkml.org/lkml/2007/4/18/271
> > > http://lkml.org/lkml/2007/4/18/244
> >
> > Tong Li's Trio scheduler does a bit of this, though it doesn't seem to
> > have the mindshare cfs seems to have acquired.
> >
> > The hyperlink seems to have broken, though:
> > http://www.cs.duke.edu/~tongli/linux/linux-2.6.19.2-trio.patch
>
> The big question I have is, how well does DWRR fit into the "currently hot"
> scheduling frameworks like CFS? For ex: if the goal is to do
> fair (group) scheduling of SCHED_NORMAL tasks, can CFS and DWRR co-exist?
> Both seem to be radically different algorithms and my initial impressions
> of them co-existing is "No", but would be glad to be corrected if I am
> wrong. If they can't co-exist, then we need a different way of doing
> group scheduling on top of CFS, as that is gaining more popularity on
> account of better handling of interactivity.

Yeah, the intent of DWRR was to provide proportional fairness and rely
on the underlying scheduler to support interactivity. In a way, DWRR
ensures that each task receives its fair share, while the underlying
scheduler determines the right order to run the tasks. Since SD is
structurally similar to the stock scheduler, DWRR should co-exist with
it easily. Co-existing with CFS requires more work, but I think the
round-robin mechanism in DWRR could be applicable to CFS to facilitate
cross-processor fairness.

> Tong,
> I understand a central hallmark of DWRR is SMP fairness.
> Have you considered how good or bad the other alternative for achieving
> SMP fairness in vogue today is: pressure/weight based balancing (ex:
> smpnice and the CKRM CPU scheduler -
> ckrm.sourceforge.net/downloads/ckrm-ols03-slides.pdf)?
>

The disadvantage of DWRR is its potential overhead, and the advantage is
that it can provide stronger fairness. If we have 2 processors and 3
tasks, DWRR ensures that each task gets 66% of a CPU, while
smpnice would keep two tasks on the same processor and the third one on
another. I did an analysis and showed that the lag bound of DWRR is
constant if task weights are bounded by a constant. On the other hand,
the cost DWRR pays is that it requires more task migrations. I tested
with a set of benchmarks on an SMP and didn't see migrations were
causing much performance impact, but this is certainly a big issue for
NUMA.

tong

PS. I'm now porting the code to the latest kernel and will post as soon
as I'm done.

2007-05-08 00:35:50

by Peter Williams

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

Esben Nielsen wrote:
>
>
> On Sun, 6 May 2007, Linus Torvalds wrote:
>
>>
>>
>> On Sun, 6 May 2007, Ingo Molnar wrote:
>>>
>>> * Linus Torvalds <[email protected]> wrote:
>>>
>>>> So the _only_ valid way to handle timers is to
>>>> - either not allow wrapping at all (in which case "unsigned" is
>>>> better,
>>>> since it is bigger)
>>>> - or use wrapping explicitly, and use unsigned arithmetic (which is
>>>> well-defined in C) and do something like "(long)(a-b) > 0".
>>>
>>> hm, there is a corner-case in CFS where a fix like this is necessary.
>>>
>>> CFS uses 64-bit values for almost everything, and the majority of values
>>> are of 'relative' nature with no danger of overflow. (They are signed
>>> because they are relative values that center around zero and can be
>>> negative or positive.)
>>
>> Well, I'd like to just worry about that for a while.
>>
>> You say there is "no danger of overflow", and I mostly agree that once
>> we're talking about 64-bit values, the overflow issue simply doesn't
>> exist, and furthermore the difference between 63 and 64 bits is not
>> really
>> relevant, so there's no major reason to actively avoid signed entries.
>>
>> So in that sense, it all sounds perfectly sane. And I'm definitely not
>> sure your "292 years after bootup" worry is really worth even
>> considering.
>>
>
> I would hate to tell mission control for Mankind's first mission to another
> star to reboot every 200 years because "there is no need to worry about
> it."
>
> As a matter of principle an OS should never need a reboot (with
> exception for upgrading). If you say you have to reboot every 200 years,
> why not every 100? Every 50? .... Every 45 days (you know what I am
> referring to :-) ?

There's always going to be an upper limit on the representation of time.
At least until we figure out how to implement infinity properly.

>
>> When we're really so well off that we expect the hardware and software
>> stack to be stable over a hundred years, I'd start to think about issues
>> like that, in the meantime, to me worrying about those kinds of issues
>> just means that you're worrying about the wrong things.
>>
>> BUT.
>>
>> There's a fundamental reason relative timestamps are difficult and almost
>> always have overflow issues: the "long long in the future" case as an
>> approximation of "infinite timeout" is almost always relevant.
>>
>> So rather than worry about the system staying up 292 years, I'd worry
>> about whether people pass in big numbers (like some MAX_S64
>> approximation)
>> as an approximation for "infinite", and once you have things like that,
>> the "64 bits never overflows" argument is totally bogus.
>>
>> There's a damn good reason for using only *absolute* time. The whole
>> "signed values of relative time" may _sound_ good, but it really sucks in
>> subtle and horrible ways!
>>
>
> I think you are wrong here. The only place you need absolute time is
> for the clock (CLOCK_REALTIME). You waste CPU using a 64 bit
> representation when you could have used a 32 bit. With a 32 bit
> implementation you are forced to handle the corner cases with wrap
> around and too big arguments up front. With a 64 bit you hide those
> problems.

As does the other method. A 32 bit signed offset with a 32 bit base is
just a crude version of 64 bit absolute time.
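
(For reference, the wrap-safe comparison Linus suggests above is what the
kernel's time_after()/time_before() macros already do for jiffies -- a
minimal sketch; later_than() is a made-up name:)

/*
 * Wrap-safe "did a happen after b?" for a free-running unsigned counter.
 * The unsigned subtraction is well-defined modulo the counter range; the
 * signed cast then gives the right answer as long as the two timestamps
 * are less than half the counter range apart. Note that the cast itself
 * relies on the two's-complement behaviour gcc provides, which is
 * exactly the subject of this thread.
 */
static inline int later_than(unsigned long a, unsigned long b)
{
    return (long)(a - b) > 0;
}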

>
> I think CFS would be best off using a 32 bit timer counting in
> microseconds. That would wrap around in 72 minutes. But as the timers are
> relative you will never be able to specify a timer larger than 36
> minutes in the future. But 36 minutes is ridiculously long for a
> scheduler and a simple test limiting time values to that value would not
> break anything.

Except if you're measuring sleep times. I think that you'll find lots
of tasks sleep for more than 72 minutes.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-08 05:39:29

by Matt Mackall

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Mon, May 07, 2007 at 09:28:32AM -0700, Linus Torvalds wrote:
>
>
> On Mon, 7 May 2007, Esben Nielsen wrote:
> >
> > What is (long)(a-b) ? I have tried to look it up in the C99 standard but I
> > can't find it. Maybe it is in the referred LIA-1 standard, which I can't
> > find with google.
>
> I don't worry about non-2's-complement machines (they don't exist, and
> likely won't exist in the future either).

They do exist. And they run Linux. SLES9 in fact.

http://en.wikipedia.org/wiki/UNIVAC_1100/2200_series#UNISYS_ClearPath_IX_series

http://www.unisys.com/eprise/main/admin/corporate/doc/ClearPath_Plus_Dorado_Model_390_Server_Specification_Sheet.pdf

That machine is a direct descendant of the Univac 1101 from 1950 and
is still software-compatible with 1107s from 1962.

(Granted, they only run Linux on the x86 side.)

--
Mathematics is the supreme nostalgia of our time.

2007-05-08 07:35:08

by Esben Nielsen

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8



On Mon, 7 May 2007, Johannes Stezenbach wrote:

> On Mon, May 07, 2007, Linus Torvalds wrote:
>> On Mon, 7 May 2007, Esben Nielsen wrote:
>>>
>>> What is (long)(a-b) ? I have tried to look it up in the C99 standard but I
>>> can't find it. Maybe it is in the referred LIA-1 standard, which I can't
>>> find with google.
>
> C99 defines unsigned overflow semantics, but it doesn't say anything
> about signed overflow, thus it's undefined -- and you have a hard
> time finding that out.
>
> However, I have no clue *why* it's undefined and not
> implementation defined. Does someone know?
>
>> I don't worry about non-2's-complement machines (they don't exist, and
>> likely won't exist in the future either).
>
> I think DSPs can do saturating arithmetic (clamp to min/max
> values instead of wrapping around). Not that it matters for Linux...
>
>> So I worry about compilers rewriting my code.
>
> gcc has -fwrapv and -ftrapv to change signed integer overflow
> behaviour.
>
> One baffling example where gcc rewrites code is when
> conditionals depend on signed integer overflow:
>
> $ cat xx.c
> #include <assert.h>
>
> int foo(int a)
> {
> assert(a + 100 > a);
> return a;
> }
>
> int bar(int a)
> {
> if (a + 100 > a)
> a += 100;
> return a;
> }
> $ gcc -Wall -Wextra -fomit-frame-pointer -c xx.c
> $ objdump -dr xx.o
>
> xx.o: file format elf32-i386
>
> Disassembly of section .text:
>
> 00000000 <foo>:
> 0: 8b 44 24 04 mov 0x4(%esp),%eax
> 4: c3 ret
>
> 00000005 <bar>:
> 5: 83 44 24 04 64 addl $0x64,0x4(%esp)
> a: 8b 44 24 04 mov 0x4(%esp),%eax
> e: c3 ret
>
>
> The assert and the condition were just dropped
> by gcc -- without any warning.
>
> gcc-4.2 will add -fstrict-overflow and -Wstrict-overflow.
> http://gcc.gnu.org/gcc-4.2/changes.html
>
>
> Johannes
>

This is contrary to C99 standard annex H2.2
(http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf):

"An implementation that defines signed integer types as also being modulo need
not detect integer overflow, in which case, only integer divide-by-zero need
be detected."

So if it doesn't properly define wrapping, it has to detect integer
overflow, right?

gcc does neither with that optimization :-(
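
(For what it's worth, bar()'s test can be phrased so that no signed
overflow is ever executed, and then gcc has to keep it -- a minimal
sketch; alternatively, building with -fwrapv makes gcc treat signed
overflow as wrapping and preserves the original test:)

#include <limits.h>

int bar(int a)
{
    /*
     * "Would a + 100 overflow?" asked in fully defined arithmetic:
     * INT_MAX - 100 cannot overflow, and the addition below only
     * happens when the result is known to fit.
     */
    if (a <= INT_MAX - 100)
        a += 100;
    return a;
}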

Esben

2007-05-08 09:05:43

by Esben Nielsen

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8



On Tue, 8 May 2007, Peter Williams wrote:

> Esben Nielsen wrote:
>>
>>
>> On Sun, 6 May 2007, Linus Torvalds wrote:
>>
>> >
>> >
>> > On Sun, 6 May 2007, Ingo Molnar wrote:
>> > >
>> > > * Linus Torvalds <[email protected]> wrote:
>> > >
>> > > > So the _only_ valid way to handle timers is to
>> > > > - either not allow wrapping at all (in which case "unsigned" is
>> > > > better,
>> > > > since it is bigger)
>> > > > - or use wrapping explicitly, and use unsigned arithmetic (which is
>> > > > well-defined in C) and do something like "(long)(a-b) > 0".
>> > >
>> > > hm, there is a corner-case in CFS where a fix like this is necessary.
>> > >
>> > > CFS uses 64-bit values for almost everything, and the majority of
>> > > values
>> > > are of 'relative' nature with no danger of overflow. (They are signed
>> > > because they are relative values that center around zero and can be
>> > > negative or positive.)
>> >
>> > Well, I'd like to just worry about that for a while.
>> >
>> > You say there is "no danger of overflow", and I mostly agree that once
>> > we're talking about 64-bit values, the overflow issue simply doesn't
>> > exist, and furthermore the difference between 63 and 64 bits is not
>> > really
>> > relevant, so there's no major reason to actively avoid signed entries.
>> >
>> > So in that sense, it all sounds perfectly sane. And I'm definitely not
>> > sure your "292 years after bootup" worry is really worth even
>> > considering.
>> >
>>
>> I would hate to tell mission control for Mankind's first mission to
>> another
>> star to reboot every 200 years because "there is no need to worry about
>> it."
>>
>> As a matter of principle an OS should never need a reboot (with exception
>> for upgrading). If you say you have to reboot every 200 years, why not
>> every 100? Every 50? .... Every 45 days (you know what I am referring to
>> :-) ?
>
> There's always going to be an upper limit on the representation of time.
> At least until we figure out how to implement infinity properly.

Well, you need infinite memory for that :-)
But that is my point: why go to the trouble of storing absolute time
when you can use relative time?


>
>>
>> > When we're really so well off that we expect the hardware and software
>> > stack to be stable over a hundred years, I'd start to think about issues
>> > like that, in the meantime, to me worrying about those kinds of issues
>> > just means that you're worrying about the wrong things.
>> >
>> > BUT.
>> >
>> > There's a fundamental reason relative timestamps are difficult and
>> > almost
>> > always have overflow issues: the "long long in the future" case as an
>> > approximation of "infinite timeout" is almost always relevant.
>> >
>> > So rather than worry about the system staying up 292 years, I'd worry
>> > about whether people pass in big numbers (like some MAX_S64
>> > approximation)
>> > as an approximation for "infinite", and once you have things like that,
>> > the "64 bits never overflows" argument is totally bogus.
>> >
>> > There's a damn good reason for using only *absolute* time. The whole
>> > "signed values of relative time" may _sound_ good, but it really sucks
>> > in
>> > subtle and horrible ways!
>> >
>>
>> I think you are wrong here. The only place you need absolute time is for
>> the clock (CLOCK_REALTIME). You waste CPU using a 64 bit
>> representation when you could have used a 32 bit. With a 32 bit
>> implementation you are forced to handle the corner cases with wrap around
>> and too big arguments up front. With a 64 bit you hide those problems.
>
> As does the other method. A 32 bit signed offset with a 32 bit base is just
> a crude version of 64 bit absolute time.

64 bit is also relative - just over a much longer period.
A 32 bit signed offset is relative - and you know it. But with 64 bits
people think it is absolute and put in large values, as Linus said above.
With 32 bits future developers will know it is relative and code for it.
And they will get their corner cases tested, because the code will soon
run into those corners.
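
(This is in fact what the kernel already does for jiffies: INITIAL_JIFFIES
starts the counter a few minutes short of the 32 bit wrap point, so
non-wrap-safe comparisons blow up shortly after boot rather than only
after weeks of uptime -- a sketch of the idea, with a made-up name:)

/*
 * In the style of the kernel's INITIAL_JIFFIES: start the tick counter
 * 5 minutes (300 seconds at HZ ticks/sec) before the 32-bit wrap, so
 * code that compares timestamps without wrap-safe arithmetic breaks
 * within minutes of boot instead of only after weeks of uptime.
 */
#define HZ 1000
#define INITIAL_TICKS ((unsigned long)(unsigned int)(-300 * HZ))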

>
>>
>> I think CFS would be best off using a 32 bit timer counting in
>> microseconds. That would wrap around in 72 minutes. But as the timers are
>> relative you will never be able to specify a timer larger than 36 minutes
>> in the future. But 36 minutes is ridiculously long for a scheduler and a
>> simple test limiting time values to that value would not break anything.
>
> Except if you're measuring sleep times. I think that you'll find lots of
> tasks sleep for more than 72 minutes.

I don't think those large values will be relevant. You can easily cut
off sleep times at 30 min or even 1 min. But you need to detect that the
task has indeed been sleeping 2^32+1 usec and not 1 usec. You can't do
that with a 32 bit timer alone. In that case you need to use an (at
least) 64 bit timer - which is needed in the OS anyway. But not
internally in the wait queue, where the repeated calculations are.
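
(A sketch of the split being described, with made-up names: read the
sleep length from a 64 bit clock, clamp it, and only then narrow to the
32 bit units used in the hot path:)

typedef unsigned long long u64;
typedef unsigned int u32;

#define MAX_SLEEP_US (30ULL * 60 * 1000000)   /* 30 min cut-off */

/*
 * now and sleep_start come from a 64 bit microsecond clock, so the
 * difference cannot wrap in practice and a sleep of 2^32+1 usec is not
 * confused with one of 1 usec. Only the clamped value is narrowed to
 * the 32 bits used by the scheduler's inner arithmetic.
 */
static u32 bounded_sleep_us(u64 sleep_start, u64 now)
{
    u64 slept = now - sleep_start;

    return (u32)(slept > MAX_SLEEP_US ? MAX_SLEEP_US : slept);
}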

Esben


>
> Peter
> --
> Peter Williams [email protected]
>
> "Learning, n. The kind of ignorance distinguishing the studious."
> -- Ambrose Bierce
>
>

2007-05-08 09:54:35

by Johannes Stezenbach

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Tue, May 08, 2007, Esben Nielsen wrote:
>
> This is contrary to C99 standard annex H2.2
> (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf):
>
> "An implementation that defines signed integer types as also being modulo
> need not detect integer overflow, in which case, only integer
> divide-by-zero need be detected."
>
> So if it doesn't properly define wrapping, it has to detect integer
> overflow, right?

No. Annex H (informative!) only talks about LIA-1 conformance.

C99 isn't LIA-1 conformant. H2.2 describes what an implementation
might do to make signed integers LIA-1 compatible, which is
what gcc does with -fwrapv or -ftrapv.

At least that's how I understand it, the C99 standard
seems to have been written with the "it was hard to
write, so it should be hard to read" mindset. :-/

I still don't know _why_ signed integer overflow behaviour
isn't defined in C. It just goes against everyone's expectations
and thus causes bugs.


Johannes

2007-05-08 10:27:45

by Esben Nielsen

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8



On Tue, 8 May 2007, Johannes Stezenbach wrote:

> On Tue, May 08, 2007, Esben Nielsen wrote:
>>
>> This is contrary to C99 standard annex H2.2
>> (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf):
>>
>> "An implementation that defines signed integer types as also being modulo
>> need not detect integer overflow, in which case, only integer
>> divide-by-zero need be detected."
>>
>> So if it doesn't properly define wrapping, it has to detect integer
>> overflow, right?
>
> No. Annex H (informative!) only talks about LIA-1 conformance.
>
> C99 isn't LIA-1 conformant. H2.2 describes what an implementation
> might do to make signed integers LIA-1 compatible.

"The signed C integer types int, long int, long long int, and the
corresponding unsigned types are compatible with LIA-1."

I read this as: any C99 implementation must be compatible. I would like
to see LIA-1 itself to check.


>, which is
> what gcc does with -fwrapv or -ftrapv.
>

Yes, either/or: either wrap or trap.

> At least that's how I understand it, the C99 standard
> seems to have been written with the "it was hard to
> write, so it should be hard to read" mindset. :-/
>
> I still don't know _why_ signed integer overflow behaviour
> isn't defined in C. It just goes against everyone's expectations
> and thus causes bugs.

Because it is hard to make wrapping work on non-two's-complement
architectures. There it is easier to trap.

Esben

>
>
> Johannes
>

2007-05-09 00:02:26

by Peter Williams

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

Esben Nielsen wrote:
>
>
> On Tue, 8 May 2007, Peter Williams wrote:
>
>> Esben Nielsen wrote:
>>>
>>>
>>> On Sun, 6 May 2007, Linus Torvalds wrote:
>>>
>>> > On Sun, 6 May 2007, Ingo Molnar wrote:
>>> > >
>>> > > * Linus Torvalds <[email protected]> wrote:
>>> > >
>>> > > > So the _only_ valid way to handle timers is to
>>> > > >  - either not allow wrapping at all (in which case "unsigned" is
>>> > > >    better, since it is bigger)
>>> > > >  - or use wrapping explicitly, and use unsigned arithmetic (which is
>>> > > >    well-defined in C) and do something like "(long)(a-b) > 0".
>>> > >
>>> > > hm, there is a corner-case in CFS where a fix like this is necessary.
>>> > >
>>> > > CFS uses 64-bit values for almost everything, and the majority of
>>> > > values are of 'relative' nature with no danger of overflow. (They are
>>> > > signed because they are relative values that center around zero and
>>> > > can be negative or positive.)
>>> >
>>> > Well, I'd like to just worry about that for a while.
>>> >
>>> > You say there is "no danger of overflow", and I mostly agree that once
>>> > we're talking about 64-bit values, the overflow issue simply doesn't
>>> > exist, and furthermore the difference between 63 and 64 bits is not
>>> > really relevant, so there's no major reason to actively avoid signed
>>> > entries.
>>> >
>>> > So in that sense, it all sounds perfectly sane. And I'm definitely not
>>> > sure your "292 years after bootup" worry is really worth even
>>> > considering.
>>> >
>>>
>>> I would hate to tell mission control for Mankind's first mission to
>>> another
>>> star to reboot every 200 years because "there is no need to worry about
>>> it."
>>>
>>> As a matter of principle an OS should never need a reboot (with
>>> exception
>>> for upgrading). If you say you have to reboot every 200 years, why not
>>> every 100? Every 50? .... Every 45 days (you know what I am
>>> referring to
>>> :-) ?
>>
>> There's always going to be an upper limit on the representation of time.
>> At least until we figure out how to implement infinity properly.
>
> Well you need infinite memory for that :-)
> But that is my point: Why go into the problem of storing absolute time
> when you can use relative time?

I'd reverse that and say "Why go to the bother of using relative time
when you can use absolute time?". Absolute time being time since boot,
of course.

>
>
>>
>>>
>>> > When we're really so well off that we expect the hardware and software
>>> > stack to be stable over a hundred years, I'd start to think about issues
>>> > like that, in the meantime, to me worrying about those kinds of issues
>>> > just means that you're worrying about the wrong things.
>>> >
>>> > BUT.
>>> >
>>> > There's a fundamental reason relative timestamps are difficult and almost
>>> > always have overflow issues: the "long long in the future" case as an
>>> > approximation of "infinite timeout" is almost always relevant.
>>> >
>>> > So rather than worry about the system staying up 292 years, I'd worry
>>> > about whether people pass in big numbers (like some MAX_S64 approximation)
>>> > as an approximation for "infinite", and once you have things like that,
>>> > the "64 bits never overflows" argument is totally bogus.
>>> >
>>> > There's a damn good reason for using only *absolute* time. The whole
>>> > "signed values of relative time" may _sound_ good, but it really sucks in
>>> > subtle and horrible ways!
>>> >
>>>
>>> I think you are wrong here. The only place you need absolute time is for
>>> the clock (CLOCK_REALTIME). You waste CPU using a 64 bit
>>> representation when you could have used a 32 bit. With a 32 bit
>>> implementation you are forced to handle the corner cases with wrap
>>> around
>>> and too big arguments up front. With a 64 bit you hide those problems.
>>
>> As does the other method. A 32 bit signed offset with a 32 bit base
>> is just a crude version of 64 bit absolute time.
>
> 64 bit is also relative - just over a much longer period.

Yes, relative to boot.

> A 32 bit signed offset is relative - and you know it. But with 64 bits
> people think it is absolute and put in large values, as Linus said above.

What people? Who gets to feed times into the scheduler? Isn't it just
using the time as determined by the system?

> With
> 32 bits future developers will know it is relative and code for it. And
> they will get their corner cases tested, because the code will soon run
> into those corners.
>
>>
>>>
>>> I think CFS would be best off using a 32 bit timer counting in
>>> microseconds. That would wrap around in 72 minutes. But as the timers
>>> are relative you will never be able to specify a timer larger than 36
>>> minutes in the future. But 36 minutes is ridiculously long for a
>>> scheduler and a simple test limiting time values to that value would
>>> not break anything.
>>
>> Except if you're measuring sleep times. I think that you'll find lots
>> of tasks sleep for more than 72 minutes.
>
> I don't think those large values will be relevant. You can easily cut
> off sleep times at 30 min or even 1 min.

The aim is to make the code as simple as possible, not to add this kind
of rubbish, and 1 minute would be far too low.

> But you need to detect that the
> task has indeed been sleeping 2^32+1 usec and not 1 usec. You can't do
> that with a 32 bit timer alone. In that case you need to use an (at
> least) 64 bit timer - which is needed in the OS anyway. But not
> internally in the wait queue, where the repeated calculations are.
>

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-11 12:29:56

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

Hi!

> >> You say there is "no danger of overflow", and I mostly agree that once
> >> we're talking about 64-bit values, the overflow issue simply doesn't
> >> exist, and furthermore the difference between 63 and 64 bits is not
> >> really relevant, so there's no major reason to actively avoid signed
> >> entries.
> >>
> >> So in that sense, it all sounds perfectly sane. And I'm definitely not
> >> sure your "292 years after bootup" worry is really worth even
> >> considering.
> >
> > I would hate to tell mission control for Mankind's first mission to
> > another star to reboot every 200 years because "there is no need to
> > worry about it."
> >
> > As a matter of principle an OS should never need a reboot (with
> > exception for upgrading). If you say you have to reboot every 200
> > years, why not every 100? Every 50? .... Every 45 days (you know what
> > I am referring to :-) ?
>
> There's always going to be an upper limit on the representation of
> time. At least until we figure out how to implement infinity properly.

There's also an upper limit on the lifetime of this universe. 1000 bits
is certainly enough to represent that in microseconds.
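(Indeed: 2^1000 microseconds is roughly 3*10^287 years, and even 128 bits
covers about 10^25 years of microseconds, against a current age of the
universe of only ~1.4*10^10 years.)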

Also notice that current cpus were not designed to work 300 years.
When we have hw designed for 50 years+, we can start to worry.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-05-11 16:52:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8



On Thu, 10 May 2007, Pavel Machek wrote:
>
> Also notice that current cpus were not designed to work 300 years.
> When we have hw designed for 50 years+, we can start to worry.

Indeed. CPU manufacturers don't seem to talk about it very much, and
searching for it with google on intel.com comes up with

The failure rate and Mean Time Between Failure (MTBF) data is not
currently available on our website. You may contact Intel(R)
Customer Support for this information.

which seems to be just a fancy way of saying "we don't actually want to
talk about it". Probably not because it's actually all that bad, but
simply because people don't think about it, and there's no reason a CPU
manufacturer would *want* people to think about it.

But if you wondered why server CPU's usually run at a lower frequency,
it's because of MTBF issues. I think a desktop CPU is usually specced to
run for 5 years (and that's expecting that it's turned off or at least
idle much of the time), while a server CPU is expected to last longer and
be active a much bigger percentage of time.

("Active" == "heat" == "more damage due to atom migration etc". Which is
part of why you're not supposed to overclock stuff: it may well work well
for you, but for all you know it will cut your expected CPU life by 90%).

Of course, all the other components have a MTBF too (and I'd generally
expect a power supply component to fail long before the CPU does). And
yes, some machines will last for decades. But I suspect we've all heard of
machines that just don't work reliably any more.

Linus

2007-05-11 19:21:33

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

Hi!

> > Also notice that current cpus were not designed to work 300 years.
> > When we have hw designed for 50 years+, we can start to worry.
>
> Indeed. CPU manufacturers don't seem to talk about it very much, and
> searching for it with google on intel.com comes up with
>
> The failure rate and Mean Time Between Failure (MTBF) data is not
> currently available on our website. You may contact Intel(R)
> Customer Support for this information.
>
> which seems to be just a fancy way of saying "we don't actually want to
> talk about it". Probably not because it's actually all that bad, but
> simply because people don't think about it, and there's no reason a CPU
> manufacturer would *want* people to think about it.
>
> But if you wondered why server CPU's usually run at a lower frequency,
> it's because of MTBF issues. I think a desktop CPU is usually specced to
> run for 5 years (and that's expecting that it's turned off or at least
> idle much of the time), while a server CPU is expected to last longer and
> be active a much bigger percentage of time.
>
> ("Active" == "heat" == "more damage due to atom migration etc". Which is
> part of why you're not supposed to overclock stuff: it may well work well
> for you, but for all you know it will cut your expected CPU life by 90%).

Actually, when I talked with AMD, they told me that cpus should last
10 years *at their max specced temperature*... which is 95 Celsius. So
overclocking is not that evil, according to my info.

(That would mean way more than 10 years if you use your cpu
'normally'.)

But I guess the capacitors in the CPU power supply will hate you for
running the CPU at 95C...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-05-11 19:39:33

by Willy Tarreau

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On Fri, May 11, 2007 at 07:18:25PM +0000, Pavel Machek wrote:
> Hi!
>
> > > Also notice that current cpus were not designed to work 300 years.
> > > When we have hw designed for 50 years+, we can start to worry.
> >
> > Indeed. CPU manufacturers don't seem to talk about it very much, and
> > searching for it with google on intel.com comes up with
> >
> > The failure rate and Mean Time Between Failure (MTBF) data is not
> > currently available on our website. You may contact Intel(R)
> > Customer Support for this information.
> >
> > which seems to be just a fancy way of saying "we don't actually want to
> > talk about it". Probably not because it's actually all that bad, but
> > simply because people don't think about it, and there's no reason a CPU
> > manufacturer would *want* people to think about it.
> >
> > But if you wondered why server CPU's usually run at a lower frequency,
> > it's because of MTBF issues. I think a desktop CPU is usually specced to
> > run for 5 years (and that's expecting that it's turned off or at least
> > idle much of the time), while a server CPU is expected to last longer and
> > be active a much bigger percentage of time.
> >
> > ("Active" == "heat" == "more damage due to atom migration etc". Which is
> > part of why you're not supposed to overclock stuff: it may well work well
> > for you, but for all you know it will cut your expected CPU life by 90%).
>
> Actually, when I talked with AMD, they told me that cpus should last
> 10 years *at their max specced temperature*... which is 95 Celsius. So
> overclocking is not that evil, according to my info.

I tend to believe that. I've slowed down the fans on my dual Athlon XP
to silence them, and I found the system to be (apparently) stable up to
96 deg Celsius with the fans unplugged. So I regulated them to maintain
a temperature below 90 deg for a safety margin, and they seem to be as
happy as my ears. They've been like that for the last 3-4 years or so.

On a related note, older technology is less sensitive. My old VAX VLC4000
from 1991, which received a small heatsink in exchange for its noisy fans,
is still doing well in a closed place. I'm more worried about the EPROMs
which store the boot code. Also, I think that the half-micron RS6000s
which run the Mars rovers might run for hundreds of years at 30 MHz.

> (That would mean way more than 10 years if you use your cpu
> 'normally'.)
>
> But I guess the capacitors in the CPU power supply will hate you for
> running the CPU at 95C...

Even at 60, many of them die within a few years. But I think we're
"slightly" off-topic now...

> Pavel

Cheers
Willy

2007-05-11 20:54:17

by Kevin Bowling

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v8

On 5/11/07, Willy Tarreau <[email protected]> wrote:
> On Fri, May 11, 2007 at 07:18:25PM +0000, Pavel Machek wrote:
> > Hi!
> >
> > > > Also notice that current cpus were not designed to work 300 years.
> > > > When we have hw designed for 50 years+, we can start to worry.
>
> On a related note, older technology is less sensitive. My old VAX VLC4000
> from 1991, which received a small heatsink in exchange for its noisy fans,
> is still doing well in a closed place. I'm more worried about the EPROMs
> which store the boot code. Also, I think that the half-micron RS6000s
> which run the Mars rovers might run for hundreds of years at 30 MHz.

Not your typical RS/6000. http://en.wikipedia.org/wiki/RAD6000

Without a doubt, decently built computers will easily last at least 10
years. From what I can tell, the issue here was simply OS runtime?