2007-05-14 08:26:15

by Srivatsa Vaddagiri

Subject: fair clock use in CFS

Hi,
I have been brooding over how the fair clock is computed/used in
CFS and thought I would ask the experts to avoid wrong guesses!

As I understand it, fair_clock is a monotonically increasing clock which
advances at a pace inversely proportional to the load on the runqueue.
If load = 1 (task), it will advance at the same pace as the wall clock; as
load increases, it advances more slowly than the wall clock.

In addition, the following calculations depend on the fair clock: a task's
wait time on the runqueue and its sleep time outside the runqueue (both
reflected in p->wait_runtime).
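(To check my understanding, here is a toy model of that rule - the names and units are mine, not the kernel's:)

```c
#include <assert.h>

/* Toy model, not kernel code: per wall-clock delta, the fair clock
 * advances by delta/load, so it is monotonic but slows down as the
 * runqueue load grows (1:1 with the wall clock when load == 1). */
struct rq_model {
	long long fair_clock;	/* nanoseconds on the fair scale */
	long long load;		/* number of runnable nice-0 tasks */
};

void fair_clock_advance(struct rq_model *rq, long long delta_wall_ns)
{
	if (rq->load > 0)
		rq->fair_clock += delta_wall_ns / rq->load;
}
```

So with one runnable task a millisecond of wall time advances the fair clock
by a millisecond; with four runnable tasks, by a quarter of that.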

A few questions that come up are:

1. Why can't the fair clock be the same as the wall clock at all times? i.e.
the fair clock progresses at the same pace as the wall clock, independent of
the load on the runqueue.

Wouldn't it still give the ability to measure time spent waiting on the
runqueue or sleeping, and to use that measured time to give latency/bandwidth
credit?

        In the case of EEVDF, the use of a virtual clock seems more
understandable, if we consider the fact that each client gets 'wi' real
time units in 1 virtual time unit. That doesn't seem to be the case in
CFS, as Ting Yang explained with the +/- lags here:
http://lkml.org/lkml/2007/5/2/612


2. Preemption granularity - sysctl_sched_granularity

        This seems to be measured on the fair clock scale rather than the
        wall clock scale. As a consequence of this, the time taken
        for a task to relinquish the CPU to competition is dependent on the
        number N of tasks? For ex: if there are a million cpu-hungry tasks,
        then the time taken to switch between two tasks is longer than in
        the case where just two cpu-hungry tasks are running. Is there
        any advantage to using the fair clock scale to detect preemption points?


--
Regards,
vatsa


2007-05-14 10:29:12

by William Lee Irwin III

Subject: Re: fair clock use in CFS

On Mon, May 14, 2007 at 02:03:58PM +0530, Srivatsa Vaddagiri wrote:
> I have been brooding over how the fair clock is computed/used in
> CFS and thought I would ask the experts to avoid wrong guesses!
> As I understand it, fair_clock is a monotonically increasing clock which
> advances at a pace inversely proportional to the load on the runqueue.
> If load = 1 (task), it will advance at the same pace as the wall clock; as
> load increases, it advances more slowly than the wall clock.
> In addition, the following calculations depend on the fair clock: a task's
> wait time on the runqueue and its sleep time outside the runqueue (both
> reflected in p->wait_runtime).

It's not hard to see that that's a mistake. The great thing about virtual
deadline schedulers, though, is that all it takes to completely rewrite
the policy is changing the numbers calculated for these things.


On Mon, May 14, 2007 at 02:03:58PM +0530, Srivatsa Vaddagiri wrote:
> A few questions that come up are:
> 1. Why can't the fair clock be the same as the wall clock at all times? i.e.
> the fair clock progresses at the same pace as the wall clock, independent of
> the load on the runqueue.
> Wouldn't it still give the ability to measure time spent waiting on the
> runqueue or sleeping, and to use that measured time to give
> latency/bandwidth credit?
>         In the case of EEVDF, the use of a virtual clock seems more
> understandable, if we consider the fact that each client gets 'wi' real
> time units in 1 virtual time unit. That doesn't seem to be the case in
> CFS, as Ting Yang explained with the +/- lags here:
> http://lkml.org/lkml/2007/5/2/612

It's not just more understandable, it doesn't break down with
increasing numbers of tasks.


On Mon, May 14, 2007 at 02:03:58PM +0530, Srivatsa Vaddagiri wrote:
> 2. Preemption granularity - sysctl_sched_granularity
>         This seems to be measured on the fair clock scale rather than the
>         wall clock scale. As a consequence of this, the time taken
>         for a task to relinquish the CPU to competition is dependent on the
>         number N of tasks? For ex: if there are a million cpu-hungry tasks,
>         then the time taken to switch between two tasks is longer than in
>         the case where just two cpu-hungry tasks are running. Is there
>         any advantage to using the fair clock scale to detect preemption points?

I'm not convinced this is a great way to mitigate context switch
overhead. I'd recommend heuristically adjusting the latency parameters
(l_i) to try to mitigate context switch overhead or otherwise express
the tradeoff between latency and throughput instead of introducing an
arbitrary delay like sched_granularity_ns. I suspect it'll have to
restart from a point much closer to EEVDF for that to be effective.


-- wli

2007-05-14 10:31:41

by Ingo Molnar

Subject: Re: fair clock use in CFS


* William Lee Irwin III <[email protected]> wrote:

> On Mon, May 14, 2007 at 02:03:58PM +0530, Srivatsa Vaddagiri wrote:
> > I have been brooding over how the fair clock is computed/used in
> > CFS and thought I would ask the experts to avoid wrong guesses!
> > As I understand it, fair_clock is a monotonically increasing clock which
> > advances at a pace inversely proportional to the load on the runqueue.
> > If load = 1 (task), it will advance at the same pace as the wall clock; as
> > load increases, it advances more slowly than the wall clock.
> > In addition, the following calculations depend on the fair clock: a task's
> > wait time on the runqueue and its sleep time outside the runqueue (both
> > reflected in p->wait_runtime).
>
> It's not hard to see that that's a mistake. [...]

please clarify - exactly what is a mistake? Thanks,

Ingo

2007-05-14 11:04:36

by William Lee Irwin III

Subject: Re: fair clock use in CFS

On Mon, May 14, 2007 at 02:03:58PM +0530, Srivatsa Vaddagiri wrote:
>>> I have been brooding over how the fair clock is computed/used in
>>> CFS and thought I would ask the experts to avoid wrong guesses!
>>> As I understand it, fair_clock is a monotonically increasing clock which
>>> advances at a pace inversely proportional to the load on the runqueue.
>>> If load = 1 (task), it will advance at the same pace as the wall clock; as
>>> load increases, it advances more slowly than the wall clock.
>>> In addition, the following calculations depend on the fair clock: a task's
>>> wait time on the runqueue and its sleep time outside the runqueue (both
>>> reflected in p->wait_runtime).

* William Lee Irwin III <[email protected]> wrote:
>> It's not hard to see that that's a mistake. [...]

On Mon, May 14, 2007 at 12:31:20PM +0200, Ingo Molnar wrote:
> please clarify - exactly what is a mistake? Thanks,

The variability in ->fair_clock advancement rate was the mistake, at
least according to my way of thinking. The queue's virtual time clock
effectively stops under sufficiently high load, possibly literally in
the event of fixed-point underflow. The waiting time is impossible to
scale properly in the presence of a variable time rate (at least under
the memory allocation constraints of a scheduler), and the deadlines
computed as offsets from ->fair_clock come out clustered.

Basically it needs to move closer to EEVDF in these respects.

This detail sailed right past me when it went in; sorry about that.


-- wli

2007-05-14 11:11:17

by Ingo Molnar

Subject: Re: fair clock use in CFS


* Srivatsa Vaddagiri <[email protected]> wrote:

> I have been brooding over how the fair clock is computed/used in
> CFS and thought I would ask the experts to avoid wrong guesses!

hey, thanks for the interest :-)

> As I understand it, fair_clock is a monotonically increasing clock which
> advances at a pace inversely proportional to the load on the runqueue.
> If load = 1 (task), it will advance at the same pace as the wall clock; as
> load increases, it advances more slowly than the wall clock.

correct.

> In addition, the following calculations depend on the fair clock: a task's
> wait time on the runqueue and its sleep time outside the runqueue (both
> reflected in p->wait_runtime).

(note that in -v12 only the on-runqueue wait time is used.)

> Few questions that come up are:
>
> 1. Why can't the fair clock be the same as the wall clock at all times?
> i.e. the fair clock progresses at the same pace as the wall clock,
> independent of the load on the runqueue.
>
> Wouldn't it still give the ability to measure time spent waiting on the
> runqueue or sleeping, and to use that measured time to give
> latency/bandwidth credit?

155 milliseconds spent waiting for CPU time while there are another 10
tasks contending for the CPU is a different scenario from when there is
just one task running on the CPU. In the 10-task case a single task
would only have a 'fair expectation' to run for 15.5 milliseconds; in
the 1-task case a single task has a 'fair expectation' to run the full
155 milliseconds. The 'fair clock' in essence measures this capacity of
'effective CPU power'.
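That arithmetic, written out as a sketch (nice-0 tasks only; names are illustrative, not kernel code):

```c
#include <assert.h>

/* Fair expectation of CPU time for a nice-0 task that spent
 * wait_wall_ns waiting while nr_running tasks contended for the CPU
 * (illustrative sketch of the example above). */
long long fair_expectation_ns(long long wait_wall_ns, long long nr_running)
{
	return wait_wall_ns / nr_running;
}
```

155 ms of waiting among 10 contenders is worth 15.5 ms of entitled CPU time;
the same wait with a single runnable task is worth the full 155 ms.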

but let me give you some more CFS design background:

80% of CFS's design can be summed up in a single sentence: CFS basically
models an "ideal, precise multi-tasking CPU" on real hardware.

"Ideal multi-tasking CPU" is a (non-existent :-) CPU that has 100%
physical power and which can run each task at precise equal speed, in
parallel, each at 1/nr_running speed. For example: if there are 2 tasks
running then it runs each at 50% physical power - totally in parallel.

On real hardware, we can run only a single task at once, so while that
one task runs the other tasks that are waiting for the CPU are at a
disadvantage - the current task gets an unfair amount of CPU time. In
CFS this fairness imbalance is expressed and tracked via the per-task
p->wait_runtime (nanosec-unit) value. "wait_runtime" is the amount of
time the task should now run on the CPU for it to become completely fair
and balanced.

( small detail: on 'ideal' hardware, the p->wait_runtime value would
always be zero - no task would ever get 'out of balance' from the
'ideal' share of CPU time. )

CFS's task picking logic is based on this p->wait_runtime value and it
is thus very simple: it always tries to run the task with the largest
p->wait_runtime value. In other words, CFS tries to run the task with
the 'gravest need' for more CPU time. So CFS always tries to split up
CPU time between runnable tasks as close to 'ideal multitasking
hardware' as possible.

Most of the rest of CFS's design just falls out of this really simple
concept, with a few add-on embellishments like nice levels,
multiprocessing and various algorithm variants to recognize sleepers.

In practice it works like this: the system runs a task a bit, and when
the task schedules (or a scheduler tick happens) the task's CPU usage is
'accounted for': the (small) time it just spent using the physical CPU
is deducted from p->wait_runtime [minus the 'fair share' it would have
gotten anyway]. Once p->wait_runtime gets low enough that another
task becomes the 'leftmost task' (plus a small amount of 'granularity'
distance relative to the leftmost task so that we do not over-schedule
tasks and thrash the cache) then the new leftmost task is picked and the
current task is preempted.
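That accounting step can be sketched like this (a simplification for equal nice-0 tasks; the real update_curr() works in weighted units, and the names here are mine):

```c
#include <assert.h>

/* One accounting step for the running task (simplified sketch):
 * it consumed delta_exec of real CPU time but was only 'due' a
 * 1/nr_running fair share of it, so the excess is deducted from
 * its wait_runtime. */
long long account_curr(long long wait_runtime, long long delta_exec,
		       long long nr_running)
{
	long long fair_share = delta_exec / nr_running;

	return wait_runtime - (delta_exec - fair_share);
}
```

After running 20 ms with 10 runnable tasks, a task that started balanced ends
up 18 ms 'in debt'; with one task running alone nothing is deducted.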

The rq->fair_clock value tracks the 'CPU time a runnable task would have
fairly gotten, had it been runnable during that time'. So by using
rq->fair_clock values we can accurately timestamp and measure the
'expected CPU time' a task should have gotten. All runnable tasks are
sorted in the rbtree by the "rq->fair_clock - p->wait_runtime" key, and
CFS picks the 'leftmost' task and sticks to it. As the system progresses
forwards, newly woken tasks are put into the tree more and more to the
right - slowly but surely giving a chance for every task to become the
'leftmost task' and thus get on the CPU within a deterministic amount of
time.
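So the pick itself reduces to ordering tasks by that key (a sketch only; CFS keeps the tasks in an rbtree rather than scanning an array):

```c
#include <assert.h>

/* The sort key described above: a task that has waited a lot has a
 * large wait_runtime, hence a smaller key, and sorts further left. */
long long fair_key(long long rq_fair_clock, long long wait_runtime)
{
	return rq_fair_clock - wait_runtime;
}

/* Pick the 'leftmost' (smallest-key) task; illustrative linear scan. */
int pick_leftmost(const long long *keys, int n)
{
	int i, best = 0;

	for (i = 1; i < n; i++)
		if (keys[i] < keys[best])
			best = i;
	return best;
}
```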

> 2. Preemption granularity - sysctl_sched_granularity
>
>         This seems to be measured on the fair clock scale rather than the
>         wall clock scale. As a consequence of this, the time taken for a
>         task to relinquish the CPU to competition is dependent on the
>         number N of tasks? For ex: if there are a million cpu-hungry tasks,
>         then the time taken to switch between two tasks is longer than in
>         the case where just two cpu-hungry tasks are running. Is there any
>         advantage to using the fair clock scale to detect preemption points?

yes, there is an advantage. The granularity is really mostly a property
of the physical CPU and not part of the "precise scheduling" model.
Granularity expresses the fact that a task has caching affinity to the
CPU and there is a cost to scheduling in too fine-grained a way. This
granularity value does not depend on the number of tasks running. For
example the granularity value could be made really small if CPUs had no
task-switching costs.

( small detail: the granularity value is currently dependent on the nice
level, making it easier for higher-prio tasks to preempt lower-prio
tasks. )

( i'd agree in theory with the proposition that the granularity could be
made a function of load too, but only to model cache/affinity effects,
_not_ due to the basic model of precise scheduling. )

Ingo

2007-05-14 11:15:20

by Srivatsa Vaddagiri

Subject: Re: fair clock use in CFS

On Mon, May 14, 2007 at 04:05:00AM -0700, William Lee Irwin III wrote:
> The variability in ->fair_clock advancement rate was the mistake, at
> least according to my way of thinking. The queue's virtual time clock
> effectively stops under sufficiently high load, possibly literally in
> the event of fixed-point underflow.

[snip]

> Basically it needs to move closer to EEVDF in these respects.

Doesn't EEVDF have the same issue? From the paper:

dV(t)/dt = 1/(w1 + w2 + ... + wn)

--
Regards,
vatsa

2007-05-14 11:19:40

by William Lee Irwin III

Subject: Re: fair clock use in CFS

On Mon, May 14, 2007 at 04:05:00AM -0700, William Lee Irwin III wrote:
>> The variability in ->fair_clock advancement rate was the mistake, at
>> least according to my way of thinking. The queue's virtual time clock
>> effectively stops under sufficiently high load, possibly literally in
>> the event of fixed-point underflow.

On Mon, May 14, 2007 at 04:52:59PM +0530, Srivatsa Vaddagiri wrote:
> [snip]

On Mon, May 14, 2007 at 04:05:00AM -0700, William Lee Irwin III wrote:
>> Basically it needs to move closer to EEVDF in these respects.

On Mon, May 14, 2007 at 04:52:59PM +0530, Srivatsa Vaddagiri wrote:
> Doesn't EEVDF have the same issue? From the paper:
> dV(t)/dt = 1/(w1 + w2 + ... + wn)

Who knows what I was smoking, then. I misremembered the scale factor
as being on the other side of comparisons with the queue's clock. I'm
suspicious of EEVDF's timekeeping now as well.


-- wli

2007-05-14 11:59:24

by Ingo Molnar

Subject: Re: fair clock use in CFS


* William Lee Irwin III <[email protected]> wrote:

> On Mon, May 14, 2007 at 12:31:20PM +0200, Ingo Molnar wrote:
> > please clarify - exactly what is a mistake? Thanks,
>
> The variability in ->fair_clock advancement rate was the mistake, at
> least according to my way of thinking. [...]

you are quite wrong. Let's consider the following example:

we have 10 tasks running (all at nice 0). The current task spends 20
msecs on the CPU and a new task is picked. How much CPU time did that
waiting task get entitled to during its 20 msecs wait? If fair_clock was
constant as you suggest then we'd give it 20 msecs - but its true 'fair
expectation' of CPU time was only 20/10 == 2 msecs!

So a 'constant' fair_clock would turn the whole equilibrium upside down
(it would inflate p->wait_runtime values and the global sum would not be
roughly constant anymore but would run up very fast), especially during
fluctuating loads.

the fair_clock is the fundamental expression of "fair CPU timeline", and
task's expected runtime is always measured by that, not by the real
clock. The only time when we measure the true time is when a _single_
task runs on the CPU - but in that case the task truly spent a small
amount of time on the CPU, exclusively. See the exec_time calculations
in kernel/sched_fair.c.

Ingo

2007-05-14 12:04:24

by Ingo Molnar

Subject: Re: fair clock use in CFS


* William Lee Irwin III <[email protected]> wrote:

> [...] I'm suspicious of EEVDF's timekeeping now as well.

well, EEVDF is a paper-only academic scheduler, one out of thousands
that never touched real hardware. For nearly every mathematically
possible scheduling algorithm i suspect there's at least one paper out
there already describing it ;-) But this time the EEVDF paper got it
right and you got it wrong. But really, if you want to make an impact on
what CFS looks like then please try it and send test feedback and/or
patches - not words. =B-)

Ingo

2007-05-14 12:56:27

by Srivatsa Vaddagiri

Subject: Re: fair clock use in CFS

On Mon, May 14, 2007 at 01:10:51PM +0200, Ingo Molnar wrote:
> but let me give you some more CFS design background:

Thanks for this excellent explanation. Things are much clearer now to
me. I just want to clarify one thing below:

> > 2. Preemption granularity - sysctl_sched_granularity

[snip]

> This granularity value does not depend on the number of tasks running.

Hmm .. so does sysctl_sched_granularity represent granularity on the
real/wall-clock time scale then? AFAICS that doesn't seem to be the case.

__check_preempt_curr_fair() compares the distance between the two
tasks' (current and next-to-be-run) fair_key values to decide the
preemption point.

Let's say that to begin with, at real time t0, both the current task Tc and the
next task Tn have the same fair_key value, K. Tc will keep running until its
fair_key value reaches at least K + 2000000. The *real/wall-clock* time taken
for Tc's fair_key value to reach K + 2000000 is surely dependent on N,
the number of tasks on the queue (the more the load, the more slowly the fair
clock advances)?

This is what I meant by my earlier remark: "If there are a million cpu-hungry
tasks, then the (real/wall-clock) time taken to switch between two tasks is
longer than in the case where just two cpu-hungry tasks are running".

--
Regards,
vatsa

2007-05-14 13:16:11

by Ingo Molnar

Subject: Re: fair clock use in CFS


* Srivatsa Vaddagiri <[email protected]> wrote:

> On Mon, May 14, 2007 at 01:10:51PM +0200, Ingo Molnar wrote:
> > but let me give you some more CFS design background:
>
> Thanks for this excellent explanation. Things are much clearer now to
> me. I just want to clarify one thing below:
>
> > > 2. Preemption granularity - sysctl_sched_granularity
>
> [snip]
>
> > This granularity value does not depend on the number of tasks running.
>
> Hmm .. so does sysctl_sched_granularity represent granularity on the
> real/wall-clock time scale then? AFAICS that doesn't seem to be the
> case.

there's only this small detail i mentioned:

> > ( small detail: the granularity value is currently dependent on the
> > nice level, making it easier for higher-prio tasks to preempt
> > lower-prio tasks. )

> __check_preempt_curr_fair() compares the distance between the two
> tasks' (current and next-to-be-run) fair_key values to decide the
> preemption point.
>
> Let's say that to begin with, at real time t0, both the current task Tc
> and the next task Tn have the same fair_key value, K. Tc will keep
> running until its fair_key value reaches at least K + 2000000. The
> *real/wall-clock* time taken for Tc's fair_key value to reach K +
> 2000000 is surely dependent on N, the number of tasks on the queue
> (the more the load, the more slowly the fair clock advances)?

well, it's somewhere in the [ granularity .. granularity*2 ] wall-clock
range. Basically the slowest way it can reach it is at 'half speed' (two
tasks running), and the fastest way is at 'near full speed' (lots of tasks
running).

> This is what I meant by my earlier remark: "If there are a million cpu-
> hungry tasks, then the (real/wall-clock) time taken to switch between
> two tasks is longer than in the case where just two cpu-hungry tasks
> are running".

the current task is recalculated at scheduler tick time and put into the
tree at its new position. At a million tasks the fair-clock will advance
little (or not at all - which at these load levels is our smallest
problem anyway) so during a scheduling tick in kernel/sched_fair.c
update_curr() we will have a 'delta_mine' and 'delta_fair' of near zero
and a 'delta_exec' of ~1 million, so curr->wait_runtime will be
decreased at 'full speed': delta_exec-delta_mine, by almost a full tick.
So preemption will occur every sched_granularity (rounded up to the next
tick) points in time, in wall-clock time.

with 2 tasks running delta_exec-delta_mine is 0.5 million, so preemption
will occur in 2*sched_granularity (rounded up to the next timer tick)
wall-clock time.
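Rounding aside, that relationship can be written down directly (my reading of the above, for N equal nice-0 CPU hogs; an illustrative sketch, not kernel code):

```c
#include <assert.h>

/* Per the description above: curr's key effectively advances at
 * (delta_exec - delta_mine) = delta_exec * (N-1)/N per wall-clock
 * unit, so covering 'granularity' on the key scale takes
 * granularity * N/(N-1) of wall time: 2x granularity at N=2,
 * approaching 1x granularity as N grows (tick rounding ignored). */
long long wall_ns_to_preempt(long long granularity_ns, long long nr_running)
{
	return granularity_ns * nr_running / (nr_running - 1);
}
```

With a 2 ms granularity this gives 4 ms of wall time between switches for two
hogs, and just over 2 ms for a million of them.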

Ingo

2007-05-14 14:31:25

by Daniel Hazelton

Subject: Re: fair clock use in CFS

On Monday 14 May 2007 07:50:49 Ingo Molnar wrote:
> * William Lee Irwin III <[email protected]> wrote:
> > On Mon, May 14, 2007 at 12:31:20PM +0200, Ingo Molnar wrote:
> > > please clarify - exactly what is a mistake? Thanks,
> >
> > The variability in ->fair_clock advancement rate was the mistake, at
> > least according to my way of thinking. [...]
>
> you are quite wrong. Let's consider the following example:
>
> we have 10 tasks running (all at nice 0). The current task spends 20
> msecs on the CPU and a new task is picked. How much CPU time did that
> waiting task get entitled to during its 20 msecs wait? If fair_clock was
> constant as you suggest then we'd give it 20 msecs - but its true 'fair
> expectation' of CPU time was only 20/10 == 2 msecs!

Either you have a strange definition of fairness or you chose an extremely
poor example, Ingo. In a fair scheduler I'd expect all tasks to get the exact
same amount of time on the processor. So if there are 10 tasks running at
nice 0 and the current task has run for 20msecs before a new task is swapped
onto the CPU, the new task and *all* other tasks waiting to get onto the CPU
should get the same 20msecs. What you've described above is fundamentally
unfair - one process running for 20msecs while the 10 processes that are
waiting for their chance each get a period that increases from a short period
at a predictable rate.

Some numbers based on your above description:
Process 1 runs for 20msecs
Process 2 runs for 2msecs (20/10 == 2msecs)
Process 3 runs for 2.2msecs (has waited 22msecs, 22/10 == 2.2)
Process 4 runs for 2.4msecs (has waited 24.2msecs - rounded for brevity)
Process 5 runs for 2.7msecs (has waited 26.6msecs - rounded for brevity)
Process 6 runs for 3msecs (has waited 30.3msecs)
Process 7 runs for 3.3msecs (has waited approx. 33msecs)
Process 8 runs for 3.6msecs (has waited approx. 36msecs)
Process 9 runs for 3.9msecs (has waited approx. 39msecs)
Process 10 runs for 4.2msecs (has waited approx. 42msecs)

Now if the "process time" isn't scaled to match the length of time that the
process has spent waiting to get on the CPU you get some measure of fairness
back, but even then the description of CFS you've given shows a fundamental
unfairness.

However, if you meant that "the new process has spent 20msecs waiting to get
on the CPU", then the rest of your description does show what I'd expect from
a fair scheduler. If not, then I guess that CFS is only "Completely Fair" for
significantly large values of "fair".

(I will not, however, argue that CFS isn't a damned good scheduler that has
improved interactivity on the systems of those people who have tested it.)

> So a 'constant' fair_clock would turn the whole equilibrium upside down
> (it would inflate p->wait_runtime values and the global sum would not be
> roughly constant anymore but would run up very fast), especially during
> fluctuating loads.

Hrm... Okay, so you're saying that "fair_clock" runs slower the more processes
there are running, to prevent the run-up in "Time Spent on CPU" I noticed
based solely on your initial example? If that is the case, then I can see the
fairness - it's just not visible from a really quick look at the code and the
simplified description you gave earlier.

DRH

2007-05-14 14:54:46

by Srivatsa Vaddagiri

Subject: Re: fair clock use in CFS

On Mon, May 14, 2007 at 10:31:13AM -0400, Daniel Hazelton wrote:
> Hrm... Okay, so you're saying that "fair_clock" runs slower the more
> processes there are running, to prevent the run-up in "Time Spent on CPU" I
> noticed based solely on your initial example? If that is the case, then I
> can see the fairness - it's just not visible from a really quick look at
> the code and the simplified description you gave earlier.

From the code:

update_curr()

delta_fair = delta_exec * NICE_0_LOAD;
do_div(delta_fair, rq->raw_weighted_load);

..

rq->fair_clock += delta_fair;

Although the wall clock would have advanced by delta_exec, the fair clock
advances only by delta_fair.

The more the load on the CPU (rq->raw_weighted_load), the more slowly this
advance proceeds compared to the wall clock.
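Plugging numbers into that scaling (assuming NICE_0_LOAD is the nice-0 weight of 1024, as in the CFS source; a sketch, not the kernel's fixed-point code):

```c
#include <assert.h>

#define NICE_0_LOAD 1024LL	/* nice-0 task weight */

/* The scaling from the quoted update_curr() excerpt: with one nice-0
 * task the fair clock tracks the wall clock; with two it advances at
 * half speed; with ten, at a tenth. */
long long calc_delta_fair(long long delta_exec, long long raw_weighted_load)
{
	return delta_exec * NICE_0_LOAD / raw_weighted_load;
}
```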



--
Regards,
vatsa

2007-05-14 14:58:04

by Al Boldi

Subject: Re: fair clock use in CFS

Ingo Molnar wrote:
> the current task is recalculated at scheduler tick time and put into the
> tree at its new position. At a million tasks the fair-clock will advance
> little (or not at all - which at these load levels is our smallest
> problem anyway) so during a scheduling tick in kernel/sched_fair.c
> update_curr() we will have a 'delta_mine' and 'delta_fair' of near zero
> and a 'delta_exec' of ~1 million, so curr->wait_runtime will be
> decreased at 'full speed': delta_exec-delta_mine, by almost a full tick.
> So preemption will occur every sched_granularity (rounded up to the next
> tick) points in time, in wall-clock time.

The only problem I have with this fairness is the server workload that
services requests by fork/thread creation. In such a case, this fairness is
completely counter-productive, as running tasks unfairly inhibit the
creation of peers.

Giving fork/thread creation special priority may alleviate this problem.


Thanks!

--
Al

2007-05-14 15:08:24

by Ingo Molnar

Subject: Re: fair clock use in CFS


* Daniel Hazelton <[email protected]> wrote:

> [...] In a fair scheduler I'd expect all tasks to get the exact same
> amount of time on the processor. So if there are 10 tasks running at
> nice 0 and the current task has run for 20msecs before a new task is
> swapped onto the CPU, the new task and *all* other tasks waiting to
> get onto the CPU should get the same 20msecs. [...]

What happens in CFS is that in exchange for this task's 20 msecs the
other tasks get 2 msecs each (and not only the one that gets on the CPU
next), so each task is handled equally. What I described was the first step
- for each task the same step happens (whenever it gets on the CPU, and
accounted/weighted for the precise period it spent waiting - so the
second task would get +4 msecs credited, the third task +6 msecs, etc.,
etc.).

but really - nothing beats first-hand experience: please just boot into
a CFS kernel and test its precision a bit. You can pick it up from the
usual place:

http://people.redhat.com/mingo/cfs-scheduler/

For example start 10 CPU hogs at once from a shell:

for (( N=0; N < 10; N++ )); do ( while :; do :; done ) & done

[ type 'killall bash' in the same shell to get rid of them. ]

then watch their CPU usage via 'top'. While the system is otherwise idle
you should get something like this after half a minute of runtime:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2689 mingo 20 0 5968 560 276 R 10.0 0.1 0:03.45 bash
2692 mingo 20 0 5968 564 280 R 10.0 0.1 0:03.45 bash
2693 mingo 20 0 5968 564 280 R 10.0 0.1 0:03.45 bash
2694 mingo 20 0 5968 564 280 R 10.0 0.1 0:03.45 bash
2695 mingo 20 0 5968 564 280 R 10.0 0.1 0:03.45 bash
2698 mingo 20 0 5968 564 280 R 10.0 0.1 0:03.45 bash
2690 mingo 20 0 5968 564 280 R 9.9 0.1 0:03.45 bash
2691 mingo 20 0 5968 564 280 R 9.9 0.1 0:03.45 bash
2696 mingo 20 0 5968 564 280 R 9.9 0.1 0:03.45 bash
2697 mingo 20 0 5968 564 280 R 9.9 0.1 0:03.45 bash

with each task having exactly the same 'TIME+' field in top. (the more
equal those fields, the more precise/fair the scheduler is. In the above
output each got its precise share of 3.45 seconds of CPU time.)

then as a next phase of testing please run various things on the system
(without stopping these loops) and try to get CFS "out of balance" -
you'll succeed if you manage to get an unequal 'TIME+' field for them.
Please try _really_ hard to break it. You can run any workload.

Or try massive_intr.c from:

http://lkml.org/lkml/2007/3/26/319

(it uses a much less trivial scheduling pattern to test a scheduler's
scheduling precision.)

$ ./massive_intr 9 10
002765 00000125
002767 00000125
002762 00000125
002769 00000125
002768 00000126
002761 00000126
002763 00000126
002766 00000126
002764 00000126

(the second column is runtime - the more equal, the more precise/fair
the scheduler.)

Ingo

2007-05-14 20:21:25

by Ting Yang

Subject: Re: fair clock use in CFS


William Lee Irwin III wrote:
> On Mon, May 14, 2007 at 04:52:59PM +0530, Srivatsa Vaddagiri wrote:
>
>> Doesn't EEVDF have the same issue? From the paper:
>> dV(t)/dt = 1/(w1 + w2 + ... + wn)
>>
>
> Who knows what I was smoking, then. I misremembered the scale factor
> as being on the other side of comparisons with the queue's clock. I'm
> suspicious of EEVDF's timekeeping now as well.
>
>
Both CFS and EEVDF use the queue virtual time (VT) to measure the total
amount of work done so far; VT maps to a different real-time scale as the
workload in the system varies. It provides a measure for a task to check
whether it has gone ahead or fallen behind.

Suppose each task p maintains its own virtual time, which is advanced
inversely proportionally to its weight:

VT_p(t + 1) = VT_p(t) + 1/w_p

(in fact, CFS uses this to calculate p->delta_mine, and EEVDF uses this to
decide the deadline for a given slice of work by adding l_p/w_p to the
virtual start time.)

At the time when VT_p(t) = VT(t), i.e. at time t, the virtual time of the
task equals the virtual time of the queue, and the task has received its
entitled share in the interval [0, t]. If VT_p(t) < VT(t), it has fallen
behind where it should be; otherwise it is ahead of where it should be.
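In code form, that bookkeeping might look like this (my sketch of the description above, with illustrative names; not CFS's or EEVDF's actual code):

```c
#include <assert.h>

/* Per-task virtual time advances by work/weight, so a heavier task's
 * VT advances more slowly for the same amount of real work. */
double vt_advance(double vt_p, double work, double weight)
{
	return vt_p + work / weight;
}

/* Lag against the queue's virtual time:
 * < 0: the task has fallen behind its entitled share (is owed CPU),
 * > 0: it is ahead of its share, == 0: exactly on its share. */
double vt_lag(double vt_p, double vt_queue)
{
	return vt_p - vt_queue;
}
```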

Both CFS and EEVDF use this measure implicitly to decide when a task
should be executed. The difference is that CFS allows a task of weight
w_i to execute continuously until it gets ahead of where it should be by a
certain amount (tuned and scaled accordingly), while EEVDF has to give out
a slice (since it is deadline driven), forcing a potentially long piece of
work to be done in smaller slices, interleaved with other tasks. Combined
with the eligibility check, EEVDF provides better "fairness" (only in the
sense that work is spread out more evenly over a relatively short window,
since nobody can continuously do more than l_i amount of work) at the cost
of _more_ context switches.

It is really difficult for me to say which one is better. In
particular, the current CFS implementation favors higher-weight tasks.
The granularity used by a higher-weight task is scaled up, which allows it
to get further ahead (as it is possibly more important and should finish
as early as possible), while a lower-weight task has no such ability.
This makes a lot of sense to me.


Ting

2007-05-14 21:24:45

by Ting Yang

Subject: Re: fair clock use in CFS



Ingo Molnar wrote:
> * William Lee Irwin III <[email protected]> wrote:
>
>
>> On Mon, May 14, 2007 at 12:31:20PM +0200, Ingo Molnar wrote:
>>
>>> please clarify - exactly what is a mistake? Thanks,
>>>
>> The variability in ->fair_clock advancement rate was the mistake, at
>> least according to my way of thinking. [...]
>>
>
> you are quite wrong. Let's consider the following example:
>
> we have 10 tasks running (all at nice 0). The current task spends 20
> msecs on the CPU and a new task is picked. How much CPU time did that
> waiting task get entitled to during its 20 msecs wait? If fair_clock was
> constant as you suggest then we'd give it 20 msecs - but its true 'fair
> expectation' of CPU time was only 20/10 == 2 msecs!
>
Exactly. Once the queue's virtual time is used, all other measures should
be scaled onto virtual time for consistency, since at different times a
unit of virtual time maps to a different amount of wall clock time.

In CFS, p->wait_runtime is in fact the lag against the ideal system,
that is, the difference between the amount of work "really" done so far
and the amount of work that "should" have been done so far. CFS keeps a
record for each task indicating how far away it is from the "should"
case. If a task has p->wait_runtime = 0, it must have received exactly
the share it was entitled to until now. Similarly, negative means faster
than it "should" be and positive means slower than it "should" be.
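This lag bookkeeping can be modeled with a few toy helpers (illustrative
only; plain integer math, not the kernel's fixed-point code). With total
load L, over a wall-clock interval dt every task of weight w_i "deserves"
dt * w_i / L:

```c
#include <assert.h>

/* Entitlement of a task of the given weight over dt wall msecs. */
static long fair_share(long dt, long weight, long total_load)
{
    return dt * weight / total_load;
}

/* The running task overdraws: its lag drops by what it ran beyond
 * its entitlement. */
static long run_update(long wait_runtime, long dt,
                       long weight, long total_load)
{
    return wait_runtime - (dt - fair_share(dt, weight, total_load));
}

/* A waiting task accumulates its entitlement as positive lag. */
static long wait_update(long wait_runtime, long dt,
                        long weight, long total_load)
{
    return wait_runtime + fair_share(dt, weight, total_load);
}
```

In the thread's 10-task example, a nice-0 task that runs for 20 msecs
ends up at a lag of -18 msecs, while each of the nine waiters gains 2
msecs of positive lag.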

I guess CFS may provide better "fairness" if it controls how much
negative wait_runtime a task can accumulate. A higher priority allows
more negative lag to be accumulated and a lower priority allows less.
CFS already does something like this by scaling the preemption
granularity based on weight; the only issue is that the amount of
negative wait_runtime that can be accumulated is proportional to weight,
which potentially can be O(n).

Is it possible to do something like this in check_preemption?

delta = curr->fair_key - first->fair_key;

if (delta > ??? [scale it as you wish] ||
(curr->key > first->key) && (curr->wait_runtime > ???
[simple function of curr->weight]) )
preempt

A limit control on wait_runtime may prevent a high-weight task from
running for too long, so that others get executed a little earlier. Just
a thought :-)


Ting.

2007-05-14 23:23:19

by William Lee Irwin III

[permalink] [raw]
Subject: Re: fair clock use in CFS

On Mon, May 14, 2007 at 12:31:20PM +0200, Ingo Molnar wrote:
>>> please clarify - exactly what is a mistake? Thanks,

* William Lee Irwin III <[email protected]> wrote:
>> The variability in ->fair_clock advancement rate was the mistake, at
>> least according to my way of thinking. [...]

On Mon, May 14, 2007 at 01:50:49PM +0200, Ingo Molnar wrote:
> you are quite wrong. Lets consider the following example:
> we have 10 tasks running (all at nice 0). The current task spends 20
> msecs on the CPU and a new task is picked. How much CPU time did that
> waiting task get entitled to during its 20 msecs wait? If fair_clock was
> constant as you suggest then we'd give it 20 msecs - but its true 'fair
> expectation' of CPU time was only 20/10 == 2 msecs!

The amount of time to which the task was entitled remains the same,
delta_exec*curr->load_weight/get_rq_load(rq). Where the timekeeping
goes wrong is when trying to divide out changes in the virtual time.

Where I'm actually wrong is that using wall clock time doesn't resolve
it because there still isn't an integral to divide out. The running
average is a closer approximation. Possibly better would be to update
an explicit integral at the time of changes to ->raw_weighted_load and
to store a starting value of the integral for use as a subtrahend to
form a difference to use in lieu of computations now used for delta_fair.
It's a step function so it's not computationally intensive. What's now
used is somewhat more ad hoc.
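One way to read this suggestion (my interpretation; the struct and the
fixed-point scale are hypothetical, not kernel API) is to keep an
explicit integral of 1/load(t), advanced only when the load changes, so
that fair time elapsed over an interval is just a difference of two
samples of the integral:

```c
#include <assert.h>

#define SCALE 1024  /* illustrative fixed-point scale */

struct fair_integral {
    long long value;     /* integral of SCALE/load dt so far */
    long long last_wall; /* wall time of the last update */
    long      load;      /* current runqueue load */
};

/* Advance the integral across the step since the last update. */
static void integral_update(struct fair_integral *fi, long long now)
{
    fi->value += (now - fi->last_wall) * SCALE / fi->load;
    fi->last_wall = now;
}

/* Called at each ->raw_weighted_load change: close out the old step,
 * then switch to the new load. It's a step function, so this is the
 * only place any work happens. */
static void integral_set_load(struct fair_integral *fi,
                              long long now, long new_load)
{
    integral_update(fi, now);
    fi->load = new_load;
}
```

Fair time elapsed between two points is then value(t1) - value(t0),
used in lieu of the current delta_fair computation.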


On Mon, May 14, 2007 at 01:50:49PM +0200, Ingo Molnar wrote:
> So a 'constant' fair_clock would turn the whole equilibrium upside down
> (it would inflate p->wait_runtime values and the global sum would not be
> roughly constant anymore but would run up very fast), especially during
> fluctuating loads.
> the fair_clock is the fundamental expression of "fair CPU timeline", and
> task's expected runtime is always measured by that, not by the real
> clock. The only time when we measure the true time is when a _single_
> task runs on the CPU - but in that case the task truly spent a small
> amount of time on the CPU, exclusively. See the exec_time calculations
> in kernel/sched_fair.c.

My thought here revolves around the divisor of p->wait_runtime.

The focus on the global sum is interesting. It suggests an \ell^1
(sum of absolute values) aggregate lag metric, but it appears you're
looking at the lags as signed in this sum. The field is at least signed.
More typical is attempting to minimize \ell^\infty (maximum absolute
value) aggregate lag metrics.

I'm suspicious of this global sum of signed lags. I would expect it
to deviate from this constraint significantly due to arriving and
departing tasks. The vector of lags is merely restricted to lie near a
plane i.e. |\sum_t lag(t)| < K as opposed to within some distance from
the origin in some \ell^p norm i.e. \sum_t |lag(t)|^p < K with the
usual limiting convention for p = \infty. The latter is the requirement
for constant lag bounds.

Of course, this sum treatment will largely work out anyway given that
tasks with larger positive lags are favored and larger negative lags
penalized.


-- wli

2007-05-14 23:57:29

by William Lee Irwin III

[permalink] [raw]
Subject: Re: fair clock use in CFS

* William Lee Irwin III <[email protected]> wrote:
>> [...] I'm suspicious of EEVDF's timekeeping now as well.

On Mon, May 14, 2007 at 02:04:05PM +0200, Ingo Molnar wrote:
> well, EEVDF is a paper-only academic scheduler, one out of thousands
> that never touched real hardware. For nearly every mathematically
> possible scheduling algorithm i suspect there's at least one paper out
> there already describing it ;-) But this time the EEVDF paper got it
> right and you got it wrong. But really, if you want to make an impact on
> what CFS looks like then please try it and send test feedback and/or
> patches - not words. B-)

Perhaps the paper was inspired by preexisting code. Perhaps the code
dated back to the 70's.

The weight consistency testcase is largely done on account of Davide
Libenzi's already having written such a load generator apart from
incorporating the already-written code to postprocess the results.

I have no particular need to rely on the cfs patches for whatever
scheduler patches I'd care to write. I suppose I could write some if
I feel particularly inspired. Code review is not my forte anyway.


-- wli

2007-05-15 00:57:55

by Ting Yang

[permalink] [raw]
Subject: Re: fair clock use in CFS



>
>
> It is possible to do something like this in check_preemption ?
>
> delta = curr->fair_key - first->fair_key;
>
> if (delta > ??? [scale it as you wish] ||
> (curr->key > first->key) && (curr->wait_runtime > ???
> [simple funtion of curr->weight]) )
> preempt
Forgive me, there is a typo in the above check. It should be "less than"
since we are controlling negative lags.

(curr->key > first->key) && (curr->wait_runtime < ??? [simple
function of curr->weight]) )

>
> A limit control on wait_runtime may prevent a high weight task from
> running for too long, so that others get executed a little earlier.
> Just a thought :-)
>
>
Ting

2007-05-15 03:00:09

by David Schwartz

[permalink] [raw]
Subject: RE: fair clock use in CFS


> Either you have a strange definition of fairness or you chose an
> extremely poor example, Ingo. In a fair scheduler I'd expect all tasks
> to get the exact same amount of time on the processor.

Yes, as a long-term average. However, that is impossible to do in the
short term. Some task has to run first and some task has to run last.
Whatever task runs last has had no CPU for all the time the other tasks
were running.

> So if there are 10 tasks running at nice 0 and the current task has run
> for 20msecs before a new task is swapped onto the CPU, the new task and
> *all* other tasks waiting to get onto the CPU should get the same
> 20msecs. What you've described above is fundamentally unfair - one
> process running for 20msecs while the 10 processes that are waiting for
> their chance each get a period that increases from a short period at a
> predictable rate.

The thing you're missing is that you are jumping into the middle of a system
that operates based on history. You are assuming concurrent creation of ten
tasks with no history and that for some reason the first task gets to run
for 20 milliseconds. The scheduler has to compensate for that past or it's
not fair.

> Some numbers based on your above description:
> Process 1 runs for 20msecs
> Process 2 runs for 2msecs (20/10 == 2msecs)
> Process 3 runs for 2.2msecs (has waited 22msecs, 22/10 == 2.2)
> Process 4 runs for 2.4msecs (has waited 24.2msecs - rounded for brevity)
> Process 5 runs for 2.7msecs (has waited 26.6msecs - rounded for brevity)
> process 6 runs for 3msecs (has waited 30.3msecs)
> process 7 runs for 3.3msecs (has waited approx. 33msecs)
> process 8 runs for 3.6msecs (has waited approx. 36msecs)
> process 9 runs for 3.9msecs (has waited approx. 39msecs)
> process 10 runs for 4.2msecs (has waited approx. 42msecs)

This is the scheduler making up for the assumed past that got you into
this situation. It cannot simply predict a perfect distribution and apply
it, because when it chooses how much CPU time to give process 2, it has
no idea how many processes will be ready to run when it chooses how much
CPU time to give process 10.
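The ramp in the quoted numbers can be reproduced with a tiny model (my
own reading of the figures, not CFS code): each successive waiter runs
for (the wall time it has waited so far) / nprocs, so the quanta grow as
the backlog is worked off:

```c
#include <assert.h>
#include <math.h>

/* Quantum granted to the next waiter after it has waited waited_ms
 * milliseconds among nprocs runnable tasks. */
static double next_quantum(double waited_ms, int nprocs)
{
    return waited_ms / nprocs;
}
```

So the second task gets 20/10 = 2 msecs, the third 22/10 = 2.2 msecs,
the fourth 24.2/10 = 2.42 msecs, and so on, matching the list above.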

> Now if the "process time" isn't scaled to match the length of time that
> the process has spent waiting to get on the CPU you get some measure of
> fairness back, but even then the description of CFS you've given shows
> a fundamental unfairness.

What is that unfairness? That if there are 10 processes that want to run,
all things being equal, someone has to go first and someone has to go last?
That over a very short period of time, one process gets 100% of the CPU and
the others get none?

Yes. This fundamental unfairness is in every scheduler. The issue is what
you do to make up for this and create long-term fairness.

> However, if you meant that "the new process has spent 20msecs waiting
> to get on the CPU", then the rest of your description does show what
> I'd expect from a fair scheduler. If not, then I guess that CFS is only
> "Completely Fair" for significantly large values of "fair".

Same with every scheduler. It's fair over periods of time significantly
longer than the context switch interval.

> Hrm... Okay, so you're saying that "fair_clock" runs slower the more
> processes there are running, to keep down the above run-up in "Time
> Spent on CPU" I noticed based solely on your initial example? If that
> is the case, then I can see the fairness - it's just not visible from a
> really quick look at the code and the simplified description you gave
> earlier.

You have to look over a longer period of time, taking into account that
the scheduler also has to compensate for past inequities.

DS