2007-05-03 07:46:04

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v7


* Al Boldi <[email protected]> wrote:

> > i'm pleased to announce release -v7 of the CFS scheduler patchset.
> > (The main goal of CFS is to implement "desktop scheduling" with as
> > high quality as technically possible.)
> :
> :
> > As usual, any sort of feedback, bugreport, fix and suggestion is
> > more than welcome,
>
> This one seems on par with SD, [...]

excellent :-)

> [...] but there are still some nice issues.
>
> Try running 3 chew.c's, then renicing one to -10, starves others for
> some seconds while switching prio-level. Now renice it back to 10, it
> starves for up to 45sec.

ok - to make sure i understood you correctly: does this starvation only
occur right when you renice it (when switching prio levels), and it gets
rectified quickly once they get over a few reschedules?

> Also, nice levels are only effective on every other step; ie:
> ... -3/-2 , -1/0 , 1/2 ... yields only 20 instead of 40 prio-levels.

yeah - this is a first-approximation thing.

Some background: in the upstream scheduler (and in SD) nice levels are
linearly scaled, while in CFS they are exponentially scaled. I did this
because i believe exponential is more logical: regardless of which nice
level a task uses, if it goes +2 nice levels up then it will halve its
"fair CPU share". So for example the CPU consumption delta between nice
0 and nice +10 is 1/32 - and so is the delta between -5 and +5, -10 and
-5, etc. This makes nice levels _a lot_ more potent than upstream's
linear approach.
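
To put concrete numbers on this, a minimal userspace sketch (an
illustration only, assuming the relation described above - the share
halves for every +2 nice levels, i.e. roughly 2^(-nice/2) relative to
nice 0 - not the kernel's actual fixed-point code):

/* print the relative CPU weight implied by exponential nice scaling */
#include <math.h>
#include <stdio.h>

int main(void)
{
	int nice;

	for (nice = -10; nice <= 10; nice += 5)
		printf("nice %3d -> %8.4fx the share of a nice 0 task\n",
		       nice, pow(2.0, -nice / 2.0));
	return 0;
}

this prints 32x for nice -10 and ~1/32 (0.0312x) for nice +10, matching
the deltas above.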

Ingo


2007-05-03 08:07:19

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v7


* Ingo Molnar <[email protected]> wrote:

> > [...] but there are still some nice issues.
> >
> > Try running 3 chew.c's, then renicing one to -10, starves others for
> > some seconds while switching prio-level. Now renice it back to 10,
> > it starves for up to 45sec.
>
> ok - to make sure i understood you correctly: does this starvation
> only occur right when you renice it (when switching prio levels), and
> it gets rectified quickly once they get over a few reschedules?

meanwhile i managed to reproduce it by following the exact steps you
described, and i've fixed the bug in my tree. Can you confirm that the
patch below fixes it for you too?

Ingo

----------------->
From: Ingo Molnar <[email protected]>
Subject: [patch] sched, cfs: fix starvation upon nice level switching

Al Boldi reported the following bug: when a CPU-intense task's nice
level is switched, tasks can get unfairly starved right after the
priority-level switch. The bug was that when changing the load_weight,
the ->wait_runtime value did not get rescaled. So clear wait_runtime
when switching nice levels.

Signed-off-by: Ingo Molnar <[email protected]>

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -575,6 +580,7 @@ static void set_load_weight(struct task_
 {
 	p->load_shift = get_load_shift(p);
 	p->load_weight = 1 << p->load_shift;
+	p->wait_runtime = 0;
 }

static inline void

2007-05-03 08:38:10

by Al Boldi

Subject: Re: [patch] CFS scheduler, -v7

Ingo Molnar wrote:
> * Al Boldi <[email protected]> wrote:
> > > i'm pleased to announce release -v7 of the CFS scheduler patchset.
> > > (The main goal of CFS is to implement "desktop scheduling" with as
> > > high quality as technically possible.)
> > >
> > >
> > > As usual, any sort of feedback, bugreport, fix and suggestion is
> > > more than welcome,
> >
> > This one seems on par with SD, [...]
>
> excellent :-)
>
> > [...] but there are still some nice issues.
> >
> > Try running 3 chew.c's, then renicing one to -10, starves others for
> > some seconds while switching prio-level. Now renice it back to 10, it
> > starves for up to 45sec.
>
> ok - to make sure i understood you correctly: does this starvation only
> occur right when you renice it (when switching prio levels),

Yes.

> and it gets
> rectified quickly once they get over a few reschedules?

Well, depending on the nice level, this delay may be more than 45 sec.

And, in cfs-v8 there is an additional repeating latency blip akin to an
expiry when running procs at different nice levels. chew.c shows this
clearly.
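
For reference, a minimal chew-like latency probe - only a rough
stand-in for chew.c (the 10 ms threshold here is made up), it just
spins and reports whenever the gap between two iterations shows it was
off the CPU for too long:

/* busy-loop and report scheduling gaps larger than 10 ms */
#include <stdio.h>
#include <time.h>

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
	long long last = now_ns(), t;

	for (;;) {
		t = now_ns();
		if (t - last > 10 * 1000 * 1000)
			printf("blip: %lld ms off the cpu\n",
			       (t - last) / 1000000);
		last = t;
	}
}

Running three copies of something like this and renicing one of them
shows the blips directly.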

> > Also, nice levels are only effective on every other step; ie:
> > ... -3/-2 , -1/0 , 1/2 ... yields only 20 instead of 40 prio-levels.
>
> yeah - this is a first-approximation thing.
>
> Some background: in the upstream scheduler (and in SD) nice levels are
> linearly scaled, while in CFS they are exponentially scaled. I did this
> because i believe exponential is more logical: regardless of which nice
> level a task uses, if it goes +2 nice levels up then it will halve its
> "fair CPU share". So for example the CPU consumption delta between nice
> 0 and nice +10 is 1/32 - and so is the delta between -5 and +5, -10 and
> -5, etc. This makes nice levels _a lot_ more potent than upstream's
> linear approach.

Actually, I think 1/32 for +10 is a bit too strong. Introducing a scale-factor
tunable may be useful.

Also, don't you think it reasonable to lower-bound the timeslices?


Thanks!

--
Al

2007-05-03 11:12:53

by Al Boldi

Subject: Re: [patch] CFS scheduler, -v7

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
> > > [...] but there are still some nice issues.
> > >
> > > Try running 3 chew.c's, then renicing one to -10, starves others for
> > > some seconds while switching prio-level. Now renice it back to 10,
> > > it starves for up to 45sec.
> >
> > ok - to make sure i understood you correctly: does this starvation
> > only occur right when you renice it (when switching prio levels), and
> > it gets rectified quickly once they get over a few reschedules?
>
> meanwhile i managed to reproduce it by following the exact steps you
> described, and i've fixed the bug in my tree. Can you confirm that the
> patch below fixes it for you too?

Seems like this fixed it. But I can still see these awful latency blips in
the presence of negatively niced chew.c at -10 and two chew.c's at nice 0.
This gets really bad when sched_granularity_ns >= 5,000,000.


Thanks!

--
Al

2007-05-03 12:37:05

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v7


* Al Boldi <[email protected]> wrote:

> [...] But I can still see these awful latency blips in the presence of
> negatively niced chew.c at -10 and two chew.c's at nice 0. [...]

of course: you asked for the two chew's to be treated like that and CFS
delivered it! :-)

nice -10 means the reniced chew will get ~90+% of the CPU time, and the
two other nice 0 chews will together get <10% of the CPU time.

in the previous mail i have described the new exponential-scale nice
levels that CFS introduces. In practice this means that vanilla kernel's
nice -20 level is roughly equivalent to CFS's nice -6. CFS's nice -10
would be roughly equivalent to vanilla nice -80 (if it existed).
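
Worked numbers for the scenario above - a sketch, assuming the
2^(-nice/2) weights from the previous mail (i.e. 32:1:1 for the
-10/0/0 chews), not kernel code:

/* expected CPU shares for one nice -10 chew and two nice 0 chews */
#include <stdio.h>

int main(void)
{
	double w[] = { 32.0, 1.0, 1.0 };	/* nice -10, nice 0, nice 0 */
	double sum = w[0] + w[1] + w[2];
	int i;

	for (i = 0; i < 3; i++)
		printf("chew %d: %.1f%% of the CPU\n", i, 100.0 * w[i] / sum);
	return 0;
}

that gives ~94% for the nice -10 chew and ~3% for each nice 0 chew.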

Ingo

2007-05-03 13:45:29

by Al Boldi

Subject: Re: [patch] CFS scheduler, -v7

Ingo Molnar wrote:
> * Al Boldi <[email protected]> wrote:
> > [...] But I can still see these awful latency blips in the presence of
> > negatively niced chew.c at -10 and two chew.c's at nice 0. [...]
>
> of course: you asked for the two chew's to be treated like that and CFS
> delivered it! :-)
>
> nice -10 means the reniced chew will get ~90+% of the CPU time, and the
> two other nice 0 chews will together get <10% of the CPU time.

Yes, but the latencies fluctuate wildly from 5ms to
sched_granularity_ns*1000. Isn't it possible to smooth this?


Thanks!

--
Al

2007-05-03 15:02:20

by Ting Yang

Subject: Re: [patch] CFS scheduler, -v7

Hi, Ingo

This is the test case that I think is worth discussing, and it led me
to find two things.

>> [...] but there are still some nice issues.
>>
>> Try running 3 chew.c's, then renicing one to -10, starves others for
>> some seconds while switching prio-level. Now renice it back to 10, it
>> starves for up to 45sec.
>>
>
> ok - to make sure i understood you correctly: does this starvation only
> occur right when you renice it (when switching prio levels), and it gets
> rectified quickly once they get over a few reschedules?
The main part of the problem Al Boldi saw might come from this piece of
code in sched_fair.c, which scales the fair_key difference needed to
preempt the current task:

+rescale_load(struct task_struct *p, u64 value)
+{
+	int load_shift = p->load_shift;
+
+	if (load_shift == SCHED_LOAD_SHIFT)
+		return value;
+
+	return (value << load_shift) >> SCHED_LOAD_SHIFT;
+}
+
+static u64
+niced_granularity(struct rq *rq, struct task_struct *curr,
+		  unsigned long granularity)
+{
+	return rescale_load(curr, granularity);
+}

Here is the checking for pre-emption:

+static inline void
+__check_preempt_curr_fair(struct rq *rq, struct task_struct *p,
+			  struct task_struct *curr, unsigned long granularity)
+{
+	s64 __delta = curr->fair_key - p->fair_key;
+
+	/*
+	 * Take scheduling granularity into account - do not
+	 * preempt the current task unless the best task has
+	 * a larger than sched_granularity fairness advantage:
+	 */
+	if (__delta > niced_granularity(rq, curr, granularity))
+		resched_task(curr);
+}

This code now effectively says that the fair_key difference needed to
preempt the current task is amplified by a factor of its weight (in Al
Boldi's example, 32). However, the weighted task already advances its
p->fair_key at a rate slowed down by its weight (also 32 here). The
combination of the two becomes quadratic!

Let's start from three nice 0 tasks p1, p2, p3 at time t=0, with
niced_granularity set to 5 ms:
Originally each task executes 5 ms in turn:
        5 ms for p1, 5 ms for p2, 5 ms for p3, 5 ms for p1, 5 ms for
        p2, 5 ms for p3 ...
If somehow p3 is re-niced to -10 _right before_ the 16th ms, we run
into the worst case after p3 gets the cpu:
        p1->fair_key = p2->fair_key = 10, p3->fair_key = 5.

Now, in order for p3 to be preempted, its fair_key has to become
5 * 32 larger than p1's and p2's fair_key. Furthermore, p3 now has a
higher weight, so pushing its fair_key up by 1 now needs 32 ms;
thus p3 will stay on the cpu for 5 * 32 * 32 ms, which is about 5 seconds!
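
To make the two factors explicit, a small sketch using the numbers of
this example (the 5 ms granularity rescaled up by p3's weight, and p3's
fair_key advancing at only 1/32 per ms):

/* worst-case starvation after p3 is reniced to -10 */
#include <stdio.h>

int main(void)
{
	int granularity_ms = 5;			/* niced_granularity base */
	int weight = 32;			/* p3's weight at nice -10 */
	int threshold = granularity_ms * weight;	/* rescaled preemption threshold */
	int wait_ms = threshold * weight;	/* key only moves 1/32 per ms */

	printf("p1 and p2 wait ~%d ms (~%.1f s)\n", wait_ms, wait_ms / 1000.0);
	return 0;
}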

Besides this quadratic effect, another, minor issue amplifies this a
little bit further: p->wait_runtime accumulated earlier is not adjusted
during renice to match the new nice value. The p->wait_runtime earned
under the previous weight has to be paid off under the current weight.
If a task is reniced to a larger weight it pays back more than it
should, otherwise it pays back less, which introduces unfairness.

Ingo has now partially solved this problem by clearing
p->wait_runtime when a task is reniced, but the quadratic effect of
the scaling is still there.

Thanks

Ting

2007-05-03 15:17:49

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v7


* Ting Yang <[email protected]> wrote:

> +	s64 __delta = curr->fair_key - p->fair_key;
> +
> +	/*
> +	 * Take scheduling granularity into account - do not
> +	 * preempt the current task unless the best task has
> +	 * a larger than sched_granularity fairness advantage:
> +	 */
> +	if (__delta > niced_granularity(rq, curr, granularity))
> +		resched_task(curr);
> +}
>
> This code now effectively says that the fair_key difference needed to
> preempt the current task is amplified by a factor of its weight (in Al
> Boldi's example, 32). However, the weighted task already advances its
> p->fair_key at a rate slowed down by its weight (also 32 here). The
> combination of the two becomes quadratic!

it's not quadratic in terms of CPU share: the first factor impacts the
CPU share, the second factor impacts the granularity. This means that
reniced workloads will be preempted in a more fine-grained way - but
otherwise there's _no_ quadratic effect for CPU time - which is a
completely separate metric. Remember: there are no timeslices in CFS, so
a task can be preempted any number of times without being at a
disadvantage.

> Besides this quadratic effect, another minor issue amplified this
> a little bit further: p->wait_runtime accumulated before. [...]

actually, this 'minor issue' was the main issue that caused the bug ;-)

Ingo

2007-05-03 16:00:59

by Ting Yang

Subject: Re: [patch] CFS scheduler, -v7

Hi, Ingo

I wrote that email in a hurry and therefore might not have explained the
problem clearly. However, after carefully reading the code again, I still
think there is a problem in this part, so I want to try again :-)
Hopefully, this time I will do a better job.

Starting from the following code:

+	if (__delta > niced_granularity(rq, curr, granularity))
+		resched_task(curr);


Suppose, "curr" has nice value -10, then curr->load_shift = 15.
Granularity passed into this function is
fixed 2,000,000 (for CFS -v8). Let's just divide everything by 1,000,000
for simplicity, say granularity used is 2.

Now, we look at how granularity is rescaled:

+	int load_shift = p->load_shift;
+
+	if (load_shift == SCHED_LOAD_SHIFT)
+		return value;
+
+	return (value << load_shift) >> SCHED_LOAD_SHIFT;

it returns (2 << 15) >> 10 = 2 * 32 = 64, therefore __delta has to be
larger than 64 so that the current process can be preempted.

Suppose, "curr" executes for 1 tick, an timer interrupts comes. It
executes about 1,000,000 (roughly speaking, since timer interrupts come
1000/second). Since we divided everything by 1,000,000, it becomes 1 in
this discussion. After this execution, how much will "curr" increments
its fair_key?
It is weighted: 1/32.

Then how much time is needed for "curr" to build a 2 * 32 difference in
fair_key, when every 1 ms it only advances fair_key by 1/32? 2 * 32 * 32!
On the other hand, for a task with weight 1 (nice value 0), the amount of
work needed before preemption is 2 * 1 * 1.
If we have only 2 tasks running, p1 with nice value -10 and p2 with nice
value 0, then:
        p1 gets cpu share: (32 * 32) / (32 * 32 + 1 * 1)
        p2 gets cpu share: ( 1 *  1) / (32 * 32 + 1 * 1)

I do see a quadratic effect here. Did I miss anything? Sorry to bother
you again, I just want to help :-)

Thanks a lot !

Ting

2007-05-03 19:48:34

by Ingo Molnar

Subject: Re: [patch] CFS scheduler, -v7


* Ting Yang <[email protected]> wrote:

> then how much time is needed for "curr" to build a 2 * 32 difference
> on fair_key, with every 1 ms it updates fair_key by 1/32 ? 2 * 32 *
> 32 !

yes - but the "*32" impacts the rescheduling granularity, while the "/32"
impacts the speed at which the key moves. So the total execution speed of
the nice -10 task is still "*32" that of a nice 0 task - it's just that it
not only gets 32 times more CPU time, it also gets it in 32 times larger
chunks at once. But the rescheduling granularity does _not_ impact the
CPU share the task gets, so there's no quadratic effect.

but this is really simple to test: boot up CFS, start two infinite
loops, one at nice 0 and one at nice +10, and look at them via "top" and
type 's 60' in top to get a really long update interval for precise
results. You won't see quadratically less CPU time used up by the nice
+10 task; you'll see it getting the intended 1/32 share of CPU time.
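
the two predictions are also easy to tell apart numerically - a worked
sketch, assuming the 32:1 weight ratio between nice 0 and nice +10
discussed in this thread (not kernel code):

/* weight-proportional vs. hypothetical quadratic share of the +10 loop */
#include <stdio.h>

int main(void)
{
	double w0 = 32.0, w10 = 1.0;	/* nice 0 vs nice +10 weights */

	printf("weight-proportional: %.2f%%\n",
	       100.0 * w10 / (w0 + w10));
	printf("quadratic:           %.2f%%\n",
	       100.0 * (w10 * w10) / (w0 * w0 + w10 * w10));
	return 0;
}

weight-proportional sharing gives the nice +10 loop ~3% of the CPU
(1/32nd of what the nice 0 loop gets), while a quadratic effect would
leave it only ~0.1% - a difference top makes obvious.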

Ingo

2007-05-03 19:57:09

by William Lee Irwin III

Subject: Re: [patch] CFS scheduler, -v7

* Ting Yang <[email protected]> wrote:
>> then how much time is needed for "curr" to build a 2 * 32 difference
>> on fair_key, with every 1 ms it updates fair_key by 1/32 ? 2 * 32 *
>> 32 !

On Thu, May 03, 2007 at 09:48:27PM +0200, Ingo Molnar wrote:
> yes - but the "*32" impacts the rescheduling granularity, the "/32"
> impacts the speed of how the key moves. So the total execution speed of
> the nice -10 task is still "*32" that of a nice 0 task - it's just that it
> not only gets 32 times more CPU time, it also gets it in 32 times larger
> chunks at once. But the rescheduling granularity does _not_ impact the
> CPU share the task gets, so there's no quadratic effect.
> but this is really simple to test: boot up CFS, start two infinite
> loops, one at nice 0 and one at nice +10 and look at it via "top" and
> type 's 60' in top to get a really long update interval for precise
> results. You won't see quadratically less CPU time used up by the nice
> +10 task, you'll see it getting the intended 1/32 share of CPU time.

Davide has code to test this more rigorously. Looks like I don't need
to do very much to get a nice test going at all, besides fiddling with
options parsing and maybe a few other things.


-- wli