2007-05-31 15:10:23

by Ingo Molnar

[permalink] [raw]
Subject: [patch] CFS scheduler, -v15


i'm pleased to announce release -v15 of the CFS scheduler patchset.

The CFS rolled-up patch against v2.6.22-rc3, v2.6.22-rc3-mm1,
v2.6.21.1/3 or v2.6.20.10 can be downloaded from the usual place:

http://people.redhat.com/mingo/cfs-scheduler/

-v15 includes smaller fixes only. More precise sched_info statistics
from Balbir Singh, interactivity tweaks from Mike Galbraith and a number
of corner-cases fixed/cleaned up by Dmitry Adamushko.

Changes since -v14:

- more precise sched_info statistics (Balbir Singh)

- call update_curr() when preempting to an RT task (Dmitry Adamushko)

- smaller interactivity tweaks (Mike Galbraith)

- apply runtime-limit to yield_to() as well (Dmitry Adamushko)

- load-balancing iterator cleanup/simplification (Dmitry Adamushko)

- fix code duplication (noticed by Li Yu)

- cleanups (Mike Galbraith)

- fix CPU usage accounting of threadeded apps in 'top'

- more cleanups

As usual, any sort of feedback, bugreport, fix and suggestion is more
than welcome!

Ingo


2007-06-06 06:34:58

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v15

On Thu, May 31, 2007 at 05:09:08PM +0200, Ingo Molnar wrote:
> i'm pleased to announce release -v15 of the CFS scheduler patchset.

Do I smell a bug when a task switches its scheduling classes?

Lets say a task was in real-time class for a long time and switches to
fair-sched class. When update_curr() is called on this new fair-sched
task, some of required fields (for ex: exec_start) are uninitialized and can
lead to bogus calculations?

Also, real-time tasks are currently influencing fair clock calculations
(because there is only one raw_weighted_load field per runqueue). Is
that intentional?

--
Regards,
vatsa

2007-06-06 07:03:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v15


* Srivatsa Vaddagiri <[email protected]> wrote:

> > i'm pleased to announce release -v15 of the CFS scheduler patchset.
>
> Do I smell a bug when a task switches its scheduling classes?
>
> Lets say a task was in real-time class for a long time and switches to
> fair-sched class. When update_curr() is called on this new fair-sched
> task, some of required fields (for ex: exec_start) are uninitialized
> and can lead to bogus calculations?

yep, you are right. Dmitry Adamushko and Balbir Singh are working on
this area, and my tree already contains the fixes for rt task's
exec_start. (this also gives us the ability to have precise 'top'
statistics for RT tasks.) So this should be fixed in -v16. (Since it's
quite uncommon for a task to drop in and out of scheduling classes, this
problem was found via review, not via any user-visible regression.)

> Also, real-time tasks are currently influencing fair clock
> calculations (because there is only one raw_weighted_load field per
> runqueue). Is that intentional?

yep - the 'fair clock' in essence freezes when RT tasks are running (but
not completely). Can you think of any bad effect from this perhaps? It's
useful to have unified load-balancing (smpnice-driven) between the RT
and fair classes.

Ingo

2007-06-06 07:35:24

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v15

On Wed, Jun 06, 2007 at 09:01:43AM +0200, Ingo Molnar wrote:
> [...] and my tree already contains the fixes for rt task's
> exec_start.

Can I have this snapshot pls? I have to deal with the same issue when
the current task switches groups and I was planning to fix it by
introducing a set_curr_task() method in fair_sched_class which
initializes exec_start and other fields for that task.

> > Also, real-time tasks are currently influencing fair clock
> > calculations (because there is only one raw_weighted_load field per
> > runqueue). Is that intentional?
>
> yep - the 'fair clock' in essence freezes when RT tasks are running (but
> not completely). Can you think of any bad effect from this perhaps?

hmm ..i was thinking of the case where currently running task belongs to
fair_sched class but the raw_weighted_load field of runqueue includes
weight from real time tasks (which would have affected the delta_mine
calculation). Now I realize that it is not possible because real-time
tasks always preempt fair-sched tasks. So possibly this is not an issue.

> It's useful to have unified load-balancing (smpnice-driven) between the RT
> and fair classes.

Sure ..the unified weight number is usefull to pick the busiest
group/queue during load balance. I am retaining this aspect in my
patches (which I am preparing to send against -v15 atm).

--
Regards,
vatsa

2007-06-06 09:08:36

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v15

On 06/06/07, Srivatsa Vaddagiri <[email protected]> wrote:
> On Wed, Jun 06, 2007 at 09:01:43AM +0200, Ingo Molnar wrote:
> > [...] and my tree already contains the fixes for rt task's
> > exec_start.
>
> Can I have this snapshot pls? I have to deal with the same issue when
> the current task switches groups and I was planning to fix it by
> introducing a set_curr_task() method in fair_sched_class which
> initializes exec_start and other fields for that task.

Hum.. what about accounting 'exec_time' and updating 'exec_start' in
rt_sched :: dequeue_task_rt() instead (like update_curr() does it in
dequeue_task_fair())?

This way, on RT -> NORMAL transition.. some 'delta_exec' ( between
deactivate_task() ---> activate_task() ) will be accounted later as if
the task was 'sched_fair_class' during this time.. which I think makes
some sense. What do you think?

sched_setscheduler()
{
...
on_rq = p->on_rq;
if (on_rq)
deactivate_task(rq, p, 0);
oldprio = p->prio;
__setscheduler(rq, p, policy, param->sched_priority);
if (on_rq) {
activate_task(rq, p, 0);
...


> --
> Regards,
> vatsa
>

--
Best regards,
Dmitry Adamushko

2007-06-06 10:39:41

by Balbir Singh

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v15

Dmitry Adamushko wrote:
> On 06/06/07, Srivatsa Vaddagiri <[email protected]> wrote:
>> On Wed, Jun 06, 2007 at 09:01:43AM +0200, Ingo Molnar wrote:
>> > [...] and my tree already contains the fixes for rt task's
>> > exec_start.
>>
>> Can I have this snapshot pls? I have to deal with the same issue when
>> the current task switches groups and I was planning to fix it by
>> introducing a set_curr_task() method in fair_sched_class which
>> initializes exec_start and other fields for that task.
>
> Hum.. what about accounting 'exec_time' and updating 'exec_start' in
> rt_sched :: dequeue_task_rt() instead (like update_curr() does it in
> dequeue_task_fair())?
>
> This way, on RT -> NORMAL transition.. some 'delta_exec' ( between
> deactivate_task() ---> activate_task() ) will be accounted later as if
> the task was 'sched_fair_class' during this time.. which I think makes
> some sense. What do you think?
>

Why not do it explicitly in __setscheduler() if the new policy is SCHED_NORMAL
or SCHED_BATCH. You could also add the smarts to deactivate_task()

> sched_setscheduler()
> {
> ...
> on_rq = p->on_rq;
> if (on_rq)
> deactivate_task(rq, p, 0);
> oldprio = p->prio;
> __setscheduler(rq, p, policy, param->sched_priority);
> if (on_rq) {
> activate_task(rq, p, 0);
> ...
>
>
>> --
>> Regards,
>> vatsa
>>
>


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2007-06-06 10:51:56

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v15

On Wed, Jun 06, 2007 at 04:07:01PM +0530, Balbir Singh wrote:
> Dmitry Adamushko wrote:
> > On 06/06/07, Srivatsa Vaddagiri <[email protected]> wrote:
> >> On Wed, Jun 06, 2007 at 09:01:43AM +0200, Ingo Molnar wrote:
> >> > [...] and my tree already contains the fixes for rt task's
> >> > exec_start.
> >>
> >> Can I have this snapshot pls? I have to deal with the same issue when
> >> the current task switches groups and I was planning to fix it by
> >> introducing a set_curr_task() method in fair_sched_class which
> >> initializes exec_start and other fields for that task.
> >
> > Hum.. what about accounting 'exec_time' and updating 'exec_start' in
> > rt_sched :: dequeue_task_rt() instead (like update_curr() does it in
> > dequeue_task_fair())?

[For some reason Dimitry's mail has not made it to my inbox yet ..]

> > This way, on RT -> NORMAL transition.. some 'delta_exec' ( between
> > deactivate_task() ---> activate_task() ) will be accounted later as if
> > the task was 'sched_fair_class' during this time.. which I think makes
> > some sense. What do you think?
> >
>
> Why not do it explicitly in __setscheduler() if the new policy is SCHED_NORMAL
> or SCHED_BATCH.

Yes this is the approach I prefer, because we burden the fast/normal
path less that way (RT->NORMAL transition is not common). That's why I
was considering a set_curr_task() method in fair_sched_class which will
be invoked in __setscheduler() if the new policy of currently running
task happens to be SCHED_NORMAL/BATCH. Alternately if the new policy of
currently running task happens to be SCHED_FIFO (and its old policy was
SCHED_NORMAL) we need to invoke put_prev_task() method (so that
fair_clock etc is updated based on outgoing task's execution time in
SCHED_NORMAL class).

>You could also add the smarts to deactivate_task()


--
Regards,
vatsa

2007-06-06 11:19:17

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v15

On 06/06/07, Srivatsa Vaddagiri <[email protected]> wrote:
> [ ... ]
> > > This way, on RT -> NORMAL transition.. some 'delta_exec' ( between
> > > deactivate_task() ---> activate_task() ) will be accounted later as if
> > > the task was 'sched_fair_class' during this time.. which I think makes
> > > some sense. What do you think?
> > >
> >
> > Why not do it explicitly in __setscheduler() if the new policy is SCHED_NORMAL
> > or SCHED_BATCH.
>
> Yes this is the approach I prefer, because we burden the fast/normal
> path less that way (RT->NORMAL transition is not common).

I don't think that rt_sched_class :: dequeue_task_rt() is in any of
such "fast pathes" that we should really care about an additional
math. operation.

If this approach is ok, logically-wise (no side effects from a short
'delta_exec', esp. on RT -> NORMAL).. I think it's better as it keeps
the 'sched_class' interface simpler.

> That's why I
> was considering a set_curr_task() method in fair_sched_class which will
> be invoked in __setscheduler() if the new policy of currently running
> task happens to be SCHED_NORMAL/BATCH. Alternately if the new policy of
> currently running task happens to be SCHED_FIFO (and its old policy was
> SCHED_NORMAL) we need to invoke put_prev_task() method (so that
> fair_clock etc is updated based on outgoing task's execution time in
> SCHED_NORMAL class).

rt_sched_class :: put_prev_task() from __setscheduler() ? But it's not
supposed to be called from here, logically-wise. You just rely on its
current behavior (which is only about updating 'exec_start' and
'exec_sum') -- that's just bad. Maybe I misunderatood your intention
though..


> --
> Regards,
> vatsa
>

--
Best regards,
Dmitry Adamushko

2007-06-06 11:29:17

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v15

On Wed, Jun 06, 2007 at 01:19:01PM +0200, Dmitry Adamushko wrote:
> >Yes this is the approach I prefer, because we burden the fast/normal
> >path less that way (RT->NORMAL transition is not common).
>
> I don't think that rt_sched_class :: dequeue_task_rt() is in any of
> such "fast pathes" that we should really care about an additional
> math. operation.
>
> If this approach is ok, logically-wise (no side effects from a short
> 'delta_exec', esp. on RT -> NORMAL).. I think it's better as it keeps
> the 'sched_class' interface simpler.

I can see your point, given that we just update exec_start in
dequeue_task_rt(). However the situation is slightly more complex in my
patch stack, as I need to update other things during group change. Perhaps
we can postpone this discussion until I post that patch stack.

> >That's why I
> >was considering a set_curr_task() method in fair_sched_class which will
> >be invoked in __setscheduler() if the new policy of currently running
> >task happens to be SCHED_NORMAL/BATCH. Alternately if the new policy of
> >currently running task happens to be SCHED_FIFO (and its old policy was
> >SCHED_NORMAL) we need to invoke put_prev_task() method (so that
> >fair_clock etc is updated based on outgoing task's execution time in
> >SCHED_NORMAL class).
>
> rt_sched_class :: put_prev_task() from __setscheduler() ?

No, fair_sched_class :: put_prev_task() if we are transitioning from
NORMAL->RT. That will update the fair_clock based on execution time
of current task in fair_sched class?

> But it's not
> supposed to be called from here, logically-wise. You just rely on its
> current behavior (which is only about updating 'exec_start' and
> 'exec_sum') -- that's just bad. Maybe I misunderatood your intention
> though..

--
Regards,
vatsa

2007-06-06 11:38:35

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v15

On Wed, Jun 06, 2007 at 05:07:33PM +0530, Srivatsa Vaddagiri wrote:
> No, fair_sched_class :: put_prev_task() if we are transitioning from
> NORMAL->RT. That will update the fair_clock based on execution time
> of current task in fair_sched class?

On seconds thoughts, this may not be necessary as the deactivate_task
(from fair_sched class) will accomplish that.

--
Regards,
vatsa

2007-06-14 12:35:07

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [patch] CFS scheduler, -v15

On Wed, Jun 06, 2007 at 05:07:33PM +0530, Srivatsa Vaddagiri wrote:
> > I don't think that rt_sched_class :: dequeue_task_rt() is in any of
> > such "fast pathes" that we should really care about an additional
> > math. operation.
> >
> > If this approach is ok, logically-wise (no side effects from a short
> > 'delta_exec', esp. on RT -> NORMAL).. I think it's better as it keeps
> > the 'sched_class' interface simpler.
>
> I can see your point, given that we just update exec_start in
> dequeue_task_rt(). However the situation is slightly more complex in my
> patch stack, as I need to update other things during group change. Perhaps
> we can postpone this discussion until I post that patch stack.

Now that I have posted the group scheduling patch stack, we can come
back to this point.

Basically I am maintaining a ->curr field in 'struct lrq' (when
CONFIG_FAIR_GROUP_SCHED is on). It is NULL if the lrq is "inactive" [ none of
its entities are currently running ], otherwise it points to the currentive
active entity. This helps me optimize some functions like update_curr() - when
lrq->curr is NULL, we don't have to update fair_clock etc upon enqueue/dequeue
to the lrq.

pick_next_entity() sets lrq->curr while put_prev_entity resets lrq->curr
to NULL.

Also these lrqs could form a hierarchy. At the top could be a lrq which
holds user-level sched entities on each cpu. Next, each user's lrq could
hold task-level entities.

So when a task moves from RT class to NORMAL class, it will get added to
one of the lrqs (say root user's lrq on CPU0). At that time, I need to:

root_user_lrq[cpu0]->curr = &task_moving->se;
top_lrq[cpu0]->curr = &root_user->se;

I have added a special callback set_curr_task() in fair_sched_class to
accomplish this.

Similarly when a task moves from NORMAL class to RT class, I need to:

root_user_cpu0_lrq->curr = NULL;
top_lrq_cpu0->curr = NULL;

For this I am reusing the put_prev_task() callback.

I have to accomplish the same steps when a task moves from one group to
other also.

Let me know what you think about set_curr_task() callback [in Patch #6]. The
other option was to tackle this requirement in enqueue/dequeue_task (if
current task is being enqueue/dequeued then set/reset lrq->curr), which I felt
add unnecessary overhead in normal path (a current task for ex gets
dequeued when it goes to sleep).

--
Regards,
vatsa