2001-11-16 20:20:37

by Mike Kravetz

[permalink] [raw]
Subject: Real Time Runqueue

As you may know, a few of us are experimenting with multi-runqueue
scheduler implementations. One area of concern is where to place
realtime tasks. It has been my assumption, that POSIX RT semantics
require a specific ordering of tasks such as SCHED_FIFO and SCHED_RR.
To accommodate this ordering, I further believe that the simplest
solution is to ensure that all realtime tasks reside on the same
runqueue. In our MQ scheduler we have a separate runqueue for all
realtime tasks. The problem is that maintaining a separate realtime
runqueue is a pain and results in some fairly complex/ugly code.

Since I'm not a realtime expert, I would like to ask if my assumption
about strict ordering of RT tasks is accurate. Also, is anyone aware
of other ways to approach this problem?

Thanks,
--
Mike


2001-11-16 20:27:28

by Richard Gooch

[permalink] [raw]
Subject: Re: Real Time Runqueue

Mike Kravetz writes:
> As you may know, a few of us are experimenting with multi-runqueue
> scheduler implementations. One area of concern is where to place
> realtime tasks. It has been my assumption, that POSIX RT semantics
> require a specific ordering of tasks such as SCHED_FIFO and SCHED_RR.
> To accommodate this ordering, I further believe that the simplest
> solution is to ensure that all realtime tasks reside on the same
> runqueue. In our MQ scheduler we have a separate runqueue for all
> realtime tasks. The problem is that maintaining a separate realtime
> runqueue is a pain and results in some fairly complex/ugly code.
>
> Since I'm not a realtime expert, I would like to ask if my assumption
> about strict ordering of RT tasks is accurate. Also, is anyone aware
> of other ways to approach this problem?

Yes, strict ordering is required. Years ago I championed a separate
runqueue for RT tasks. Linus even said he liked the approach. I got
busy and never nursed it to inclusion. The patch is here:
ftp://ftp.atnf.csiro.au/pub/people/rgooch/linux/kernel-patches/v2.1/rtqueue-patch

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-11-16 22:35:53

by Davide Libenzi

[permalink] [raw]
Subject: Re: Real Time Runqueue

On Fri, 16 Nov 2001, Mike Kravetz wrote:

> As you may know, a few of us are experimenting with multi-runqueue
> scheduler implementations. One area of concern is where to place
> realtime tasks. It has been my assumption, that POSIX RT semantics
> require a specific ordering of tasks such as SCHED_FIFO and SCHED_RR.
> To accommodate this ordering, I further believe that the simplest
> solution is to ensure that all realtime tasks reside on the same
> runqueue. In our MQ scheduler we have a separate runqueue for all
> realtime tasks. The problem is that maintaining a separate realtime
> runqueue is a pain and results in some fairly complex/ugly code.
>
> Since I'm not a realtime expert, I would like to ask if my assumption
> about strict ordering of RT tasks is accurate. Also, is anyone aware
> of other ways to approach this problem?

I do not use a separate queue coz, if it's single, it becomes a common
lock for all CPUs.
RT tasks are scheduled as usual and the only problem arises in
reschedule_idle() when an RT task is pushed onto the run queue when
1) on its CPU it is _not_ running the idle
2) on its CPU is running another RT task with higher priority

In that case a "good CPU" discovery loop is triggered, the task is moved
on that CPU runqueue, need_resched is set, an IPI is sent and on return
from the remote CPU IPI path the RT task is run.
A good solution would be ( i'm not doing it now ), in setscheduler() to
move the task in a way to have an even distribution of RT tasks among
CPUs.




- Davide



2001-11-16 22:42:39

by Jesse Pollard

[permalink] [raw]
Subject: Re: Real Time Runqueue

Mike Kravetz <[email protected]>:
>
> As you may know, a few of us are experimenting with multi-runqueue
> scheduler implementations. One area of concern is where to place
> realtime tasks. It has been my assumption, that POSIX RT semantics
> require a specific ordering of tasks such as SCHED_FIFO and SCHED_RR.
> To accommodate this ordering, I further believe that the simplest
> solution is to ensure that all realtime tasks reside on the same
> runqueue. In our MQ scheduler we have a separate runqueue for all
> realtime tasks. The problem is that maintaining a separate realtime
> runqueue is a pain and results in some fairly complex/ugly code.
>
> Since I'm not a realtime expert, I would like to ask if my assumption
> about strict ordering of RT tasks is accurate. Also, is anyone aware
> of other ways to approach this problem?

I used to do real-time (seismic survey navigation - sea, land and aircraft
based systems). I've always admired some of the approaches used by the old
VAX system (we did an adaptation for PDP-11/73 systems).

The operation provided a mixed environment of RR and fixed priority operation.
The core scheduler is based on a bit vector of no larger than 64 fixed priority
queues. Each queue could then be handled in a FIFO or RR manner. Selection
of the queue was done by a "first bit set" selection. This identified the
queue that the process was to be selected. Each queue had a selection fuction
that could implement any choses scheduling algorithm, but we only used FIFO
and RR. Several properties were required:

1. Only runnable processes are permitted to exist in the queues.
2. An empty queue had the corresponding bit value of zero.
3. Any queue with pending processes had the corresponding bit set to 1.

Our adaption took the bit vector, converted it to floating point, and
subtracted the exponent bias from the exponent. This gave us the "first bit
set" in the vector. This index can then be used to select the queue and
the selection algorithim. The return value is always the process to run.
If the current process matches the original value, then return to the
already loaded context; otherwise a context swich was called for. Also note
that the current context contained the queue identifier. This makes it
simple to save the current context. Of course, if the vector were zero then
the idle task was invoked.

We found that most of the time the queues only held one or two processes,
making for a fast selection (we only had one processor so we didn't have
to deal with SMP issues).

When only one process/queue exists you have fixed priority queueing.

If a queue has more than one process, then the possibility for FIFO or RR is
available, with FIFO becoming a "complete current process" before it ever
looks at the other processes at the same priority.

I think the VMS useage was to have the first 16 bits in the vector for
realtime kernel processes, the second 16 for realtime user mode processes, the
next 32 were timesharing user processes, with various priorities.

Now as to strict realtime ordering, yes and no. This is because some things
MUST be done - in our case range computations had to be completed before the
next range came in. This put it at the highest fixed priority, NO
interference from other tasks.

Distance/bearing/location came next (if a new range was not available, then
dead reckon the location) and was strictly scheduled on time.

These two tasks were fixed priority and were the only processes in their
corresponding priority queue.

A lower level task called for data recording when a certain distance had
been covered. This would be FIFO with a display update process. (we ended
up with a ship representation display with known local obsticals.. oil rigs,
entered base line markers, wrecks, sandbars...)

Lower level tasks handled command input and was scheduled RR with a
data calculator functions (geodesic distance/time computation), navigation
parameter initialization ...

The idle task ended up handling a location plotter (spin on flag interface,
no interrupt available... stupid thing was too fast for an interrupt, but
required about 100 iterations per output byte).

Inter-process communication was done with SHORT messages. The largest was
the data recording snapshot -between 128 and 150 bytes long); the shortest
was about 32 bytes - latitude/longitude/bering/distance. The kernel
suspended scheduling during the message copy.

Along with this was the usual device drivers (tape/printer/keyboard/serial)
for communicating with external customer devices. There were also special
drivers for controlling the range input device, early GPS recievers,
LORAN-C recievers,...

This is in part a description in favor of your multi queue scheduler.

Complexity is relative - usually the hard part is deciding whether all
processes must exist in the queue at all times -idle/ready/sleeping/...
or if the queues only contain runnable processes.

If (and this may depend on hardware) it is easy to insert/remove from
queues, then having the process out of the queues when idle will speed up
the selection process. Having only active processes also makes it feasable
to use the bit vector.

I believe we opted to leave all process in the queues to speed up the
state change procedures. The selection process was always at the end of
a processing cyle (either clock interrupt, or the process did an I/O)
so the overhead was always measured relative to the previous running
process.

I always liked the trick of converting the bit vector to a floating point to
use the exponent to determine the active queue - it took far fewer instructions
than a loop to check each queue in an array. The added overhead when doing
the queue insertion became one instruction. It does require that only active
processes be in the queue, though. Otherwise you have to have a queue scan
since the queue MAY have all processes idle, even though the bit is set (have
to clear it and start the queue selection over - bummer. (a "find first set
bit" instruction is really usefull here).

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]

Any opinions expressed are solely my own.

2001-11-16 23:47:37

by Mike Kravetz

[permalink] [raw]
Subject: Re: Real Time Runqueue

On Fri, Nov 16, 2001 at 02:44:30PM -0800, Davide Libenzi wrote:
>
> I do not use a separate queue coz, if it's single, it becomes a common
> lock for all CPUs.
> RT tasks are scheduled as usual and the only problem arises in
> reschedule_idle() when an RT task is pushed onto the run queue when
> 1) on its CPU it is _not_ running the idle
> 2) on its CPU is running another RT task with higher priority
>
> In that case a "good CPU" discovery loop is triggered, the task is moved
> on that CPU runqueue, need_resched is set, an IPI is sent and on return
> from the remote CPU IPI path the RT task is run.
> A good solution would be ( i'm not doing it now ), in setscheduler() to
> move the task in a way to have an even distribution of RT tasks among
> CPUs.
>
> - Davide

Davide,

Suppose you have a 2 CPU system with 4 runnable tasks. 3 of these
tasks are realtime with the same realtime priority and the other is
an ordinary SCHED_OTHER task. The task distribution on the runqueues
looks something like this.

CPU 0 CPU 1
--------- ---------
RT Task A RT Task B
Other Task C RT Task D

Task A and Task B are currently running on the 2 CPUs. Now, Task A
voluntarily gives up CPU 0 and Task B is still running on CPU 1.
At this point, Task D should be chosen to run on CPU 0. Correct?
Isn't this a required RT semantic? I'm curious how you plan on
accomplishing this.

Regards,
--
Mike

2001-11-17 00:05:09

by Davide Libenzi

[permalink] [raw]
Subject: Re: Real Time Runqueue

On Fri, 16 Nov 2001, Jesse Pollard wrote:

> I believe we opted to leave all process in the queues to speed up the
> state change procedures. The selection process was always at the end of

I believe in that too, expecially if a good RT tasks distribution is
achieved.



> I always liked the trick of converting the bit vector to a floating point to
> use the exponent to determine the active queue - it took far fewer instructions
> than a loop to check each queue in an array. The added overhead when doing

Smart trick but by doing it in Linux needs you to get the fpu status
before doing the fp op otherwise you're going to do the fpu status
save/restore each time.




- Davide


2001-11-17 00:19:39

by Davide Libenzi

[permalink] [raw]
Subject: Re: Real Time Runqueue

On Fri, 16 Nov 2001, Mike Kravetz wrote:

> Suppose you have a 2 CPU system with 4 runnable tasks. 3 of these
> tasks are realtime with the same realtime priority and the other is
> an ordinary SCHED_OTHER task. The task distribution on the runqueues
> looks something like this.
>
> CPU 0 CPU 1
> --------- ---------
> RT Task A RT Task B
> Other Task C RT Task D
>
> Task A and Task B are currently running on the 2 CPUs. Now, Task A
> voluntarily gives up CPU 0 and Task B is still running on CPU 1.
> At this point, Task D should be chosen to run on CPU 0. Correct?
> Isn't this a required RT semantic? I'm curious how you plan on
> accomplishing this.

Well I don't know how RT sematics apply to SMP systems.
The easy solution ( == big common lock ) would be to have a single RT
queue that is checked before the private one.
Anyway, sometime it happens that the cure is worst than the disease and to
solve a corner case you're going to punish common case performances (
Linux is not an RT OS even with that fix ).




- Davide


2001-11-17 00:32:53

by Mike Kravetz

[permalink] [raw]
Subject: Re: Real Time Runqueue

On Fri, Nov 16, 2001 at 04:28:33PM -0800, Davide Libenzi wrote:
> On Fri, 16 Nov 2001, Mike Kravetz wrote:
>
> > Suppose you have a 2 CPU system with 4 runnable tasks. 3 of these
> > tasks are realtime with the same realtime priority and the other is
> > an ordinary SCHED_OTHER task. The task distribution on the runqueues
> > looks something like this.
> >
> > CPU 0 CPU 1
> > --------- ---------
> > RT Task A RT Task B
> > Other Task C RT Task D
> >
> > Task A and Task B are currently running on the 2 CPUs. Now, Task A
> > voluntarily gives up CPU 0 and Task B is still running on CPU 1.
> > At this point, Task D should be chosen to run on CPU 0. Correct?
> > Isn't this a required RT semantic? I'm curious how you plan on
> > accomplishing this.
>
> Well I don't know how RT sematics apply to SMP systems.

Me either (not exactly).

> The easy solution ( == big common lock ) would be to have a single RT
> queue that is checked before the private one.
> Anyway, sometime it happens that the cure is worst than the disease and to
> solve a corner case you're going to punish common case performances (
> Linux is not an RT OS even with that fix ).

The reason I ask is that we went through the pains of a separate
realtime RQ in our MQ scheduler. And yes, it does hurt the common
case, not to mention the extra/complex code paths. I was hoping
that someone in the know could enlighten us as to how RT semantics
apply to SMP systems. If the semantics I suggest above are required,
then it implies support must be added to any possible future
scheduler implementations.

--
Mike

2001-11-17 08:09:05

by George Anzinger

[permalink] [raw]
Subject: Re: [Lse-tech] Re: Real Time Runqueue

Mike Kravetz wrote:
>
> On Fri, Nov 16, 2001 at 04:28:33PM -0800, Davide Libenzi wrote:
> > On Fri, 16 Nov 2001, Mike Kravetz wrote:
> >
> > > Suppose you have a 2 CPU system with 4 runnable tasks. 3 of these
> > > tasks are realtime with the same realtime priority and the other is
> > > an ordinary SCHED_OTHER task. The task distribution on the runqueues
> > > looks something like this.
> > >
> > > CPU 0 CPU 1
> > > --------- ---------
> > > RT Task A RT Task B
> > > Other Task C RT Task D
> > >
> > > Task A and Task B are currently running on the 2 CPUs. Now, Task A
> > > voluntarily gives up CPU 0 and Task B is still running on CPU 1.
> > > At this point, Task D should be chosen to run on CPU 0. Correct?
> > > Isn't this a required RT semantic? I'm curious how you plan on
> > > accomplishing this.
> >
> > Well I don't know how RT sematics apply to SMP systems.
>
> Me either (not exactly).
>
> > The easy solution ( == big common lock ) would be to have a single RT
> > queue that is checked before the private one.
> > Anyway, sometime it happens that the cure is worst than the disease and to
> > solve a corner case you're going to punish common case performances (
> > Linux is not an RT OS even with that fix ).
>
> The reason I ask is that we went through the pains of a separate
> realtime RQ in our MQ scheduler. And yes, it does hurt the common
> case, not to mention the extra/complex code paths. I was hoping
> that someone in the know could enlighten us as to how RT semantics
> apply to SMP systems. If the semantics I suggest above are required,
> then it implies support must be added to any possible future
> scheduler implementations.
>
For what its worth here is my $.02 worth. A strict following of the
standard seems to me to mean that in a N cpu system with M>N ready real
time tasks, the N highest priority tasks should be allocated CPUs.
Remember that, for real time tasks, it is MORE important to run the
highest priority tasks than to worry about thru put or performance
(beyond getting to the highest priority task(s) ASAP). I assume that
this is what a user is saying when he declares a task "real time".

That said, the pragmatic view is that the customer is always right.
After all, he paid for the machine. We have customers who want
decidedly non "S" SMP. There are some who want all real time tasks to
run on one cpu and all other tasks to run on another. There are also
some who want a group of tasks (real time and otherwise) to always run
on a particular cpu while other tasks (again real time and otherwise)
run on other cpus.

What I plan to do is to have a non-affined real-time ready list and an
affined list. I really think all the lists should be priority ordered
with one per priority so when I say a non-affined ready list I mean one
for each priority (see the rtsched on source forge, for example of a
list per priority (URL in my signature)). I am not yet sure if the
affined list should have the option of being shared with more than one
cpu, but as you note, there are performance problems with multiple lists
so I only plan to have the cpu check at most two lists, the non-affined,
and one affined list. With this scheme it is possible to have a case
where a ready real time task will not get an idle cpu, however, this
should only happen if the task has affined it self to some other
cpu(s). By using priority order lists or a list per priority the
overhead and lock contention can, IMHO, be managed.

A couple of addition points. There should be an API to declare cpu
affinity, and cpu affinity should be passed thru fork and exec. You may
need to be god (root) to use the API however. (I think cpu affinity is
currently passed thru fork and exec, by the way.)
--
George [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/

2001-11-17 08:41:58

by George Anzinger

[permalink] [raw]
Subject: Re: [Lse-tech] Re: Real Time Runqueue

Jesse Pollard wrote:
>
> Mike Kravetz <[email protected]>:
> >
> > As you may know, a few of us are experimenting with multi-runqueue
> > scheduler implementations. One area of concern is where to place
> > realtime tasks. It has been my assumption, that POSIX RT semantics
> > require a specific ordering of tasks such as SCHED_FIFO and SCHED_RR.
> > To accommodate this ordering, I further believe that the simplest
> > solution is to ensure that all realtime tasks reside on the same
> > runqueue. In our MQ scheduler we have a separate runqueue for all
> > realtime tasks. The problem is that maintaining a separate realtime
> > runqueue is a pain and results in some fairly complex/ugly code.
> >
> > Since I'm not a realtime expert, I would like to ask if my assumption
> > about strict ordering of RT tasks is accurate. Also, is anyone aware
> > of other ways to approach this problem?
>
> I used to do real-time (seismic survey navigation - sea, land and aircraft
> based systems). I've always admired some of the approaches used by the old
> VAX system (we did an adaptation for PDP-11/73 systems).
>
> The operation provided a mixed environment of RR and fixed priority operation.
> The core scheduler is based on a bit vector of no larger than 64 fixed priority
> queues. Each queue could then be handled in a FIFO or RR manner. Selection
> of the queue was done by a "first bit set" selection. This identified the
> queue that the process was to be selected. Each queue had a selection fuction
> that could implement any choses scheduling algorithm, but we only used FIFO
> and RR. Several properties were required:
>
> 1. Only runnable processes are permitted to exist in the queues.
> 2. An empty queue had the corresponding bit value of zero.
> 3. Any queue with pending processes had the corresponding bit set to 1.
>
> Our adaption took the bit vector, converted it to floating point, and
> subtracted the exponent bias from the exponent. This gave us the "first bit
> set" in the vector. This index can then be used to select the queue and
> the selection algorithim. The return value is always the process to run.
> If the current process matches the original value, then return to the
> already loaded context; otherwise a context swich was called for. Also note
> that the current context contained the queue identifier. This makes it
> simple to save the current context. Of course, if the vector were zero then
> the idle task was invoked.

Take a look at http://sourceforge.net/projects/rtsched/ to see a linux
scheduler that uses very much this same thing. No floating point, but
there are find first bit instructions on most machines.


>
~snip
--
George [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/

2001-11-19 15:22:54

by Victor Yodaiken

[permalink] [raw]
Subject: Re: Real Time Runqueue

On Fri, Nov 16, 2001 at 04:32:24PM -0800, Mike Kravetz wrote:
> The reason I ask is that we went through the pains of a separate
> realtime RQ in our MQ scheduler. And yes, it does hurt the common
> case, not to mention the extra/complex code paths. I was hoping
> that someone in the know could enlighten us as to how RT semantics
> apply to SMP systems. If the semantics I suggest above are required,
> then it implies support must be added to any possible future
> scheduler implementations.

POSIX RT specs, at least last year, did not mention any SMP
requirements for scheduling at all.
What we do in RTLinux is require that RT threads be associated with a
processor identifier on the theory that the user may have some idea what
processors should run which RT threads, but the OS has no way of
guessing.




>
> --
> Mike
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2001-11-19 16:30:52

by Andi Kleen

[permalink] [raw]
Subject: Re: [Lse-tech] Re: Real Time Runqueue

On Fri, Nov 16, 2001 at 04:32:24PM -0800, Mike Kravetz wrote:
> The reason I ask is that we went through the pains of a separate
> realtime RQ in our MQ scheduler. And yes, it does hurt the common
> case, not to mention the extra/complex code paths. I was hoping
> that someone in the know could enlighten us as to how RT semantics
> apply to SMP systems. If the semantics I suggest above are required,
> then it implies support must be added to any possible future
> scheduler implementations.

It seems a lot of applications/APIs do not care about global RT semantics,
but about RT semantics for groups of threads or processes (e.g. java
or ada applications). Linux currently simulates this only for root
and with a global runqueue. I don't think it makes too much sense to have
an global rt queue on a multi processor system, but there should be some
way to define "scheduling groups" where rt semantics are followed inside.
Such a scheduling group could be a clone flag or default to CLONE_VM for
example for compatibility. A scheduling group would also make it possible
to support simple rt semantics for thread groups as non root. Then one
could run a rt queue per scheduling group, and simulate global rt run queue
or per cpu rt run queue as needed by appropiate setup.

Comments?

-Andi

2001-11-19 16:57:34

by Davide Libenzi

[permalink] [raw]
Subject: Re: [Lse-tech] Re: Real Time Runqueue

On Mon, 19 Nov 2001, Andi Kleen wrote:

> On Fri, Nov 16, 2001 at 04:32:24PM -0800, Mike Kravetz wrote:
> > The reason I ask is that we went through the pains of a separate
> > realtime RQ in our MQ scheduler. And yes, it does hurt the common
> > case, not to mention the extra/complex code paths. I was hoping
> > that someone in the know could enlighten us as to how RT semantics
> > apply to SMP systems. If the semantics I suggest above are required,
> > then it implies support must be added to any possible future
> > scheduler implementations.
>
> It seems a lot of applications/APIs do not care about global RT semantics,
> but about RT semantics for groups of threads or processes (e.g. java
> or ada applications). Linux currently simulates this only for root
> and with a global runqueue. I don't think it makes too much sense to have
> an global rt queue on a multi processor system, but there should be some
> way to define "scheduling groups" where rt semantics are followed inside.
> Such a scheduling group could be a clone flag or default to CLONE_VM for
> example for compatibility. A scheduling group would also make it possible
> to support simple rt semantics for thread groups as non root. Then one
> could run a rt queue per scheduling group, and simulate global rt run queue
> or per cpu rt run queue as needed by appropiate setup.

What i'm currently trying to achieve is to have two kind of rt tasks,
local and global ones.
Local rt tasks are stored inside the local CPU run queue and compete only
with the local tasks.
Global rt tasks have a global runqueue that is fast-checked w/out
held locks inside the schedule() :

if (!list_empty(&runqueue_head(RT_QID)))
goto rt_queue_select;
rt_queue_select_back:

So a local rt tasks can _not_ preempt a task on another CPU while a global
one yes.




- Davide


2001-11-19 18:24:19

by Richard Gooch

[permalink] [raw]
Subject: Re: [Lse-tech] Re: Real Time Runqueue

Andi Kleen writes:
> On Fri, Nov 16, 2001 at 04:32:24PM -0800, Mike Kravetz wrote:
> > The reason I ask is that we went through the pains of a separate
> > realtime RQ in our MQ scheduler. And yes, it does hurt the common
> > case, not to mention the extra/complex code paths. I was hoping
> > that someone in the know could enlighten us as to how RT semantics
> > apply to SMP systems. If the semantics I suggest above are required,
> > then it implies support must be added to any possible future
> > scheduler implementations.
>
> It seems a lot of applications/APIs do not care about global RT
> semantics, but about RT semantics for groups of threads or processes
> (e.g. java or ada applications). Linux currently simulates this only
> for root and with a global runqueue. I don't think it makes too much
> sense to have an global rt queue on a multi processor system, but
> there should be some way to define "scheduling groups" where rt
> semantics are followed inside. Such a scheduling group could be a
> clone flag or default to CLONE_VM for example for compatibility. A
> scheduling group would also make it possible to support simple rt
> semantics for thread groups as non root. Then one could run a rt
> queue per scheduling group, and simulate global rt run queue or per
> cpu rt run queue as needed by appropiate setup.

We have to continue providing global RT semantics. However, a
non-privileged scheduling class which gives RT-like behaviour within a
scheduling group would be *great*! I've wished for such a facility
myself.

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-11-19 18:33:27

by George Anzinger

[permalink] [raw]
Subject: Re: [Lse-tech] Re: Real Time Runqueue

Andi Kleen wrote:
>
> On Fri, Nov 16, 2001 at 04:32:24PM -0800, Mike Kravetz wrote:
> > The reason I ask is that we went through the pains of a separate
> > realtime RQ in our MQ scheduler. And yes, it does hurt the common
> > case, not to mention the extra/complex code paths. I was hoping
> > that someone in the know could enlighten us as to how RT semantics
> > apply to SMP systems. If the semantics I suggest above are required,
> > then it implies support must be added to any possible future
> > scheduler implementations.
>
> It seems a lot of applications/APIs do not care about global RT semantics,
> but about RT semantics for groups of threads or processes (e.g. java
> or ada applications). Linux currently simulates this only for root
> and with a global runqueue.

Why do you say only root? Since the schedule type and priority are
inherited one only needs to be root to set the progenitors real time
priority. Also, a programs priority/ schedule type can be set by
another (root) program without the target program being root. I
routinely set inetd to real time, for example, to let telnet sessions
run at real time.

> I don't think it makes too much sense to have
> an global rt queue on a multi processor system, but there should be some
> way to define "scheduling groups" where rt semantics are followed inside.

Still, the customer is king.

> Such a scheduling group could be a clone flag or default to CLONE_VM for
> example for compatibility. A scheduling group would also make it possible
> to support simple rt semantics for thread groups as non root. Then one
> could run a rt queue per scheduling group, and simulate global rt run queue
> or per cpu rt run queue as needed by appropiate setup.

My first thought is that this is fairly high overhead to put in the
schedule path. May be if I knew more....


--
George [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/

2001-11-19 18:40:27

by Andi Kleen

[permalink] [raw]
Subject: Re: [Lse-tech] Re: Real Time Runqueue

On Mon, Nov 19, 2001 at 10:32:23AM -0800, george anzinger wrote:
> Andi Kleen wrote:
> >
> > On Fri, Nov 16, 2001 at 04:32:24PM -0800, Mike Kravetz wrote:
> > > The reason I ask is that we went through the pains of a separate
> > > realtime RQ in our MQ scheduler. And yes, it does hurt the common
> > > case, not to mention the extra/complex code paths. I was hoping
> > > that someone in the know could enlighten us as to how RT semantics
> > > apply to SMP systems. If the semantics I suggest above are required,
> > > then it implies support must be added to any possible future
> > > scheduler implementations.
> >
> > It seems a lot of applications/APIs do not care about global RT semantics,
> > but about RT semantics for groups of threads or processes (e.g. java
> > or ada applications). Linux currently simulates this only for root
> > and with a global runqueue.
>
> Why do you say only root? Since the schedule type and priority are

You effectively need to be root to set up the priorities at least once.
setuid programs are a big pain to use and a lot of people don't want them.

Also it is useful to change priorities on the fly, e.g. to avoid
priority inheritance problems.

> > I don't think it makes too much sense to have
> > an global rt queue on a multi processor system, but there should be some
> > way to define "scheduling groups" where rt semantics are followed inside.
>
> Still, the customer is king.

Hmm?

>
> > Such a scheduling group could be a clone flag or default to CLONE_VM for
> > example for compatibility. A scheduling group would also make it possible
> > to support simple rt semantics for thread groups as non root. Then one
> > could run a rt queue per scheduling group, and simulate global rt run queue
> > or per cpu rt run queue as needed by appropiate setup.
>
> My first thought is that this is fairly high overhead to put in the
> schedule path. May be if I knew more....

The only overhead as far as I can see would be a few more pointer
references for RT processes (task->runqueue-> ...). I guess with
some care it can be even kept out of the non RT fast path.

-Andi

2001-11-19 19:09:37

by Mike Kravetz

[permalink] [raw]
Subject: Re: [Lse-tech] Re: Real Time Runqueue

On Mon, Nov 19, 2001 at 11:23:23AM -0700, Richard Gooch wrote:
>
> We have to continue providing global RT semantics. However, a
> non-privileged scheduling class which gives RT-like behaviour within a
> scheduling group would be *great*! I've wished for such a facility
> myself.
>

If we go to the trouble of supporting scheduling groups, then it would
seem that one could define a 'global RT' group for any required global
semantics. The initial development costs (scheduling groups) may be
high, but you would get global RT semantics for free. I also believe
that we may be able to keep the overhead out of scheduling decisions for
tasks not in the groups. My only concern would be with groups that
span multiple CPUs (such as the global one). If the scheduler continues
to use a single lock (which I don't advocate), this is not much of an
issue. However, things can get quite complicated with multiple (per-CPU)
locks and the possibly required per-group synchronization items.

--
Mike

2001-11-19 20:28:30

by Matthew Dobson

[permalink] [raw]
Subject: Re: [Lse-tech] Re: Real Time Runqueue

george anzinger wrote:
> A couple of addition points. There should be an API to declare cpu
> affinity, and cpu affinity should be passed thru fork and exec. You may
> need to be god (root) to use the API however. (I think cpu affinity is
> currently passed thru fork and exec, by the way.)

There is... there is a launch_policy patch on
http://sourceforge.net/projects/lse, that offers 2 ways of setting a processes
CPU affinity. A process can muck around with its own affinity through prctl(),
and root can muck around with the CPU affinity of any process through
/proc/<pid>/cpus_allowed and /proc/<pid>/launch_policy. And the launch_policy
*is* inherited from parent to child.

The cpus_allowed part comes from Andrew Morton and controls the processes
affinity. I've extended that to include the launch_policy part which sets the
affinity to be passed on through fork/exec.

Enjoy!

-matt