the attached patch adds a feature that was pretty high on the scheduler
features wishlist: it implements the functionality of SCHED_IDLE, in a
safe way. Another desired scheduler feature was batch scheduling, the
cache-friendly handling of lowprio, batch-like, CPU-bound, 100%
noninteractive tasks. The new SCHED_BATCH scheduler policy implements both
features.
the existing SCHED_IDLE patches floating around, despite their simplicity,
had one major flaw that prevented their integration into the scheduler: if
an unpriviledged SCHED_IDLE process uses normal kernel functionality,
which happens to grab a critical kernel resource such as the root
directory's semaphore, and schedules away still holding the semaphore,
then there is no guarantee that the task will run again in any
deterministic amount of time - keeping the critical resource potentially
forever - deadlocking every other process that attempts to use that
critical resource. This property, while being a source for soft lockups
even during ordinary use, also makes SCHED_IDLE an easy DoS exploit.
as the size of the patch suggests, the safe solution is not simple. The
basic concept is the identification of user-space preemption via a special
scheduler upcall: one safe point to delay a task's execution indefinitely
is when the task is preempted in pure user-space mode - if this happens
then the lowlevel kernel entry code calls the schedule_userspace()
function, instead of schedule(). In every other case the task needs to
stay in the 'normal' scheduler queues, to guarantee prompt processing of
kernelspace code. Furthermore, such batch-mode tasks need to be scheduled
if they get a signal delivered - otherwise it would not be possible to eg.
kill them.
other properties: SCHED_BATCH also triggers much longer, batch-like
timeslices - the default SCHED_BATCH timeslice is 1.5 seconds. Nice values
still have a meaning for SCHED_BATCH processes as well - they determine
the relative percentage of idle CPU time allocated to SCHED_BATCH
processes. If the SCHED_BATCH process is in kernel-mode then the nice
value is used as the normal priority when preempting (or not preempting)
other, non-SCHED_BATCH processes.
put in another way: whenever a SCHED_BATCH process is in kernel-mode, it's
"elevated" into the SCHED_NORMAL priority domain - which guarantees timely
execution of kernel-space code. When the SCHED_BATCH process is executing
user-space code then it can be put into the batch-queue, and can be
delayed indefinitely.
Timeslice distribution is a modified/simplified version of SCHED_NORMAL
scheduling: SCHED_BATCH processes are scheduled in a roundrobin way,
timeslices are distributed based on the nice value. SCHED_BATCH tasks that
use up their timeslices get suspended until all other SCHED_BATCH tasks on
that CPU exhaust their timeslices - at which point a new turn begins.
SCHED_NORMAL, SCHED_RR and SCHED_FIFO tasks preempt SCHED_BATCH processes
immediately. All this functionality is implemented in an O(1) way. (The
interactivity estimator is active for SCHED_BATCH processes as well - this
has an effect if the task is in kernelspace mode. This also makes sure
that no artificial priority boost can be achieved by switching in/out of
SCHED_BATCH mode.)
on SMP there are per-CPU batch queues - which enables the use of hundreds
or thousands of SCHED_BATCH processes, if desired. A new, independent
load-balancer is used to distribute SCHED_BATCH processes: SCHED_BATCH
processes will populate CPUs depending on the CPU's "10 seconds history of
idleness". The more idle a CPU, the more SCHED_BATCH processes it will
handle. The weighting is done in a way to make the global distribution of
SCHED_BATCH timeslices fair. The load-balancer also honors caching
properties and tries to reduce unnecessery bouncing of SCHED_BATCH
processes. (The balancing, like in the SCHED_NORMAL case, is not intended
to be 100% 'sharp' - some statistical fuzziness is allowed to keep
overhead and complexity down.)
(to see the SMP SCHED_BATCH load-balancer in action, start up multiple
SCHED_BATCH processes on an SMP box - they populate all available CPUs
evenly. Then start up a single CPU-intensive, non-SCHED_BATCH process -
after a few seconds all SCHED_BATCH processes will migrate off to the
remaining CPUs, and the SCHED_NORMAL task will get 100% CPU time of a
single CPU.)
(design sidenote: initially i tried to integrate SCHED_BATCH scheduling
into the existing scheduler and SCHED_NORMAL balancer somehow, but gave up
on this idea. While that worked for RT scheduling, SCHED_BATCH scheduling
is quite different, and is 100% orthogonal to all the other scheduling
domains. Eg. the balancing of non-SCHED_BATCH processes *must not* be
influenced by the way SCHED_BATCH processes are distributed amongst CPUs.
The distribution of timeslices must be completely separated as well. So
since all the queues and state has to be separate, they can as well be in
separate (and simplified) data structures.)
i've also attached setbatch.c, which is a simple utility to change a given
PID's scheduling policy to SCHED_BATCH. One straightforward way of using
it is to change one shell to be SCHED_BATCH:
./setbatch $$
and start various commands from this SCHED_BATCH shell - all forked
children inherit the SCHED_BATCH setting.
the load generated by multiple SCHED_BATCH processes does not show up in
the load average - this is the straightforward solution to not confuse
load-average-sensitive applications such as sendmail.
the runtime performance impact of SCHED_BATCH is fairly minimal. There is
a (pretty light) branch and function call cost in the entry.S preemption
codepath. Otherwise the SCHED_BATCH code triggers in slowpaths only: eg.
when we would otherwise switch to the idle thread.
the patch was tested on x86 systems. non-x86 systems should still work
with the patch applied, but no SCHED_BATCH process will in fact be
suspended. For batch-suspension to work the architecture needs to call
schedule_userspace() instead of schedule(), when pure userspace code is
preempted.
the attached patch is against 2.5.24, it was tested on SMP and UP systems
as well, but keep in mind that this is the first version of this patch, so
some rough edges might be present. The patch can also be downloaded from
my scheduler patches homepage:
http://redhat.com/~mingo/O(1)-scheduler/batch-sched-2.5.24-A0
bug reports, success reports, comments, suggestions are welcome,
Ingo
On Sun, 2002-06-30 at 17:26, Ingo Molnar wrote:
> -#define SCHED_OTHER 0
> +#define SCHED_NORMAL 0
>From IEEE 1003.1-2001 / Open Group Base Spec. Issue 6:
"Conforming implementations shall include one scheduling policy
identified as SCHED_OTHER (which may execute identically with either the
FIFO or round robin scheduling policy)."
So, you probably want to add a "#define SCHED_OTHER SCHED_NORMAL" here
in order to prevent future confusion, especially because the user-space
headers have SCHED_OTHER, but no SCHED_NORMAL.
- Nicholas
Nicholas Miell <[email protected]> writes:
> On Sun, 2002-06-30 at 17:26, Ingo Molnar wrote:
>
>> -#define SCHED_OTHER 0
>> +#define SCHED_NORMAL 0
>
>>From IEEE 1003.1-2001 / Open Group Base Spec. Issue 6:
> "Conforming implementations shall include one scheduling policy
> identified as SCHED_OTHER (which may execute identically with either the
> FIFO or round robin scheduling policy)."
>
> So, you probably want to add a "#define SCHED_OTHER SCHED_NORMAL" here
> in order to prevent future confusion, especially because the user-space
> headers have SCHED_OTHER, but no SCHED_NORMAL.
This can be done in glibc. linux/sched.h should not be used by
userspace applications, glibc has the define in <bits/sched.h> which
is included from <sched.h> - and <sched.h> is the file defined by
Posix.
Andreas
--
Andreas Jaeger
SuSE Labs [email protected]
private [email protected]
http://www.suse.de/~aj
On Mon, 1 Jul 2002, Andreas Jaeger wrote:
> >> -#define SCHED_OTHER 0
> >> +#define SCHED_NORMAL 0
> >
> >>From IEEE 1003.1-2001 / Open Group Base Spec. Issue 6:
> > "Conforming implementations shall include one scheduling policy
> > identified as SCHED_OTHER (which may execute identically with either the
> > FIFO or round robin scheduling policy)."
> >
> > So, you probably want to add a "#define SCHED_OTHER SCHED_NORMAL" here
> > in order to prevent future confusion, especially because the user-space
> > headers have SCHED_OTHER, but no SCHED_NORMAL.
>
> This can be done in glibc. linux/sched.h should not be used by
> userspace applications, glibc has the define in <bits/sched.h> which is
> included from <sched.h> - and <sched.h> is the file defined by Posix.
yes, this was my thinking too.
the reason for the change: with the introduction of SCHED_BATCH the
regular scheduling policy cannot really be called 'other' anymore, from
the point of scheduler internals - it's in the middle of all scheduler
policies, its only speciality is that it's the default one.
(obviously for the user interface it has to be defined.)
Ingo
On Mon, 2002-07-01 at 01:00, Ingo Molnar wrote:
>
> On Mon, 1 Jul 2002, Andreas Jaeger wrote:
> > This can be done in glibc. linux/sched.h should not be used by
> > userspace applications, glibc has the define in <bits/sched.h> which is
> > included from <sched.h> - and <sched.h> is the file defined by Posix.
>
> yes, this was my thinking too.
OK, I can understand that line of reasoning.
Keep in mind that someday, someone who is looking for the implementation
of the SCHED_OTHER policy will be thoroughly confused by the kernel's
complete lack of reference to SCHED_OTHER. And they'll be asking you for
clarification.
Or, you could make some note in the source that SCHED_OTHER is
SCHED_NORMAL and eliminate any source of confusion now.
- Nicholas
On 1 Jul 2002, Nicholas Miell wrote:
> > > This can be done in glibc. linux/sched.h should not be used by
> > > userspace applications, glibc has the define in <bits/sched.h> which is
> > > included from <sched.h> - and <sched.h> is the file defined by Posix.
> >
> > yes, this was my thinking too.
>
> OK, I can understand that line of reasoning.
>
> Keep in mind that someday, someone who is looking for the implementation
> of the SCHED_OTHER policy will be thoroughly confused by the kernel's
> complete lack of reference to SCHED_OTHER. And they'll be asking you for
> clarification.
>
> Or, you could make some note in the source that SCHED_OTHER is
> SCHED_NORMAL and eliminate any source of confusion now.
okay, i did this.
and, to not confuse existing code that still happens to use the kernel
headers, i added the #define to the 2.4.19-pre10-ac2 backport of the O(1)
scheduler.
Ingo
On Mon, Jul 01, 2002 at 02:26:42AM +0200, Ingo Molnar wrote:
> the load generated by multiple SCHED_BATCH processes does not show up in
> the load average - this is the straightforward solution to not confuse
> load-average-sensitive applications such as sendmail.
I think this will confuse atd too, which is an obvious candidate
for the batch scheduler; it may end up starting all jobs which
sit in it's "batch" queue.
I think a load-average calculation scheme like this would be better:
oldload: is the load average calculated the old way
batchload: is the load average calculated only from the batch scheduler
numcpus: number of cpus...
newload(){
if (oldload > numcpus) return oldload;
if ((oldload+batchload) > numcpus) return numcpus;
return (oldload+batchload)
}
So the batch processes would show the CPUs maxed out, but would not show
up as overload in the load average. (and you could run
"atd -l <numcpus - 0.3>")
regards:
Gabor Vitez
On Thu, Jul 04, 2002 at 03:04:49PM +0200, Vitez Gabor wrote:
> On Mon, Jul 01, 2002 at 02:26:42AM +0200, Ingo Molnar wrote:
> > the load generated by multiple SCHED_BATCH processes does not show up in
> > the load average - this is the straightforward solution to not confuse
> > load-average-sensitive applications such as sendmail.
If SCHED_BATCH load does not affect the load average, then it will *not*
confuse sendmail - because the box is not really loaded, SCHED_BATCH
will only run if sendmail wouldn't :)
I think it's great that SCHED_BATCH doesn't show up in loadavg - it is
really confusing to look at load statistics for boxes which have these
CPU intensive batch jobs running - the load average is perhaps
constantly between 2.1 and 2.2, when the "real" load on the box would
cause it to be between .1 and .2
Idle-time is what you pay for but do not get ;) Having CPU hogs
running is a nice way of getting the last penny out of the investment,
but I have many boxes where I don't do it, simply because it would
render the load statistics useless.
>
> I think this will confuse atd too, which is an obvious candidate
> for the batch scheduler; it may end up starting all jobs which
> sit in it's "batch" queue.
True. I am inclined to saying that any batch system that doesn't keep
track of it's own jobs but only cares about the load average, is flawed.
Load average is "guidance", it is a heuristic, it is not something one
should use as the sole measure when deciding when/where to spawn jobs.
Stuff like that only works in theory...
But atd is in use everywhere, and it does use the load as the only
metric (which is, by the way, why it doesn't start more than one job
per minute, because the loadavg needs time to rise).
Wouldn't it be pretty simple to just make atd have a hard limit on how
many concurrent at jobs (eventually on a per-user basis) it would start?
Unless you really go to extremes (thousands of jobs), the performance of
ten "concurrently" running SCHED_BATCH jobs and the same ten jobs beeing
launched sequentially by atd, should be fairly similar - given the huge
time-slices given by the scheduler to these jobs.
>
> I think a load-average calculation scheme like this would be better:
>
> oldload: is the load average calculated the old way
> batchload: is the load average calculated only from the batch scheduler
> numcpus: number of cpus...
>
> newload(){
> if (oldload > numcpus) return oldload;
> if ((oldload+batchload) > numcpus) return numcpus;
> return (oldload+batchload)
> }
>
> So the batch processes would show the CPUs maxed out, but would not show
> up as overload in the load average. (and you could run
> "atd -l <numcpus - 0.3>")
>
Hmmm... Such a hack might work around the shortcomings in the atd
scheduling algorithm.
I don't like it. It adds yet another level of obfuscation to loadavg.
I bet you can do the proper changes to atd in the same amount of code,
keeping the kernel clean and fixing the problem where it really is.
Just my 0.02 Euro,
--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob ?stergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:
In article <[email protected]> you wrote:
> Idle-time is what you pay for but do not get ;) Having CPU hogs
> running is a nice way of getting the last penny out of the investment,
> but I have many boxes where I don't do it, simply because it would
> render the load statistics useless.
And of course it needs a bit more power and produces much more temperature
(and therefore noise).
BTW: even an batch prio will load the system more than an idle process. The
cache is dirty, the disk has a longer work queue, RSS is used up by that
programs. So for a slow system you have to expect impact.
Greetings
Bernd