this is my latest update of the O(1) scheduler:
http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-pre11-H1.patch
(i'll release the -H1 2.4 patch later today.)
this patch does two things:
First it cleans up the load balancer's interaction with the timer tick.
There are now two functions called from the timer tick: busy_cpu_tick()
and idle_cpu_tick(). It's completely up to the scheduler to use them
appropriately. There are IDLE_REBALANCE_TICK and BUSY_REBALANCE_TICK. We do
a busy rebalance every 250 msecs, and we do an idle rebalance on idle CPUs
every 1 msec (or 10 msec on HZ=100 systems). The point is to not leave
CPUs idle for a long time, but still not take tasks from well-affinized
runqueues at too high a frequency.
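in sketch form the intent is something like this (illustrative only, the
macro values and helpers here are shorthand for this mail, not the exact
code from the patch):

#include <linux/sched.h>	/* jiffies, HZ */

/* illustrative: ~1 msec (one tick on HZ=100) when idle, ~250 msecs when busy */
#define IDLE_REBALANCE_TICK	(HZ / 1000 ? HZ / 1000 : 1)
#define BUSY_REBALANCE_TICK	(HZ / 4)

/* hypothetical shape of the two per-tick hooks described above */
static void idle_cpu_tick(void)
{
	if (!(jiffies % IDLE_REBALANCE_TICK))
		;	/* pull a task over from the busiest runqueue */
}

static void busy_cpu_tick(void)
{
	if (!(jiffies % BUSY_REBALANCE_TICK))
		;	/* do the slow fairness rebalance */
}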
the other change is to use the 'interactivity' output of the load
estimator more aggressively. This results in a usable X while more extreme
compilation jobs like 'make -j40 bzImage' are running. In fact i'm writing
this email while an infinite loop of 'make -j40' kernel compilations is
running (default priority, not reniced) on a dual-CPU box - and i had
completely forgotten about it until i tried to recompile the -H1 tree :-|
Changes in -H1 relative to 2.5.2-pre11:
- Rusty Russell: get the alignment of the runqueues right and reduce array
indexing overhead.
- Kevin O'Connor, Robert Love: fix a locking order bug in set_cpus_allowed()
  which can cause boot-time lockups on SMP systems.
- Rusty Russell: fix rebalance tick definition if HZ < 100 in UML.
- Rusty Russell: check for new_mask in set_cpus_allowed(), to be sure.
- Rusty Russell: clean up rq_ macros so that HT can be done by changing
just one of the macros.
- Rusty Russell: remove rq->cpu.
- Rusty Russell: remove cacheline padding from runqueue_t, it's pointless
now.
- Rusty Russell: sched.c comment fixes.
- increase minimum timeslice length by 10 msec.
- fix comments in sched.h
- make rq->bitmap big-endian safe. (Anton Blanchard)
- documented and cleaned up the load estimator bits, no functional
changes apart from small speedups.
- do init_idle() before starting up the init thread; this removes a race
  where we'd run the init thread on CPU#0 before init_idle() had been
  called.
(please let me know if i dropped any patch from anyone along the way, or
forgot to credit anyone.)
Comments, bug reports, suggestions welcome,
Ingo
On Thu, 10 Jan 2002, Ingo Molnar wrote:
>
> First it cleans up the load balancer's interaction with the timer tick.
> There are now two functions called from the timer tick: busy_cpu_tick()
> and idle_cpu_tick(). It's completely up to the scheduler to use them
> appropriately.
This is _wrong_. The timer doesn't even know whether something is an idle
task or not.
Proof: kapmd (right now the scheduler doesn't know this either, but at
least we could teach it to know).
Don't try to make the timer code know stuff that the timer code should not
and does not know about. Just call the scheduler on each tick, and let the
scheduler make its decision.
Linus
On Thu, 2002-01-10 at 09:19, Ingo Molnar wrote:
> - Kevin O'Connor, Robert Love: fix a locking order bug in set_cpus_allowed()
>   which can cause boot-time lockups on SMP systems.
Along the same lines as the above, note this code snippet from
try_to_wake_up:
	lock_task_rq(rq, p, flags);
	p->state = TASK_RUNNING;
	if (!p->array) {
		if (0 && !rt_task(p) && synchronous && (smp_processor_id() < p->cpu)) {
			spin_lock(&this_rq()->lock);
			p->cpu = smp_processor_id();
			activate_task(p, this_rq());
			spin_unlock(&this_rq()->lock);
		} else { ... }
First, H1 added the 0 there ... ???
Second, prior to the change, I wonder whether the
(smp_processor_id() < p->cpu) test is inverted. We've already locked p's
runqueue above. So this branch, if taken, will lock the current
runqueue -- but it checks whether current's cpu id is _less_ than p's! Thus
we lock greater to lesser. Would a proper change be:
diff -urN linux-2.5.2-pre10-H1/kernel/sched.c linux/kernel/sched.c
--- linux-2.5.2-pre10-H1/kernel/sched.c	Thu Jan 10 14:56:06 2002
+++ linux/kernel/sched.c	Thu Jan 10 14:56:21 2002
@@ -383,7 +383,7 @@
 	lock_task_rq(rq, p, flags);
 	p->state = TASK_RUNNING;
 	if (!p->array) {
-		if (0 && !rt_task(p) && synchronous && (smp_processor_id() < p->cpu)) {
+		if (!rt_task(p) && synchronous && (smp_processor_id() > p->cpu)) {
 			spin_lock(&this_rq()->lock);
 			p->cpu = smp_processor_id();
 			activate_task(p, this_rq());
Note I removed the 0, which may or may not be your intention (do you want
that code branch there at all?). My point is swapping the
less-than sign for a greater-than, so that we now lock this_rq only if
its cpu id is _greater_ than p's.
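For reference, here is the ordering rule I have in mind, as a sketch (not
the actual scheduler code): whenever two runqueue locks must be held at
once, always take the lower one first, so two CPUs can never acquire the
pair in opposite orders and deadlock:

/* hypothetical sketch of the ordering discipline, not the real helper */
static void double_rq_lock_sketch(runqueue_t *rq1, runqueue_t *rq2)
{
	if (rq1 == rq2) {
		spin_lock(&rq1->lock);
	} else if (rq1 < rq2) {		/* lower address (lower CPU) first */
		spin_lock(&rq1->lock);
		spin_lock(&rq2->lock);
	} else {
		spin_lock(&rq2->lock);
		spin_lock(&rq1->lock);
	}
}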
Keep 'em coming ;)
Robert Love
Linus Torvalds wrote:
>
> On Thu, 10 Jan 2002, Ingo Molnar wrote:
> >
> > First it cleans up the load balancer's interaction with the timer tick.
> > There are now two functions called from the timer tick: busy_cpu_tick()
> > and idle_cpu_tick(). It's completely up to the scheduler to use them
> > appropriately.
>
> This is _wrong_. The timer doesn't even know whether something is an idle
> task or not.
>
> Proof: kapmd (right now the scheduler doesn't know this either, but at
> least we could teach it to know).
>
> Don't try to make the timer code know stuff that the timer code should not
> and does not know about. Just call the scheduler on each tick, and let the
> scheduler make its decision.
>
> Linus
Currently this code is called from the interrupt code in timer.c. At this
time the xtime (write) lock is held and interrupts are off.
It could also be called from the "bh" section of timer.c, at which time
interrupts are on and the xtime lock is not held.
In either case, the state of the interrupt system is known and the
irq_save version of the spin lock need not be used. (Hey, a cycle is a
cycle.)
But more to the point, could the call(s) be made from the "bh" section
and thus improve interrupt latency, not to mention reduce xtime lock
contention?
Also, if a 250 ms tick is needed, you might consider a timer (which is
also called from the "bh" code).
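For example (just a sketch using the 2.4-era timer API; none of these
names come from Ingo's patch), the slow 250 ms pass could be driven by a
self-rearming kernel timer:

#include <linux/timer.h>
#include <linux/sched.h>	/* jiffies, HZ */

static struct timer_list rebalance_timer;

static void rebalance_timer_fn(unsigned long data)
{
	/* ... do the slow fairness rebalance here, in "bh" context ... */
	mod_timer(&rebalance_timer, jiffies + HZ / 4);	/* re-arm: ~250 ms */
}

static void start_rebalance_timer(void)
{
	init_timer(&rebalance_timer);
	rebalance_timer.function = rebalance_timer_fn;
	rebalance_timer.data = 0;
	rebalance_timer.expires = jiffies + HZ / 4;
	add_timer(&rebalance_timer);
}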
--
George [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
On Thu, 10 Jan 2002, Linus Torvalds wrote:
> > First it cleans up the load balancer's interaction with the timer tick.
> > There are now two functions called from the timer tick: busy_cpu_tick()
> > and idle_cpu_tick(). It's completely up to the scheduler to use them
> > appropriately.
>
> This is _wrong_. The timer doesn't even know whether something is an idle
> task or not.
yes - thought of this after writing the mail.
> Proof: kapmd (right now the scheduler doesn't know this either, but at
> least we could teach it to know).
yes - pid == 0 is not the right information. I've fixed this in my tree.
Ingo
On 10 Jan 2002, Robert Love wrote:
> Along the same lines as the above, note this code snippet from
> try_to_wake_up:
> if (0 && !rt_task(p) && synchronous && (smp_processor_id() < p->cpu)) {
> + if (!rt_task(p) && synchronous && (smp_processor_id() > p->cpu)) {
you are right - but i have removed it completely from my current tree.
the reason is that i think, unless i see some hard proof to the contrary,
it's not a good idea to balance from wakeups. Wakeups are high-frequency
and lightweight in nature, and despite all the idle-balancing magic we
tried in 2.4 (i wrote most of that code), there were important cases where
it failed.
so the current strategy i'd like us to try is to do 'high-frequency idle
rebalancing' and 'low-frequency fairness rebalancing'. No rebalancing at
wakeup, at all. This makes wakeups simpler, faster and more scalable. (We
can also do slow rebalancing in some other, strategic places where we know
that it's the right time to push a process context to another CPU.)
Ingo
On Thu, 10 Jan 2002, george anzinger wrote:
> Currently this code is called from the interrupt code in timer. At
> this time the xtime(write) lock is held and interrupts are off.
yes, Linus is right - i've introduced a scheduler_tick() function which is
called with interrupts disabled - thus no __save_flags is needed.
Ingo
If I run 3 cpu-hog tasks on a 2 CPU system, then 1 task will get an
entire CPU while the other 2 tasks share the other CPU (easily verified
by a simple test program). On previous versions of the scheduler
'balancing' this load was achieved by the global nature of time slices.
No task was given a new time slice until the time slices of all runnable
tasks had expired. In the current scheduler, the decision to replenish
time slices is made at a local (per-CPU) level. I assume the load
balancing code should take care of the above workload? OR is this the
behavior we desire? We certainly have optimal cache use.
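(For reference, a minimal sketch of the kind of test program I mean -
three pure CPU hogs whose placement and CPU share can then be watched in
top; illustrative only:)

/* three forked CPU hogs; observe their placement and CPU% in top */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
	int i;

	for (i = 0; i < 3; i++) {
		if (fork() == 0)
			for (;;)
				;	/* burn CPU forever */
	}
	printf("started 3 hogs; watch them in top\n");
	for (i = 0; i < 3; i++)
		wait(NULL);
	return 0;
}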
--
Mike
On Thu, 10 Jan 2002, Linus Torvalds wrote:
> Don't try to make the timer code know stuff that the timer code should
> not and does not know about. Just call the scheduler on each tick, and
> let the scheduler make its decision.
agreed, the -H4 patch implements this cleanup:
http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-pre11-H4.patch
the timer code calls the scheduler_tick() function once per local timer
interrupt. This function then decides whether the current task is idle or
not. (this should also fix the timeslice-expiration bug for any process
that happens to have pid 0.)
(there is one small ugliness left: the function also has to be prepared
for a not-yet-complete scheduler state, rq->idle == NULL and rq->curr ==
NULL. I'll fix this later, it's a detail.)
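in sketch form the decision looks something like this (illustrative only,
the rebalance helpers are placeholders and this is not the actual -H4
code):

/* placeholder helpers, purely illustrative */
static void idle_rebalance(runqueue_t *rq) { /* pull from the busiest runqueue */ }
static void busy_rebalance(runqueue_t *rq) { /* slow fairness pass */ }

void scheduler_tick(void)
{
	runqueue_t *rq = this_rq();
	task_t *p = rq->curr;

	/* early boot: scheduler state not complete yet */
	if (!p || !rq->idle)
		return;

	if (p == rq->idle) {
		/* idle CPU: rebalance at the fast IDLE_REBALANCE_TICK rate */
		idle_rebalance(rq);
		return;
	}
	/* busy CPU: (timeslice accounting omitted) rebalance only at the
	 * slow BUSY_REBALANCE_TICK rate */
	busy_rebalance(rq);
}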
(-H4 has been sanity-compiled & booted on SMP.)
other changes since -H1:
- removed dead code from wakeup.
Ingo
On Thu, 10 Jan 2002, Mike Kravetz wrote:
> If I run 3 cpu-hog tasks on a 2 CPU system, then 1 task will get an
> entire CPU while the other 2 tasks share the other CPU (easily
> verified by a simple test program). On previous versions of the
> scheduler 'balancing' this load was achieved by the global nature of
> time slices. No task was given a new time slice until the time slices
> of all runnable tasks had expired. In the current scheduler, the
> decision to replenish time slices is made at a local (per-CPU) level.
> I assume the load balancing code should take care of the above
> workload? OR is this the behavior we desire? [...]
Arguably this is the most extreme situation - every other distribution
(2:3, 3:4) is much less problematic. Will this cause problems? We could
make the fairness-balancer more 'sharp' so that it will oscillate the
length of the two runqueues at a slow pace, but it's still caching loss.
> We certainly have optimal cache use.
indeed. The question is, should we migrate processes around just to get
100% fairness in 'top' output? The (implicit) cost of a task migration
(caused by the destruction & rebuilding of cache state) can be 10
milliseconds easily on a system with big caches.
Ingo
On Fri, 11 Jan 2002, Ingo Molnar wrote:
> indeed. The question is, should we migrate processes around just to get
> 100% fairness in 'top' output? The (implicit) cost of a task migration
> (caused by the destruction & rebuilding of cache state) can be 10
> milliseconds easily on a system with big caches.
10 ms is exactly what i've observed while i was coding the BMQS balance
code. Leaving a CPU idle for more than 10 ms will make real tests like
kernel builds suffer performance degradation. By using 10 ms i always
got the same times as with the standard scheduler.
- Davide
Ingo Molnar wrote:
> On Thu, 10 Jan 2002, Mike Kravetz wrote:
>
>
>>If I run 3 cpu-hog tasks on a 2 CPU system, then 1 task will get an
>>entire CPU while the other 2 tasks share the other CPU (easily
>>verified by a simple test program). On previous versions of the
>>scheduler 'balancing' this load was achieved by the global nature of
>>time slices. No task was given a new time slice until the time slices
>>of all runnable tasks had expired. In the current scheduler, the
>>decision to replenish time slices is made at a local (per-CPU) level.
>>I assume the load balancing code should take care of the above
>>workload? OR is this the behavior we desire? [...]
>>
>
> Arguably this is the most extreme situation - every other distribution
> (2:3, 3:4) is much less problematic. Will this cause problems? We could
> make the fairness-balancer more 'sharp' so that it will oscillate the
> length of the two runqueues at a slow pace, but it's still caching loss.
>
>
>>We certainly have optimal cache use.
>>
>
> indeed. The question is, should we migrate processes around just to get
> 100% fairness in 'top' output? The (implicit) cost of a task migration
> (caused by the destruction & rebuilding of cache state) can be 10
> milliseconds easily on a system with big caches.
>
> Ingo
I do vote for optimal cache use. Using squid (200MB process in my case)
can be much faster if squid stays on the same CPU for a while, instead
of hopping from one CPU to another (dual PII350 machine).
François Cami
> I do vote for optimal cache use. Using squid (200MB process in my case)
> can be much faster if squid stays on the same CPU for a while, instead
> of hopping from one CPU to another (dual PII350 machine).
well, a typical desktop processor has 256-512K of cache
and at least 500 MB/s of DRAM bandwidth, so the cache flush time
is only ~1 ms (512 KB at 500 MB/s), and can be as low as ~0.2 ms. The fact
that all current CPUs are out-of-order can hide some of this, as well...
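(back-of-the-envelope, with assumed bandwidth figures, illustrative only:)

#include <stdio.h>

int main(void)
{
	double cache_kb[] = { 256.0, 512.0 };
	double bw_mb_s[] = { 500.0, 1300.0 };	/* assumed DRAM bandwidths */
	int i, j;

	for (i = 0; i < 2; i++)
		for (j = 0; j < 2; j++)
			printf("%3.0fK cache / %4.0f MB/s = %.2f ms to refill\n",
			       cache_kb[i], bw_mb_s[j],
			       cache_kb[i] / 1024.0 / bw_mb_s[j] * 1000.0);
	return 0;
}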
On Friday 11 January 2002 15:42, François Cami wrote:
> Ingo Molnar wrote:
> > On Thu, 10 Jan 2002, Mike Kravetz wrote:
> >>If I run 3 cpu-hog tasks on a 2 CPU system, then 1 task will get an
> >>entire CPU while the other 2 tasks share the other CPU (easily
> >>verified by a simple test program). On previous versions of the
> >>scheduler 'balancing' this load was achieved by the global nature of
> >>time slices. No task was given a new time slice until the time slices
> >>of all runnable tasks had expired. In the current scheduler, the
> >>decision to replenish time slices is made at a local (per-CPU) level.
> >>I assume the load balancing code should take care of the above
> >>workload? OR is this the behavior we desire? [...]
> >
> > Arguably this is the most extreme situation - every other distribution
> > (2:3, 3:4) is much less problematic. Will this cause problems? We could
> > make the fairness-balancer more 'sharp' so that it will oscillate the
> > length of the two runqueues at a slow pace, but it's still caching loss.
> >
> >>We certainly have optimal cache use.
> >
> > indeed. The question is, should we migrate processes around just to get
> > 100% fairness in 'top' output? The (implicit) cost of a task migration
> > (caused by the destruction & rebuilding of cache state) can be 10
> > milliseconds easily on a system with big caches.
> >
> > Ingo
>
> I do vote for optimal cache use. Using squid (200MB process in my case)
> can be much faster if squid stays on the same CPU for a while, instead
> of hopping from one CPU to another (dual PII350 machine).
>
> François Cami
But, given the above case, what happens when you have Sendmail on
the first CPU and Squid is sharing the second CPU? This is not optimal
either, or am I missing something?
--
[email protected].
Robert Love wrote:
> On Fri, 2002-01-11 at 16:46, Timothy Covell wrote:
>
>> But, given the above case, what happens when you have Sendmail on
>> the first CPU and Squid is sharing the second CPU? This is not optimal
>> either, or am I missing something?
>
> Correct. I sort of took the "optimal cache use" comment as
> tongue-in-cheek. If I am mistaken, correct me, but here is my
> perception of the scenario:
>
> 2 CPUs, 3 tasks. 1 task receives 100% of the CPU time on one CPU. The
> remaining two tasks share the second CPU. The result is, of three
> evenly prioritized tasks, one receives double as much CPU time as the
> others.
Yes, but that makes sense if the one that receives double as much
CPU time as the others _needs_ that CPU time. For example, an idle
MTA _should_ not receive as much CPU time as an overworked proxy
server... Yet, unless someone changes the priority, they are treated
evenly by the scheduler. This is not so good in my mind.
> Aside from the cache utilization, this is not really "fair" -- the
> problem is, the current design of load_balance (which is quite good)
> just won't throw the tasks around so readily. What could be done --
> cleanly -- to make this better?
>
> Robert Love
Is it possible for the scheduler to take into account parameters other
than priority when deciding an app's CPU affinity? For example, memory
footprint and %CPU. If an app accounts for 80% of the total CPU
utilisation, by all means it should stay on the same CPU. That, in my
mind, would have the effect that the app would run faster (because of
the cache) AND that the load on the machine would decrease (having to
bump that app from one CPU to another is _not_ good).
Is this feasible, or am I dreaming?
François Cami
On Friday 11 January 2002 23:45, Robert Love wrote:
> On Fri, 2002-01-11 at 16:46, Timothy Covell wrote:
> > But, given the above case, what happens when you have Sendmail on
> > the first CPU and Squid is sharing the second CPU? This is not optimal
> > either, or am I missing something?
>
> Correct. I sort of took the "optimal cache use" comment as
> tongue-in-cheek. If I am mistaken, correct me, but here is my
> perception of the scenario:
>
> 2 CPUs, 3 tasks. 1 task receives 100% of the CPU time on one CPU. The
> remaining two tasks share the second CPU. The result is, of three
> evenly prioritized tasks, one receives double as much CPU time as the
> others.
>
> Aside from the cache utilization, this is not really "fair" -- the
> problem is, the current design of load_balance (which is quite good)
> just won't throw the tasks around so readily. What could be done --
> cleanly -- to make this better?
>
> Robert Love
That's the million dollar question. I was just concerned that if that
were to be implemented in a production kernel, then lots of admins
would be confused.
--
[email protected].
On Sat, 2002-01-12 at 11:26, Timothy Covell wrote:
> That's the million dollar question. I was just concerned that if that
> were to be implemented in a production kernel, then lots of admins
> would be confused.
It is in 2.5. Let's agree on whether or not it is a problem, and then find
a solution.
Robert Love
On 12 Jan 2002, Robert Love wrote:
> On Fri, 2002-01-11 at 16:46, Timothy Covell wrote:
>
> > But, given the above case, what happens when you have Sendmail on
> > the first CPU and Squid is sharing the second CPU? This is not optimal
> > either, or am I missing something?
>
> Correct. I sort of took the "optimal cache use" comment as
> tongue-in-cheek. If I am mistaken, correct me, but here is my
> perception of the scenario:
>
> 2 CPUs, 3 tasks. 1 task receives 100% of the CPU time on one CPU. The
> remaining two tasks share the second CPU. The result is, of three
> evenly prioritized tasks, one receives double as much CPU time as the
> others.
>
> Aside from the cache utilization, this is not really "fair" -- the
> problem is, the current design of load_balance (which is quite good)
> just won't throw the tasks around so readily. What could be done --
> cleanly -- to make this better?
My opinion is: if it can be solved with no more than 20 lines of code,
let's do it; otherwise let's see what kind of catastrophe will happen by
allowing such behavior. Because i've already seen hundreds of lines of
code added to solve corner cases and removed after 3-4 years because
someone realized that maybe such corner cases do not matter more than a
whit.
I'll be happy to be shot down here ...
- Davide
On Saturday 12 January 2002 14:44, Davide Libenzi wrote:
> On 12 Jan 2002, Robert Love wrote:
> > On Fri, 2002-01-11 at 16:46, Timothy Covell wrote:
> > > But, given the above case, what happens when you have Sendmail on
> > > the first CPU and Squid is sharing the second CPU? This is not optimal
> > > either, or am I missing something?
> >
> > Correct. I sort of took the "optimal cache use" comment as
> > tongue-in-cheek. If I am mistaken, correct me, but here is my
> > perception of the scenario:
> >
> > 2 CPUs, 3 tasks. 1 task receives 100% of the CPU time on one CPU. The
> > remaining two tasks share the second CPU. The result is, of three
> > evenly prioritized tasks, one receives double as much CPU time as the
> > others.
> >
> > Aside from the cache utilization, this is not really "fair" -- the
> > problem is, the current design of load_balance (which is quite good)
> > just won't throw the tasks around so readily. What could be done --
> > cleanly -- to make this better?
>
> My opinion is: if it can be solved with no more than 20 lines of code,
> let's do it; otherwise let's see what kind of catastrophe will happen by
> allowing such behavior. Because i've already seen hundreds of lines of
> code added to solve corner cases and removed after 3-4 years because
> someone realized that maybe such corner cases do not matter more than a
> whit.
> I'll be happy to be shot down here ...
>
>
>
> - Davide
There's always the 90/10 rule about 90% of the work being necessary to
get out the last 10% of improvements. It looks like you guys have been
working quite hard at this already and making good progress. Personally,
I have nothing to offer beyond the ideas of Francois.
--
[email protected].
On Sat, 2002-01-12 at 15:44, Davide Libenzi wrote:
> My opinion is: if it can be solved with no more than 20 lines of code,
> let's do it; otherwise let's see what kind of catastrophe will happen by
> allowing such behavior. Because i've already seen hundreds of lines of
> code added to solve corner cases and removed after 3-4 years because
> someone realized that maybe such corner cases do not matter more than a
> whit.
> I'll be happy to be shot down here ...
Completely agreed. I think it's an unfair situation (we may see
administrators timing the order in which they start large batch tasks), but
it is a corner case and we do have an "optimal cache use" counterargument.
Robert Love
In message <[email protected]> you write:
> > Aside from the cache utilization, this is not really "fair" -- the
> > problem is, the current design of load_balance (which is quite good)
> > just won't throw the tasks around so readily. What could be done --
> > cleanly -- to make this better?
>
> My opinion is: if it can be solved with no more than 20 lines of code
> let's do it, otherwise let's see what kind of catastrophe will happen by
> allowing such behavior.
Agree. Anyone who really has 3 CPU hogs on a 2 CPU machine, *and*
never runs two more tasks to perturb the system, *and* notices that
one runs twice the speed of the other two, *and* cares about fairness
(ie. not RC5 etc), feel free to Email abuse to me. Not Ingo, he has
real work to do 8)
Cheers!
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
Rusty wrote:
> Agree. Anyone who really has 3 CPU hogs on a 2 CPU machine, *and*
> never runs two more tasks to perturb the system, *and* notices that
> one runs twice the speed of the other two, *and* cares about fairness
> (ie. not RC5 etc), feel free to Email abuse to me. Not Ingo, he has
> real work to do 8)
Or buy a third CPU...;-)
Regards,
Dieter
BTW Not meant as flaming.
--
Dieter Nützel
Graduate Student, Computer Science
University of Hamburg
Department of Computer Science
@home: [email protected]
Timothy Covell wrote:
...
>
>But, given the above case, what happens when you have Sendmail on
>the first CPU and Squid is sharing the second CPU? This is not optimal
>either, or am I missing something?
>
It will not happen (unless you have light-speed disks and NICs): in this
scenario, both squid and sendmail are I/O hogs, not CPU hogs.