2001-11-26 19:48:44

by Mike Kravetz

Subject: Scheduler Cleanup

I'm happy to see the cleanup of scheduler code that went into
2.4.15/16. One small difference in behavior (I think) is that
the currently running task is not given preference over other
tasks on the runqueue with the same 'goodness' value. I would
think giving the current task preference is a good thing
(especially in light of recent discussions about too frequent
moving/rescheduling of tasks). Can someone provide the rationale
for this change? Was it just the result of making the code
cleaner? Is it believed that this won't really make a difference?

--
Mike


2001-11-26 20:43:01

by Davide Libenzi

Subject: Re: Scheduler Cleanup

On Mon, 26 Nov 2001, Mike Kravetz wrote:

> I'm happy to see the cleanup of scheduler code that went into
> 2.4.15/16. One small difference in behavior (I think) is that
> the currently running task is not given preference over other
> tasks on the runqueue with the same 'goodness' value. I would
> think giving the current task preference is a good thing
> (especially in light of recent discussions about too frequent
> moving/rescheduling of tasks). Can someone provide the rational
> for this change? Was it just the result of making the code
> cleaner? Is it believed that this won't really make a difference?

Mike, I was actually surprised about the presence of that check inside the
previous code.
If you think about it, when does a running task end up calling schedule()?

1) an IRQ wakes up some I/O-bound task
2) its timeslice quota has expired

With 1) you have an incoming I/O-bound task ( ie: ksoftirqd_* ) that is very
likely going to have a better dynamic priority ( if not, reschedule_idle()
does not set need_resched ), while with 2) you have the task's counter == 0.
In both cases the test is not only useless but also introduces 1) an extra
branch in the fast path and 2) the cost of an extra goodness() call.




- Davide



2001-11-27 20:57:58

by Shaya Potter

Subject: Re: Scheduler Cleanup

On Mon, 2001-11-26 at 15:49, Davide Libenzi wrote:
> On Mon, 26 Nov 2001, Mike Kravetz wrote:
>
> > I'm happy to see the cleanup of scheduler code that went into
> > 2.4.15/16. One small difference in behavior (I think) is that
> > the currently running task is not given preference over other
> > tasks on the runqueue with the same 'goodness' value. I would
> > think giving the current task preference is a good thing
> > (especially in light of recent discussions about too frequent
> > moving/rescheduling of tasks). Can someone provide the rational
> > for this change? Was it just the result of making the code
> > cleaner? Is it believed that this won't really make a difference?
>
> Mike, I was actually surprised about the presence of that check inside the
> previous code.
> If you think about it, when a running task is scheduled ?
>
> 1) an IRQ wakeup some I/O bound task
> 2) the quota is expired
>
> With 1) you've an incoming I/O bound task ( ie: ksoftirqd_* ) that is very
> likely going to have a better dynamic priority ( if not reschedule_idle()
> does not set need_resched ), while with 2) you've the task counter == 0.
> In both cases not only the test is useless but is going to introduce 1)
> the branch in the fast path 2) the cost of an extra goodness().

Doesn't schedule() also get called when a new task is put on the
runqueue?

When that happens, doesn't the check matter? Or perhaps I'm just
mistaken.

thanks,

shaya


2001-11-27 21:54:28

by George Anzinger

Subject: Re: Scheduler Cleanup

Shaya Potter wrote:
>
> On Mon, 2001-11-26 at 15:49, Davide Libenzi wrote:
> > On Mon, 26 Nov 2001, Mike Kravetz wrote:
> >
> > > I'm happy to see the cleanup of scheduler code that went into
> > > 2.4.15/16. One small difference in behavior (I think) is that
> > > the currently running task is not given preference over other
> > > tasks on the runqueue with the same 'goodness' value. I would
> > > think giving the current task preference is a good thing
> > > (especially in light of recent discussions about too frequent
> > > moving/rescheduling of tasks). Can someone provide the rational
> > > for this change? Was it just the result of making the code
> > > cleaner? Is it believed that this won't really make a difference?
> >
> > Mike, I was actually surprised about the presence of that check inside the
> > previous code.
> > If you think about it, when a running task is scheduled ?
> >
> > 1) an IRQ wakeup some I/O bound task
> > 2) the quota is expired
> >
> > With 1) you've an incoming I/O bound task ( ie: ksoftirqd_* ) that is very
> > likely going to have a better dynamic priority ( if not reschedule_idle()
> > does not set need_resched ), while with 2) you've the task counter == 0.
> > In both cases not only the test is useless but is going to introduce 1)
> > the branch in the fast path 2) the cost of an extra goodness().
>
> doesn't schedule() also get called when a new task is put on the
> runqueue?
>
> when that happens, doesn't the check matter? or perhaps I'm just
> mistaken.
>
That is the same as 1) above. reschedule_idle() should determine if the
new task is to get the cpu and set need_resched as needed.

--
George [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/

2001-11-27 23:09:54

by Ingo Molnar

Subject: Re: Scheduler Cleanup


On Mon, 26 Nov 2001, Mike Kravetz wrote:

> I'm happy to see the cleanup of scheduler code that went into
> 2.4.15/16. One small difference in behavior (I think) is that the
> currently running task is not given preference over other tasks on the
> runqueue with the same 'goodness' value. I would think giving the
> current task preference is a good thing (especially in light of recent
> discussions about too frequent moving/rescheduling of tasks). Can
> someone provide the rational for this change? Was it just the result
> of making the code cleaner? Is it believed that this won't really
> make a difference?

i've done this change as part of the sched_yield() fixes/cleanups, and the
main reason for it is that the current process is preferred *anyway*, due
to getting the +1 boost via current->mm == this_mm in goodness().

(and besides, the percentage/probability of cases where we'd fail to reselect
a runnable process that the previous scheduler would have reselected is very
very low. It does not justify adding a branch to the scheduler hotpath
IMO. In 99.9% of the cases if a runnable process is executing schedule()
then there is a higher priority process around that will win the next
selection. Or if there is a wakeup race, then the process will win the
selection very likely because it won the previous selection.)

Ingo

2001-12-05 22:02:01

by Mike Kravetz

Subject: Re: Scheduler Cleanup

On Wed, Nov 28, 2001 at 02:07:08AM +0100, Ingo Molnar wrote:
>
> On Mon, 26 Nov 2001, Mike Kravetz wrote:
>
> > I'm happy to see the cleanup of scheduler code that went into
> > 2.4.15/16. One small difference in behavior (I think) is that the
> > currently running task is not given preference over other tasks on the
> > runqueue with the same 'goodness' value. I would think giving the
> > current task preference is a good thing (especially in light of recent
> > discussions about too frequent moving/rescheduling of tasks). Can
> > someone provide the rational for this change? Was it just the result
> > of making the code cleaner? Is it believed that this won't really
> > make a difference?
>
> i've done this change as part of the sched_yield() fixes/cleanups, and the
> main reason for it is that the current process is preferred *anyway*, due
> to getting the +1 boost via current->mm == this_mm in goodness().
>
> (and besides, the percentage/probability of cases where we'd fail reselect
> a runnable process where the previous scheduler would reselect it is very
> very low. It does not justify adding a branch to the scheduler hotpath
> IMO. In 99.9% of the cases if a runnable process is executing schedule()
> then there is a higher priority process around that will win the next
> selection. Or if there is a wakeup race, then the process will win the
> selection very likely because it won the previous selection.)
>
> Ingo

FYI - I have been having a heck of a time getting our MQ scheduler to
work well in 2.4.16/2.5.0. The problem was mainly with performance
when running (I'm afraid to say) VolanoMark. Turns out that the above
change makes a big difference in this benchmark when running with the
MQ scheduler. There is an almost 50% drop in performance on my 8-way.
I suspect that one would not see such a dramatic drop (if any) with
the current scheduler as its performance is mostly limited by lock
contention in this benchmark.

Now, I'm aware that very few people are actively using our MQ scheduler,
and even fewer care about VolanoMark performance (perhaps no one on this
list). However, this seemed like an interesting observation.

--
Mike

2001-12-05 22:14:08

by Robert Love

Subject: Re: Scheduler Cleanup

On Wed, 2001-12-05 at 16:58, Mike Kravetz wrote:

> FYI - I have been having a heck of a time getting our MQ scheduler to
> work well in 2.4.16/2.5.0. The problem was mainly with performance
> when running (I'm afraid to say) VolanoMark. Turns out that the above
> change makes a big difference in this benchmark when running with the
> MQ scheduler. There is an almost 50% drop in performance on my 8 way.
> I suspect that one would not see such a dramatic drop (if any) with
> the current scheduler as its performance is mostly limited by lock
> contention in this benchmark.

Ehh, odd. How does the dropped performance compare to MQ performance
before 2.4.16? In other words, are we solving problems in the newer
kernels and now MQ is becoming overhead?

> Now, I'm aware that very few people are actively using our MQ scheduler,
> and even fewer care about VolanoMark performance (perhaps no one on this
> list). However, this seemed like an interesting observation.

Perhaps, but many (myself) are interested in a multi-queue
scheduler...:)

Robert Love

2001-12-05 22:49:48

by Mike Kravetz

Subject: Re: Scheduler Cleanup

On Wed, Dec 05, 2001 at 05:13:14PM -0500, Robert Love wrote:
> Ehh, odd. How does the dropped performance compare to MQ performance
> before 2.4.16? In other words, are we solving problems in the newer
> kernels and now MQ is becoming overhead?

No, I don't think that is the case. The throughput numbers for this
benchmark at one specific data point on an 8-way look like this:

2.4.14 Default Scheduler   39463
2.4.14 MQ Scheduler       385687
2.4.16 Default Scheduler   51364
2.4.16 MQ Scheduler       240667

The MQ numbers are still quite a bit better even on 2.4.16. I can
easily get MQ 2.4.16 back up to MQ 2.4.14 levels by reintroducing the
code to give a slight preference to the currently running task.
However, our MQ scheduler is trying to closely match the behavior of
the default scheduler wherever possible (for better or worse). It
should also be noted that performance of the default scheduler
increased as a result of Ingo's changes.

It is entirely possible that there is some other bug/feature in our
MQ code that is causing this situation. As mentioned by others, the
scheduler code that was removed from 2.4.15 should have little if
any impact on performance. I need to do some more analysis here.

--
Mike

2001-12-05 23:34:08

by Davide Libenzi

Subject: Re: Scheduler Cleanup

On Wed, 5 Dec 2001, Mike Kravetz wrote:

> On Wed, Nov 28, 2001 at 02:07:08AM +0100, Ingo Molnar wrote:
> >
> > On Mon, 26 Nov 2001, Mike Kravetz wrote:
> >
> > > I'm happy to see the cleanup of scheduler code that went into
> > > 2.4.15/16. One small difference in behavior (I think) is that the
> > > currently running task is not given preference over other tasks on the
> > > runqueue with the same 'goodness' value. I would think giving the
> > > current task preference is a good thing (especially in light of recent
> > > discussions about too frequent moving/rescheduling of tasks). Can
> > > someone provide the rational for this change? Was it just the result
> > > of making the code cleaner? Is it believed that this won't really
> > > make a difference?
> >
> > i've done this change as part of the sched_yield() fixes/cleanups, and the
> > main reason for it is that the current process is preferred *anyway*, due
> > to getting the +1 boost via current->mm == this_mm in goodness().
> >
> > (and besides, the percentage/probability of cases where we'd fail reselect
> > a runnable process where the previous scheduler would reselect it is very
> > very low. It does not justify adding a branch to the scheduler hotpath
> > IMO. In 99.9% of the cases if a runnable process is executing schedule()
> > then there is a higher priority process around that will win the next
> > selection. Or if there is a wakeup race, then the process will win the
> > selection very likely because it won the previous selection.)
> >
> > Ingo
>
> FYI - I have been having a heck of a time getting our MQ scheduler to
> work well in 2.4.16/2.5.0. The problem was mainly with performance
> when running (I'm afraid to say) VolanoMark. Turns out that the above
> change makes a big difference in this benchmark when running with the
> MQ scheduler. There is an almost 50% drop in performance on my 8 way.
> I suspect that one would not see such a dramatic drop (if any) with
> the current scheduler as its performance is mostly limited by lock
> contention in this benchmark.
>
> Now, I'm aware that very few people are actively using our MQ scheduler,
> and even fewer care about VolanoMark performance (perhaps no one on this
> list). However, this seemed like an interesting observation.

Mike, before giving any scheduler benchmark meaning, you have to verify that
the process distribution among CPUs is the same as with the old scheduler.
Different process distributions can give very different numbers.
Anyway, I too have verified increased latency with a sched_yield() test,
and in the next few days I'm going to try the real one with the cycle
counter sampler.
I have a suspicion, but I have to see the disassembly of schedule() before
talking :)



- Davide


2001-12-05 23:47:08

by Mike Kravetz

Subject: Re: Scheduler Cleanup

On Wed, Dec 05, 2001 at 03:44:42PM -0800, Davide Libenzi wrote:
> Anyway me too have verified an increased latency with sched_yield() test
> and next days I'm going to try the real one with the cycle counter
> sampler.
> I've a suspect, but i've to see the disassembly of schedule() before
> talking :)

One thing to note is that possible acquisition of the runqueue
lock was reintroduced in sys_sched_yield(). From looking at
the code, it seems the purpose was to 'add fairness' in the case
of multiple yielders. Is that correct Ingo?

--
Mike

2001-12-06 00:15:48

by Davide Libenzi

Subject: Re: Scheduler Cleanup

On Wed, 5 Dec 2001, Mike Kravetz wrote:

> On Wed, Dec 05, 2001 at 03:44:42PM -0800, Davide Libenzi wrote:
> > Anyway me too have verified an increased latency with sched_yield() test
> > and next days I'm going to try the real one with the cycle counter
> > sampler.
> > I've a suspect, but i've to see the disassembly of schedule() before
> > talking :)
>
> One thing to note is that possible acquisition of the runqueue
> lock was reintroduced in sys_sched_yield(). From looking at
> the code, it seems the purpose was to 'add fairness' in the case
> of multiple yielders. Is that correct Ingo?

Yep, suppose you've three running tasks A, B and C ( in that order ) and
suppose A and B are yield()ing.
You get:

A B A B A B A B A B A ...

until the priority of A or B drops, then C has a chance to execute.
Since with the new counter decay code the priority remains up until a sudden
drop, why don't we decrease the counter by K ( 1? ) for each sched_yield()
call? In this way we can easily avoid the above pattern in the case of
multiple yield()ers.



PS: the sched_yield() code path is important because it is called a number
of times inside the pthread spin lock wait path before falling back to
usleep().



- Davide



2001-12-06 08:51:57

by Ingo Molnar

Subject: Re: Scheduler Cleanup


On Wed, 5 Dec 2001, Mike Kravetz wrote:

> One thing to note is that possible acquisition of the runqueue lock
> was reintroduced in sys_sched_yield(). From looking at the code, it
> seems the purpose was to 'add fairness' in the case of multiple
> yielders. Is that correct Ingo?

yes, it's to add fairness. You might remember that i did this
sched_yield() optimization originally to help Volanomark performance. But
it turned out that it's very unfair not to move yielded processes to the
end of the runqueue - it might even cause livelocks in user-space
spinlocks which use sched_yield(). (since the POLICY_YIELD bit prevents
the process from running only *once*, so it will handle a two-process lock
situation right, but if multiple processes are racing for the lock then
they might exclude the real owner of the spinlock for a *long* time.)

(plus the change also broke sched_yield() for RT-FIFO processes.)

(nevertheless i'm sure we can add any invariant optimizations or fixes to
sys_sched_yield(), but this one - no matter how nice it was to this
particular benchmark, was neither invariant, nor a fix.)

the mere fact that your workload is so sensitive to sched_yield() LIFO vs.
FIFO yielding of runnable threads shows that there is some serious locking
design problem. It shows that 1) too many threads are running, and 2)
only a small subset of those runnable threads are doing the real work, while
the rest are runnable for some unknown reason.

i think the user-space locking mechanism the IBM JVM uses should *not* be
based on sched_yield(). [is it using LinuxThreads? LinuxThreads have some
serious design problems IMO.] One thing i remember from running Volanomark
was the huge amount of signal-related, mostly bogus wakeups. This i
believe is mainly due to the manager-thread design of LinuxThreads. That
aspect should be fixed i think; i believe it can be done correctly (we can
get rid of the manager thread) via the latest 2.4 kernel.

while i see the point that the multiqueue scheduler improves performance
visibly, i'm quite sure you could get an *order of magnitude* faster if
the basic threading model (and any underlying mechanism, such as
LinuxThreads) was fixed. The current Linux scheduler just magnifies these
userspace design problems. LinuxThreads was done when there weren't many
good mechanisms within the kernel to help threading. Things have changed
by today - but if something still cannot be done cleanly and efficiently
(such as userspace spinlocks or semaphores) then please let us know so we
can fix things, instead of trying to work around the symptoms. But this is
just a suggestion.

Ingo

2001-12-06 18:14:34

by Davide Libenzi

Subject: Re: Scheduler Cleanup

On Thu, 6 Dec 2001, Ingo Molnar wrote:

>
> On Wed, 5 Dec 2001, Mike Kravetz wrote:
>
> > One thing to note is that possible acquisition of the runqueue lock
> > was reintroduced in sys_sched_yield(). From looking at the code, it
> > seems the purpose was to 'add fairness' in the case of multiple
> > yielders. Is that correct Ingo?
>
> yes, it's to add fairness. You might remember that i did this
> sched_yield() optimization originally to help Volanomark performance. But
> it turned out that it's very unfair not to move yielded processes to the
> end of the runqueue - it might even cause livelocks in user-space
> spinlocks which use sched_yield(). (since the POLICY_YIELD bit prevents
> the process from running only *once*, so it will handle a two-process lock
> situation right, but if multiple processes are racing for the lock then
> they might exclude the real owner of the spinlock for a *long* time.)
>
> (plus the change also broke sched_yield() for RT-FIFO processes.)

What about decreasing the counter by 1 for each sched_yield() call?
In this case we achieve ( especially with the counter decay patch ) a
correct task recycling without acquiring the lock inside the system call.



> while i see the point that the multiqueue scheduler improves performance
> visibly, i'm quite sure you could get an *order of magnitude* faster if
> the basic threading model (and any possible underlying mechanizm, such as
> LinuxThreads) was fixed. The current Linux scheduler just magnifies these
> userspace design problems. LinuxThreads was done when there werent many
> good mechanizms within the kernel to help threading. Things have changed
> by today - but if something still cannot be done cleanly and efficiently
> (such as userspace spinlocks or semaphores) then please let us know so we
> can fix things, instead of trying to work around the symptoms. But this is
> just a suggestion.

Ingo, you have to admit that having separate runqueues with separate locks
is more than a benchmark fix; it's a matter of sane design.




- Davide


2001-12-06 20:06:10

by Ingo Molnar

Subject: Re: Scheduler Cleanup


On Thu, 6 Dec 2001, Davide Libenzi wrote:

> What about decreasing counter by 1 for each sched_yield() call ?

we did that in earlier kernels - it's not really the right thing to do, as
calling yield() does not mean we are willing to give up a *timeslice*. It
only means that right now we are not able to proceed.

Ingo


2001-12-06 20:27:43

by Davide Libenzi

Subject: Re: Scheduler Cleanup

On Thu, 6 Dec 2001, Ingo Molnar wrote:

>
> On Thu, 6 Dec 2001, Davide Libenzi wrote:
>
> > What about decreasing counter by 1 for each sched_yield() call ?
>
> we did that in earlier kernels - it's not really the right thing to do, as
> calling yield() does not mean we are willing to give up a *timeslice*. It
> only means that right now we are not able to proceed.

The old code gave up the whole timeslice, which is a bit excessive.
With the counter decay patch plus decreasing the counter by 1, you get
the proper execution of non-yielding tasks, while if you have only yielding
tasks, who cares if they rapidly give up their whole timeslices sequentially.
The other, more expensive solution is to have a side-counter
accumulation, where the timeslice subtracted from the counter accumulates
and is remerged at the proper time ( recalc loop ).




- Davide


2001-12-08 16:36:53

by Martin Devera

Subject: The best VM algorithm for Linux

Hi all,

is anyone interested in how far the current VM subsystem is from
the best/fastest one?

I've tried to find a way to compute it. I think it is
possible, but I don't have the time to do it. You might be
interested in this, so read the article at

http://luxik.cdi.cz/~devik/bestvm.htm

Comments are welcome.

regards, devik