2001-04-03 02:20:42

by Fabio Riccardi

Subject: a quest for a better scheduler

Hello,

I sent a message a few days ago about some limitations I found in the
Linux scheduler.

In servers like Apache where a large (> 1000) number of processes can be
running at the same time and where many of them are runnable at the same
time, the default Linux scheduler just starts thrashing and the machine
very rapidly becomes unusable.

Performance degradations are quite noticeable on a two-way SMP machine
(20-30% of the CPU gets lost) and are even more glaring on a multi-CPU
machine. As an example, an 8-way Compaq ProLiant just crawls with Linux.

From the feedback I received I realized that there are at least two
possible solutions to the problem:

http://lse.sourceforge.net/scheduling/

http://resourcemanagement.unixsolutions.hp.com/WaRM/schedpolicy.html

Indeed I've tried the patches available on those sites for the multi-queue
scheduler and I was amazed by the performance improvement that I got.
Both patches allow me to get to 100% real CPU utilization on a two-way
machine running ~1500 processes.

What those patches do is quite simple: instead of having the single
global process queue present in the normal Linux scheduler, they add
multiple queues (one per CPU). In this way the scheduling decision can
be greatly simplified and made almost local to each CPU. No hotspots, no
global locks (well, almost).
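
Roughly, the structure they introduce looks something like the sketch below
(a simplified, compile-able illustration with invented names, not the actual
patch code):

/* Sketch of the per-CPU runqueue idea (not the HP or LSE patch code).
 * Each CPU owns a runqueue with its own lock, so the common scheduling
 * path touches only local data.  All names here are invented. */
#include <pthread.h>
#include <stddef.h>

#define NR_CPUS 8

struct task {
    struct task *next;        /* singly-linked runqueue */
    int          counter;     /* remaining timeslice, as in the 2.4 scheduler */
    int          nice;
};

struct runqueue {
    pthread_spinlock_t lock;  /* one lock per CPU instead of one global lock */
    struct task       *head;  /* runnable tasks local to this CPU */
    int                nr_running;
} rq[NR_CPUS];

/* Pick the next task for this CPU, looking only at the local queue.
 * (The real patches still need an occasional global pass for balancing.) */
struct task *pick_next_local(int cpu)
{
    struct runqueue *q = &rq[cpu];
    struct task *p, *best = NULL;
    int best_weight = -1;

    pthread_spin_lock(&q->lock);
    for (p = q->head; p; p = p->next) {
        int weight = p->counter - p->nice;   /* crude stand-in for goodness() */
        if (weight > best_weight) {
            best_weight = weight;
            best = p;
        }
    }
    pthread_spin_unlock(&q->lock);
    return best;
}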

Although some scalability problems are still there (there still is a
global decision to make), the performance improvement obtained and the
simplicity of the solution are remarkable.

The HP patch is probably the most interesting, since it consists of
just a few lines of code and achieves (for what I could measure) the
same kind of performance improvement as the more elaborate (but still
quite simple) SourceForge patch.

Is there any special reason why none of those patches made it into
the mainstream kernel code?

TIA, ciao,

- Fabio



2001-04-03 09:57:31

by Ingo Molnar

Subject: Re: a quest for a better scheduler


On Mon, 2 Apr 2001, Fabio Riccardi wrote:

> I sent a message a few days ago about some limitations I found in the
> Linux scheduler.
>
> In servers like Apache where a large (> 1000) number of processes can
> be running at the same time and where many of them are runnable at the
> same time, the default Linux scheduler just starts thrashing and the
> machine very rapidly becomes unusable.

it is *not* a scheduler problem. This is an application design problem.
Do not expect > 1000 runnable processes to perform well. Apache's
one-process-per-parallel-client-connection application model is especially
bad for this kind of load. The scheduler has more important priorities to
worry about than to fix application design problems.

> Indeed I've tried the patches available on the sites for the
> multi-queue scheduler and I was amazed by the performance improvement
> that I got. Both patches allow me to get to a 100% real CPU
> utilization on a two way machine running ~1500 processes.

There have been many promises over the years, and none so far has met all
the requirements the scheduler has to fulfill. 'Many runnable processes' is
not at the top of the list - if we are scheduling like mad then we have
already lost lots of performance.

even with such 'improvements', the many-parallel-clients performance of
one-process-per-request HTTP server designs is an order of magnitude slower
than what is possible.

having said that, improving this workload is still a useful task and we
have added scheduler improvements for this case. But those patches
sacrifice other functionality just to get better
many-runnable-processes performance, which is unacceptable.

> What those patches do is quite simple: instead of having the single
> global process queue present in the normal Linux scheduler, they add
> multiple queues (one per CPU). In this way the scheduling decision can
> be greatly simplified and made almost local to each CPU. No hotspots,
> no global locks (well, almost).

the problem is that the *real* scheduling complexity (and needed
functionality) shows when the number of runnable processes is in the
neighborhood of the number of processors. Running many processes just
masks eg. load-balancing deficiencies in schedulers, so obviously the
simplest and most localized scheduler 'wins'. So comparing schedulers
based on many-runnable-processes load is like comparing cars solely based
on the size of their fuel tanks.

> Although some scalability problems are still there (there still is a
> global decision to make), the performance improvement obtained and the
> simplicity of the solution are remarkable.

you can easily make the scheduler fast in the many-processes case by
sacrificing functionality in the much more realistic, few-processes case.
None of the patches I've seen so far maintains the current scheduler's
few-processes logic. But I invite you to improve the current scheduler's
many-processes behavior, without hurting its behavior in the few-processes
case.

Ingo

2001-04-03 12:32:10

by Alan

Subject: Re: a quest for a better scheduler

> Is there any special reason why none of those patches made it into
> the mainstream kernel code?

All of them are worse for the normal case. Also 1500 running Apaches isn't
a remotely useful situation; you are thrashing the cache even if you are no
longer thrashing the scheduler. Use an httpd designed for that situation. Then
you can also downgrade to a cheap Pentium-class box for the task ;)

2001-04-03 19:15:56

by Mike Kravetz

Subject: Re: a quest for a better scheduler

On Tue, Apr 03, 2001 at 10:55:12AM +0200, Ingo Molnar wrote:
>
> you can easily make the scheduler fast in the many-processes case by
> sacrificing functionality in the much more realistic, few-processes case.
> None of the patches I've seen so far maintains the current scheduler's
> few-processes logic. But I invite you to improve the current scheduler's
> many-processes behavior, without hurting its behavior in the few-processes
> case.
>

Maintaining the current scheduler's logic is exactly what we are trying
to do in the projects at:

http://lse.sourceforge.net/scheduling/

A common design goal for the two alternative scheduler
implementations at this site is to maintain the current scheduler's
behavior/scheduling decisions. In the case of the priority queue
scheduler, we have actually used a copy of the existing scheduler
running in parallel (in the same kernel) to determine if we are
making the same scheduling decisions. Currently, in this implementation
we only deviate from the current scheduler in a small number of cases
where tasks get a boost due to having the same memory map.

The multi-queue implementation is more interesting. It is also
designed to maintain the behavior of the current scheduler. However,
as the runqueues get longer (and we start getting contention on the
runqueue locks) it starts to deviate from existing scheduler behavior
and make more local scheduling decisions. Ideally, this implementation
will exhibit the behavior of the current scheduler at low thread
counts and make more localized decisions as pressure on the scheduler
is increased.

Neither of these implementations is at a point where I would advocate
its adoption yet.

Can someone tell me what a good workload/benchmark would be to
examine 'low thread count' performance? In the past people have
used the 'spinning on sched_yield' benchmark. However, this now
makes little sense with the sched_yield optimizations introduced
in 2.4. In addition, such a benchmark mostly ignores the
'reschedule_idle' component of the scheduler. We have developed
a 'token passing' benchmark which attempts to address these issues
(called reflex at the above site). However, I would really like
to get a pointer to a community-accepted workload/benchmark for
these low thread count cases.
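
For reference, the core of such a token-passing test is tiny. Something along
the lines of the sketch below (illustrative only - this is not the actual
reflex code, and the round count is arbitrary):

/* Minimal token-passing microbenchmark in the spirit of lmbench's lat_ctx:
 * two processes bounce a byte over a pipe pair, so every round trip forces
 * (at least) two context switches. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

int main(void)
{
    int p2c[2], c2p[2];
    const int rounds = 100000;
    char token = 'x';

    if (pipe(p2c) || pipe(c2p)) { perror("pipe"); exit(1); }

    if (fork() == 0) {                      /* child: echo the token back */
        for (int i = 0; i < rounds; i++) {
            read(p2c[0], &token, 1);
            write(c2p[1], &token, 1);
        }
        _exit(0);
    }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < rounds; i++) {      /* parent: send and wait */
        write(p2c[1], &token, 1);
        read(c2p[0], &token, 1);
    }
    gettimeofday(&t1, NULL);
    wait(NULL);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("%.2f us per round trip (2 switches + pipe overhead)\n",
           us / rounds);
    return 0;
}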

--
Mike Kravetz [email protected]
IBM Linux Technology Center

2001-04-03 19:50:01

by Ingo Molnar

Subject: Re: a quest for a better scheduler


On Tue, 3 Apr 2001, Mike Kravetz wrote:

> [...] Currently, in this implementation we only deviate from the
> current scheduler in a small number of cases where tasks get a boost
> due to having the same memory map.

thread-thread-affinity pretty much makes it impossible to use a priority
queue. This 'small number of cases' is actually a fundamental issue: can
the 'goodness()' function be an arbitrary function of:

goodness(process_prev,process_next) := f(process_prev,process_next)

, or is the queue design restricting the choice of goodness() functions?
Priority queues, for example, restrict the choice of the goodness() function
to a subset of functions:

goodness(process_prev,process_next) := f(process_next);

this restriction (independence of the priority from the previous process)
is a fundamentally bad property: scheduling is nonlinear and affinity
decisions depend on the previous context. [i had a priority-queue SMP
scheduler running 2-3 years ago but dropped the idea due to this issue.]
IMO it's more important to have a generic and flexible scheduler, and
arbitrary, unnatural restrictions tend to bite us later on.
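
for reference, the 2.4-era goodness() is roughly the sketch below (simplified,
with the realtime and sched_yield cases left out, and the penalty value is
arch-dependent). The two bonus terms are exactly the prev-dependent part that
a pure priority queue cannot precompute:

/* Roughly what the 2.4 goodness() does - see kernel/sched.c for the real
 * thing.  The CPU and mm bonuses depend on the *previous* context
 * (this_cpu, this_mm), i.e. on which task ran last on this CPU. */
struct mm;                       /* opaque here */
struct task {
    int        counter;         /* remaining timeslice */
    int        nice;
    int        processor;
    struct mm *mm;
};

#define PROC_CHANGE_PENALTY 15   /* arch-dependent in the real kernel */

int goodness(struct task *p, int this_cpu, struct mm *this_mm)
{
    int weight = p->counter;
    if (!weight)
        return 0;                          /* timeslice used up */
    if (p->processor == this_cpu)
        weight += PROC_CHANGE_PENALTY;     /* CPU affinity: depends on prev */
    if (p->mm == this_mm)
        weight += 1;                       /* same address space: depends on prev */
    return weight + 20 - p->nice;
}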

one issue regarding multiqueues is the ability of interactive processes to
preempt other, lower priority processes on other CPUs. These sorts of
things tend to happen while using X. In a system where process priorities
are different, scheduling decisions cannot be localized per-CPU
(unfortunately), and 'the right to run' as such is a global thing.

> Can someone tell me what a good workload/benchmark would be to examine
> 'low thread count' performance? [...]

lmbench's lat_ctx for example, and other tools in lmbench trigger various
scheduler workloads as well.

Ingo

2001-04-03 22:44:29

by Mike Kravetz

Subject: Re: a quest for a better scheduler

On Tue, Apr 03, 2001 at 08:47:52PM +0200, Ingo Molnar wrote:
>
> this restriction (independence of the priority from the previous process)
> is a fundamentally bad property: scheduling is nonlinear and affinity
> decisions depend on the previous context. [i had a priority-queue SMP
> scheduler running 2-3 years ago but dropped the idea due to this issue.]
> IMO it's more important to have a generic and flexible scheduler, and
> arbitrary, unnatural restrictions tend to bite us later on.


It seems like we may be talking about two different things.

Our 'priority queue' implementation uses almost the same
goodness function as the current scheduler. The main
difference between our 'priority queue' scheduler and the
current scheduler is the structure of the runqueue. We
break up the single runqueue into a set of priority based
runqueues. The 'goodness' of a task determines what sub-queue
a task is placed in. Tasks with higher goodness values
are placed in higher priority queues than tasks with lower
goodness values. This allows us to limit the scan of runnable
tasks to the highest priority sub-queue, as opposed to the
entire runqueue. When scanning the highest priority sub-queue
we use almost the same goodness function as that which is
used today (it does take CPU affinity into account). In fact,
if we used the same goodness function I'm pretty sure we
would exactly match the behavior of the existing scheduler.
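
As a rough illustration, the structure is something like the sketch below
(my own simplified illustration of the idea, not the code on the site; the
bucket count and the affinity bonus value are arbitrary):

/* Runnable tasks are filed into buckets by a prev-independent part of their
 * goodness; only the highest non-empty bucket is scanned, and the CPU
 * affinity bonus is applied during that scan. */
#include <stddef.h>

#define NR_BUCKETS 64

struct task {
    struct task *next;
    int          counter;
    int          nice;
    int          processor;
};

static struct task *bucket[NR_BUCKETS];

/* prev-independent portion of goodness, used only for bucket placement */
static int base_goodness(struct task *p)
{
    int g = p->counter + 20 - p->nice;
    return g < 0 ? 0 : (g >= NR_BUCKETS ? NR_BUCKETS - 1 : g);
}

void enqueue(struct task *p)
{
    int i = base_goodness(p);
    p->next = bucket[i];
    bucket[i] = p;
}

/* Scan only the highest non-empty bucket, taking CPU affinity into account. */
struct task *pick_next(int this_cpu)
{
    struct task *p, *best = NULL;
    int best_w = -1, i;

    for (i = NR_BUCKETS - 1; i >= 0 && !bucket[i]; i--)
        ;                                   /* find highest non-empty bucket */
    if (i < 0)
        return NULL;
    for (p = bucket[i]; p; p = p->next) {
        int w = base_goodness(p) + (p->processor == this_cpu ? 15 : 0);
        if (w > best_w) { best_w = w; best = p; }
    }
    return best;
}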

Hopefully, Hubertus Franke will speak up and provide more
details, as he is much more familiar with this implementation
than I am.

In any case, I think we have almost reached the conclusion
that our priority queue implementation may not be acceptable
due to the extra overhead at low thread counts.

> one issue regarding multiqueues is the ability of interactive processes to
> preempt other, lower priority processes on other CPUs. These sorts of
> things tend to happen while using X. In a system where process priorities
> are different, scheduling decisions cannot be localized per-CPU
> (unfortunately), and 'the right to run' as such is a global thing.


Agreed. This is why our multi-queue scheduler always attempts
to make global decisions. It will preempt lower priority tasks
on other CPUs. This implementation tends to make 'more
localized' scheduling decisions as contention on the runqueue
locks increases. However, at this point one could argue that
we have moved away from a 'realistic' low task count system load.

> lmbench's lat_ctx for example, and other tools in lmbench trigger various
> scheduler workloads as well.

Thanks, I'll add these to our list.

--
Mike Kravetz [email protected]
IBM Linux Technology Center

2001-04-04 00:15:28

by Fabio Riccardi

Subject: Re: a quest for a better scheduler

Kernel compilation times
========================

three runs of "time make -j2"

--

Linux 2.4.2-ac26 + HP Multiqueue patch

216.34user 14.36system 2:00.03elapsed 192%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (482269major+691020minor)pagefaults 0swaps

216.53user 14.23system 1:57.91elapsed 195%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (482269major+691020minor)pagefaults 0swaps

217.65user 13.46system 1:58.05elapsed 195%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (482269major+691020minor)pagefaults 0swaps

--

Linux 2.4.2-ac26

220.07user 14.88system 2:02.67elapsed 191%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (482269major+691019minor)pagefaults 0swaps

220.31user 14.90system 2:00.64elapsed 194%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (482269major+691018minor)pagefaults 0swaps

220.58user 14.84system 2:00.57elapsed 195%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (482269major+691018minor)pagefaults 0swaps


LMBENCH-2BETA1
==============

First two: Linux 2.4.2-ac26 + HP Multiqueue patch

Last two: Linux 2.4.2-ac26

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
Host OS Mhz null null open selct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ----
skinny Linux 2.4.2-a 997 0.34 0.56 3.96 5.04 27 0.89 2.72 245 1128 4044
skinny Linux 2.4.2-a 997 0.34 0.57 4.19 5.32 25 0.89 2.71 247 1150 4067
skinny Linux 2.4.2-a 997 0.34 0.58 3.90 5.00 25 0.89 2.69 249 1121 3968
skinny Linux 2.4.2-a 997 0.34 0.57 3.90 5.01 25 0.87 2.70 246 1126 4018

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
skinny Linux 2.4.2-a 1.820 4.7700 12 8.0800 109 27 110
skinny Linux 2.4.2-a 1.890 4.7300 20 6.6500 109 27 110
skinny Linux 2.4.2-a 1.620 4.5900 12 6.7000 109 24 109
skinny Linux 2.4.2-a 1.700 4.6400 12 7.0600 109 26 109

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
skinny Linux 2.4.2-a 1.820 7.390 15 16 52 23 91 55
skinny Linux 2.4.2-a 1.890 7.185 14 16 41 23 56 54
skinny Linux 2.4.2-a 1.620 6.793 15 16 40 21 56 54
skinny Linux 2.4.2-a 1.700 6.801 15 16 40 21 55 54

File & VM system latencies in microseconds - smaller is better
--------------------------------------------------------------
Host OS 0K File 10K File Mmap Prot Page
Create Delete Create Delete Latency Fault Fault
--------- ------------- ------ ------ ------ ------ ------- ----- -----
skinny Linux 2.4.2-a 6.2054 0.7192 12 1.6973 451 0.672 2.00000
skinny Linux 2.4.2-a 6.2469 0.7360 12 1.7142 465 0.668 2.00000
skinny Linux 2.4.2-a 3.1992 0.7182 12 1.5857 449 0.680 2.00000
skinny Linux 2.4.2-a 6.1576 0.7119 12 1.5817 448 0.669 2.00000

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
UNIX reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
skinny Linux 2.4.2-a 823 233 162 434 558 271 218 558 282
skinny Linux 2.4.2-a 828 289 243 408 558 269 216 558 281
skinny Linux 2.4.2-a 839 285 249 445 558 222 147 558 219
skinny Linux 2.4.2-a 811 284 242 445 558 222 147 558 219

Memory latencies in nanoseconds - smaller is better
(WARNING - may not be correct, check graphs)
---------------------------------------------------
Host OS Mhz L1 $ L2 $ Main mem Guesses
--------- ------------- ---- ----- ------ -------- -------
skinny Linux 2.4.2-a 997 3.009 7.0230 101
skinny Linux 2.4.2-a 997 3.010 7.0220 101
skinny Linux 2.4.2-a 997 3.010 7.0280 101
skinny Linux 2.4.2-a 997 3.009 7.0290 101


Attachments:
mqpatch242ac26 (6.94 kB)
compile_times (4.76 kB)

2001-04-04 00:34:23

by Alan

Subject: Re: a quest for a better scheduler

> for the "normal case" performance see my other message.

I did - and with a lot of interest

> I agree that a better threading model would surely help in a web server, but to
> me this is not an excuse to put up with a broken scheduler.

The problem has always been - alternative scheduler, crappier performance for
2 tasks running (which is most boxes). If your numbers are right then the
HP patch is working as well for 1 or 2 tasks too

> Unless we want to maintain the position that the only way to achieve good
> performance is to embed server applications in the kernel, some minimal help
> should be provided to well-intentioned user applications :)

Indeed. I'd love to see you beat tux entirely in userspace. It proves the
rest of the API for the kernel is right


2001-04-04 00:31:22

by Fabio Riccardi

Subject: Re: a quest for a better scheduler

Alan,

for the "normal case" performance see my other message.

I agree that a better threading model would surely help in a web server, but to
me this is not an excuse to put up with a broken scheduler.

The X15 server I'm working on now is a sort of user-space TUX: it uses only 8
threads per CPU and it achieves the same level of performance as the
kernel-space TUX. Even in this case the performance advantage of the multiqueue
scheduler is remarkable, especially on a multi-CPU (> 2) architecture.

To achieve some decent disk/CPU/network overlapping with the current Linux
blocking disk IO limitations there is no way to avoid a "bunch of server
threads". I've (experimentally) figured out that 8-16 threads per CPU can
assure some reasonable overlapping, depending on the memory size and disk
subsystem speed. On an 8-way machine this means 64-128 active tasks, a total
disaster with the current scheduler.

Unless we want to maintain the position that the only way to achieve good
performance is to embed server applications in the kernel, some minimal help
should be provided to well-intentioned user applications :)

TIA, ciao,

- Fabio

Alan Cox wrote:

> > Is there any special reason why none of those patches made it into
> > the mainstream kernel code?
>
> All of them are worse for the normal case. Also 1500 running Apaches isn't
> a remotely useful situation; you are thrashing the cache even if you are no
> longer thrashing the scheduler. Use an httpd designed for that situation. Then
> you can also downgrade to a cheap Pentium-class box for the task ;)

2001-04-04 01:14:49

by Fabio Riccardi

Subject: Re: a quest for a better scheduler

Alan Cox wrote:

> > for the "normal case" performance see my other message.
>
> I did - and with a lot of interest

thanks! :)

> > I agree that a better threading model would surely help in a web server, but to
> > me this is not an excuse to put up with a broken scheduler.
>
> The problem has always been - alternative scheduler, crappier performance for
> 2 tasks running (which is most boxes). If your numbers are right then the
> HP patch is working as well for 1 or 2 tasks too

Please verify them if you have a couple of spare hours.

BTW: I measured similar results for the "scalability" patches on a 2.4.1 kernel; it
would be worth the effort to seriously compare them from an architectural point of
view, but I don't have the time right now...

> > Unless we want to maintain the position that the only way to achieve good
> > performance is to embed server applications in the kernel, some minimal help
> > should be provided to well-intentioned user applications :)
>
> Indeed. I'd love to see you beat tux entirely in userspace. It proves the
> rest of the API for the kernel is right

Indeed, I'm using RT sigio/sigwait event scheduling, bare clone threads and
zero-copy I/O.
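
For reference, the RT-signal event loop is conceptually something like the
sketch below (a stripped-down illustration of the 2.4 F_SETSIG/sigwaitinfo
mechanism, not X15 code; the signal number and the handling are arbitrary):

/* Each socket is put in O_ASYNC mode with a queued realtime signal;
 * readiness events are then dequeued synchronously with sigwaitinfo(),
 * so one thread can multiplex many connections without select()/poll(). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

#define IO_SIG (SIGRTMIN + 1)

void arm_socket(int fd)
{
    fcntl(fd, F_SETOWN, getpid());             /* deliver signals to us        */
    fcntl(fd, F_SETSIG, IO_SIG);               /* use a queued RT signal       */
    fcntl(fd, F_SETFL, O_NONBLOCK | O_ASYNC);  /* generate SIGIO-style events  */
}

void event_loop(void)
{
    sigset_t set;
    siginfo_t si;

    sigemptyset(&set);
    sigaddset(&set, IO_SIG);
    sigprocmask(SIG_BLOCK, &set, NULL);        /* let events queue up          */

    for (;;) {
        if (sigwaitinfo(&set, &si) < 0)
            continue;
        /* si.si_fd is the ready descriptor, si.si_band says read/write/hup */
        printf("fd %d ready (band %ld)\n", si.si_fd, (long)si.si_band);
        /* a real server would do the non-blocking read/write here */
    }
}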

If only I had a really asynchronous sendfile, or a smarter madvise that wouldn't
require mapping files :)

My server cannot execute dynamic stuff yet; it relies on Apache for that.

Running X15 and TUX under the same conditions (i.e. dynamic code in Apache) I get
exactly the same score in both cases.

I'm adding a TUX-like dynamic interface; I hope to get it working by next week, and
then I'll make a real comparison.

Regards, ciao,

- Fabio


2001-04-04 01:53:46

by Christopher Smith

Subject: Re: a quest for a better scheduler

--On Tuesday, April 03, 2001 18:17:30 -0700 Fabio Riccardi
<[email protected]> wrote:
> Alan Cox wrote:
> Indeed, I'm using RT sigio/sigwait event scheduling, bare clone threads
> and zero-copy I/O.

Fabio, I'm working on a similar solution, although I'm experimenting with
SGI's KAIO patch to see what it can do. I've had to patch the kernel to
implement POSIX style signal dispatch semantics (so that the thread which
posted an I/O request doesn't have to be the one which catches the signal).
Are you taking a similar approach, or is the lack of this behavior the
reason you are using so many threads?

--Chris

2001-04-04 03:52:09

by Mike Kravetz

Subject: Re: a quest for a better scheduler

On Tue, Apr 03, 2001 at 05:18:03PM -0700, Fabio Riccardi wrote:
>
> I have measured the HP and not the "scalability" patch because the two do more
> or less the same thing and give me the same performance advantages, but the
> former is a lot simpler and I could port it with no effort on any recent
> kernel.

Actually, there is a significant difference between the HP patch and
the one I developed. In the HP patch, if there is a schedulable task
on the 'local' (current CPU) runqueue it will ignore runnable tasks on
other (remote) runqueues. In the multi-queue patch I developed, the
scheduler always attempts to make the same global scheduling decisions
as the current scheduler.

--
Mike Kravetz [email protected]
IBM Linux Technology Center

2001-04-04 04:19:22

by Fabio Riccardi

Subject: Re: a quest for a better scheduler

I was actually suspecting that the extra lines in your patch were there for a
reason :)

A few questions:

What is the real impact of a (slight) change in scheduling semantics?

Under which situations would one notice a difference?

As you state in your papers, the global decision comes with a cost; is it worth it?

Could you make a port of your thing on recent kernels?
I tried and failed, and I don't have enough time to figure out why; that should be
trivial for you, though.

TIA, ciao,

- Fabio

Mike Kravetz wrote:

> On Tue, Apr 03, 2001 at 05:18:03PM -0700, Fabio Riccardi wrote:
> >
> > I have measured the HP and not the "scalability" patch because the two do more
> > or less the same thing and give me the same performance advantages, but the
> > former is a lot simpler and I could port it with no effort on any recent
> > kernel.
>
> Actually, there is a significant difference between the HP patch and
> the one I developed. In the HP patch, if there is a schedulable task
> on the 'local' (current CPU) runqueue it will ignore runnable tasks on
> other (remote) runqueues. In the multi-queue patch I developed, the
> scheduler always attempts to make the same global scheduling decisions
> as the current scheduler.
>
> --
> Mike Kravetz [email protected]
> IBM Linux Technology Center

2001-04-04 07:30:51

by Ingo Molnar

Subject: Re: a quest for a better scheduler


On Tue, 3 Apr 2001, Mike Kravetz wrote:

> Our 'priority queue' implementation uses almost the same goodness
> function as the current scheduler. The main difference between our
> 'priority queue' scheduler and the current scheduler is the structure
> of the runqueue. We break up the single runqueue into a set of
> priority based runqueues. The 'goodness' of a task determines what
> sub-queue a task is placed in. Tasks with higher goodness values are
> placed in higher priority queues than tasks with lower goodness
> values. [...]

we are talking about the same thing, re-read my mail. This design assumes
that 'goodness' is constant in the sense that it does not depend on the
previous process.

and no, you are not using the same goodness() function: you omitted the
prev->mm == next->mm component of goodness(), due to this design
limitation.

Ingo


2001-04-04 07:55:46

by Ingo Molnar

Subject: Re: a quest for a better scheduler


On Tue, 3 Apr 2001, Fabio Riccardi wrote:

> I've spent my afternoon running some benchmarks to see if MQ patches
> would degrade performance in the "normal case".

no doubt a priority-queue scheduler can run almost as fast as the current
scheduler. What I'm worried about is the restriction on the 'priority' of
processes: it cannot depend on previous processes (and other 'current state')
anymore.

so we have two separate issues:

#1: priority-queue: has the fundamental goodness() design limitation.

#2: per-CPU-runqueues: changes semantics, makes scheduler less
effective due to nonglobal decisions.

about #1: right now the prev->mm rule appears to be a tiny issue (it
might not affect performance significantly), but forbidding it by
hardcoding the assumption into data structures is a limitation on *future*
goodness() functions. Eg. with the possible emergence of CPU-level
threading and other new multiprocessing technologies, this could be a
*big* mistake.

the Linux scheduler is not designed for the case of 1000 runnable
processes.

Ingo

2001-04-04 12:53:42

by Ingo Molnar

Subject: Re: a quest for a better scheduler


On Tue, 3 Apr 2001, Fabio Riccardi wrote:

> I agree that a better threading model would surely help in a web
> server, but to me this is not an excuse to put up with a broken
> scheduler.

believe me, there are many other parts of the kernel that are not
optimized for the nutcase. In this case it's the scheduler that punishes
you first. Do not shoot the messenger.

> The X15 server I'm working on now is a sort of user-space TUX, it uses
> only 8 threads per CPU and it achieves the same level of performance
> of the kernel space TUX. Even in this case the performance advantage
> of the multiqueue scheduler is remarkable, especially on a multi-CPU
> (> 2) architecture.

then your design is still bad. TUX has no problems with the scheduler, and
TUX doesn't have many runnable processes at once. And this has nothing to
do with TUX being within the kernel.

> To achieve some decent disk/CPU/network overlapping with the current
> Linux blocking disk IO limitations there is no way to avoid a "bunch
> of server threads". [...]

false.

> [...] I've (experimentally) figured out that 8-16 threads per CPU can
> assure some reasonable overlapping, depending on the memory size and
> disk subsystem speed. On an 8-way machine this means 64-128 active
> tasks, [...]

if they are doing IO *only* then there won't be 64-128 active tasks. This
is how TUX does async IO: helper threads do the IO and *only* the IO. This
way cache locality is maximized. (the scheduler is irrelevant in this
case.) Again you are blaming the scheduler - the poor scheduler is simply
the first kernel subsystem that tells you: your design still sucks. (if
you have 64-128 active tasks then you already pay a very big cache
footprint price as well.) [I'm wondering how your solution was just as
fast as TUX, although you claim that the scheduler was already killing
things. Perhaps your test was done over localhost or was saturating
network bandwidth?]
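
in userspace the same helper-thread pattern would look roughly like the
sketch below (an illustrative sketch only - TUX does this inside the kernel,
and all names here are invented):

/* A small pool of helper threads does nothing but blocking disk reads,
 * feeding a single non-blocking network event loop.  Only the helpers ever
 * sleep in disk I/O, so the number of runnable tasks stays near the number
 * of CPUs. */
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

struct io_req {
    int    fd;                           /* file to read from          */
    off_t  off;
    size_t len;
    char  *buf;
    void (*done)(struct io_req *);       /* completion callback        */
    struct io_req *next;
};

static struct io_req  *queue;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

void submit_io(struct io_req *r)         /* called from the event loop */
{
    pthread_mutex_lock(&lock);
    r->next = queue;
    queue = r;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

static void *io_helper(void *arg)        /* one or two per CPU, roughly */
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!queue)
            pthread_cond_wait(&cond, &lock);
        struct io_req *r = queue;
        queue = r->next;
        pthread_mutex_unlock(&lock);

        pread(r->fd, r->buf, r->len, r->off);  /* the only blocking call */
        r->done(r);                            /* hand the result back   */
    }
    return NULL;
}

void start_helpers(int n)
{
    pthread_t t;
    for (int i = 0; i < n; i++)
        pthread_create(&t, NULL, io_helper, NULL);
}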

> Unless we want to maintain the position that the only way to achieve
> good performance is to embed server applications in the kernel, [...]

FYI, TUX related functionality is constantly being pushed out and made
accessible to userspace as well. Witness zerocopy, IRQ-affinity, and tons
of generic patches such as pagecache-scalability and timer-scalability.
(and soon the remaining, TUX-private improvements too.) TUX is intended to
be a testbed for kernel subsystems, where performance optimizations and
APIs can be tested without having to export them to userspace. (exporting
to userspace cannot be done lightly and makes things less flexible.)

Ingo

2001-04-04 12:59:52

by Ingo Molnar

Subject: Re: a quest for a better scheduler


On Wed, 4 Apr 2001, Alan Cox wrote:

> The problem has always been - alternative scheduler, crappier
> performance for 2 tasks running (which is most boxes). [...]

it's not only the 2-task case, but also less flexibility or lost
semantics.

> Indeed. I'd love to see you beat tux entirely in userspace. It proves
> the rest of the API for the kernel is right

well, until the cost of entry into the kernel is eliminated, this is not
possible - unless there are performance bugs in TUX :-)

but yes, getting a userspace solution that gets 'close enough' in eg.
SPECweb99 benchmarks (which is complex enough to be trusted as a generic
performance metric) would be a nice thing to have. There are existing
SIGIO-based, multithreaded solutions (eg. phttpd), with varying success.

Ingo

2001-04-04 16:11:33

by Davide Libenzi

Subject: Re: a quest for a better scheduler


On 04-Apr-2001 Ingo Molnar wrote:
>
> On Tue, 3 Apr 2001, Fabio Riccardi wrote:
>
>> I've spent my afternoon running some benchmarks to see if MQ patches
>> would degrade performance in the "normal case".
>
> no doubt a priority-queue scheduler can run almost as fast as the current
> scheduler. What I'm worried about is the restriction on the 'priority' of
> processes: it cannot depend on previous processes (and other 'current state')
> anymore.
>
> so we have two separate issues:
>
> #1: priority-queue: has the fundamental goodness() design limitation.
>
> #2: per-CPU-runqueues: changes semantics, makes scheduler less
> effective due to nonglobal decisions.
>
> about #1: right now the prev->mm rule appears to be a tiny issue (it
> might not affect performance significantly), but forbidding it by
> hardcoding the assumption into data structures is a limitation on *future*
> goodness() functions. Eg. with the possible emergence of CPU-level
> threading and other new multiprocessing technologies, this could be a
> *big* mistake.

This is not correct, Ingo. I haven't seen the HP code, but if you store processes
in slots S:

S = FS( goodness(p, p->processor, p->mm) )

and you start scanning from the highest slots, then as soon as you find a task
with a goodness G' that is equal to the max goodness in the slot you can choose
that process to run.
Again, if you haven't found such a goodness during the slot scan but you've
found a task with a goodness G':

G' >= SG - DD

where :

SG = max slot goodness
DD = SG(i) - SG(i - 1)

you can select that task as the next to run.
This was the behaviour that was implemented in my scheduler patch about 2 years
ago.
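
In code, the scan I am describing would be something like the sketch below (a
rough illustration of what I mean, not that old patch; the slot count, slot
width and affinity bonus are arbitrary):

/* Tasks are filed into slots by goodness; scan from the highest slot down
 * and accept "good enough" candidates: any task within one slot width (DD)
 * of the max slot goodness will do. */
#include <stddef.h>

#define NR_SLOTS 32
#define DD       4                 /* goodness span covered by one slot   */

struct task {
    struct task *next;
    int          counter;
    int          nice;
    int          processor;
};

static struct task *slot[NR_SLOTS];

static int goodness(struct task *p, int this_cpu)
{
    int g = p->counter + 20 - p->nice;
    if (p->processor == this_cpu)
        g += 15;                   /* affinity bonus, as in the 2.4 scheduler */
    return g;
}

struct task *slot_scan(int this_cpu)
{
    struct task *p, *fallback = NULL;
    int i;

    for (i = NR_SLOTS - 1; i >= 0; i--) {
        if (!slot[i])
            continue;
        int sg = (i + 1) * DD;     /* max goodness representable by slot i */
        for (p = slot[i]; p; p = p->next) {
            int g = goodness(p, this_cpu);
            if (g >= sg)
                return p;          /* as good as this slot can get: take it */
            if (g >= sg - DD && !fallback)
                fallback = p;      /* within one slot width: acceptable     */
        }
        if (fallback)
            return fallback;       /* nothing perfect in the top slot       */
    }
    return NULL;
}
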
Besides this, I think that with such loads we have more serious problems to face
inside the kernel.



- Davide

2001-04-04 17:29:56

by Mike Kravetz

Subject: Re: a quest for a better scheduler

On Tue, Apr 03, 2001 at 09:21:57PM -0700, Fabio Riccardi wrote:
> I was actually suspecting that the extra lines in your patch were there for a
> reason :)
>
> A few questions:
>
> What is the real impact of a (slight) change in scheduling semantics?
>
> Under which situations would one notice a difference?

I assume your questions are directed at the difference between local
and global scheduling decisions. As with most things the impact of
these differences depends on workload. Most multi-queue scheduler
implementations make local scheduling decisions and depend on load
balancing for system wide fairness. Schedulers which make global
decisions handle load balancing via their global policy.

The HP multi-queue implementation you are using performs no load
balancing. Therefore, in a worst case situation you could have
low priority tasks sharing one CPU while high priority tasks are
sharing another. To be fair, I have talked to the HP people and
this scheduler was never intended to be a general purpose solution.
Therefore, it makes little sense to comment on its merits as such.

> As you state in your papers, the global decision comes with a cost;
> is it worth it?

My primary motivation for attempting to perform the same global
decisions as the current scheduler was so that meaningful comparisons
could be made. Also, by using the same global decision policy I
was able to avoid the issue of load balancing. In most multi-queue
implementations, load balancing algorithms take considerable effort
to get working in a reasonably well-performing manner.

>
> Could you make a port of your thing on recent kernels?

There is a 2.4.2 patch on the web page. I'll put out a 2.4.3 patch
as soon as I get some time.

--
Mike Kravetz [email protected]
IBM Linux Technology Center