2001-04-04 14:03:36

by Hubertus Franke

Subject: Re: a quest for a better scheduler


I grant you that the code is not as clean as the current scheduler,
so maybe you missed that part.

For the priority scheduler:
Yes, task_to_qid assumes NON-affinity (no cpu, no mm) to determine
the list index to which the task has to be enqueued.
However, if you wander down to the search_range section of the scheduler,
you see that we do add the "+1" for equal mm. All I did here was take
the goodness() function apart for search_range in order to insert some
extra code that speeds up the search.
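For concreteness, here is a rough user-space sketch of that split. The names (task_to_qid, search_goodness) and the exact weights are my own illustrative approximations of the scheme, not the actual LSE patch: the sub-queue index uses only the affinity-free part of goodness(), while the "+1 for equal mm" bonus is added during the search.

```c
#include <assert.h>

/* Illustrative sketch only -- names and weights approximate the scheme
 * described above, not the actual LSE patch. */

struct task {
    int counter;   /* remaining timeslice ticks */
    int nice_adj;  /* static priority part (roughly 20 - nice in 2.4) */
    int mm_id;     /* stand-in for the task's mm pointer */
};

/* Affinity-free goodness: constant per task, so it is safe to use as a
 * stable sub-queue index at enqueue time. */
int task_to_qid(const struct task *t, int nqueues)
{
    int g = t->counter + t->nice_adj;
    if (g < 0)
        g = 0;
    if (g >= nqueues)
        g = nqueues - 1;
    return g;
}

/* Value compared while scanning candidates in search_range(): the mm
 * affinity bonus depends on the previously running task, so it cannot
 * influence the fixed queue index above. */
int search_goodness(const struct task *t, int prev_mm_id)
{
    int g = t->counter + t->nice_adj;
    if (t->mm_id == prev_mm_id)
        g += 1;   /* the "+1" for equal mm */
    return g;
}
```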

Again, I don't want to lean too hard on the priority scheduler, because
we only did this to have a reference point regarding lock contention and
lock hold time. This is stuff that many operating systems did years ago,
most of which have moved on to add MQ (multi-queue) on top of that. So that
is what the LSE scheduling team is pushing.

I understand the dilemma that the Linux scheduler is in, namely satisfy
the low end at all cost. But I believe that in order for Linux to push
into the high end we need to address the scalability of the current
scheduler.
Simply handwaving and declaring that lots of running tasks is a stupid
idea doesn't get us there.
If indeed you assume that #running-tasks ~ #cpus, then each task
should be allotted a reasonable amount of work and any small overhead
incurred should be amortizable. However, if, as we contend,
#running-tasks >> #cpus is a situation we need to deal with, then living
with the current lock contention can really drag us down.
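As a toy illustration of the contention argument (plain pthread mutexes in user space, standing in for the kernel's runqueue spinlock; names are hypothetical): with a single runqueue every CPU fights over one lock, while an MQ design splits the traffic across per-queue locks.

```c
#include <assert.h>
#include <pthread.h>

#define NR_QUEUES 4   /* stand-in for the number of CPUs */

struct runqueue {
    pthread_mutex_t lock;
    int nr_running;
};

struct runqueue global_rq;
struct runqueue per_cpu_rq[NR_QUEUES];

void init_queues(void)
{
    pthread_mutex_init(&global_rq.lock, NULL);
    global_rq.nr_running = 0;
    for (int i = 0; i < NR_QUEUES; i++) {
        pthread_mutex_init(&per_cpu_rq[i].lock, NULL);
        per_cpu_rq[i].nr_running = 0;
    }
}

/* Single runqueue: every CPU serializes on the one global lock, so with
 * #running-tasks >> #cpus the lock is hammered from all sides. */
void enqueue_single(void)
{
    pthread_mutex_lock(&global_rq.lock);
    global_rq.nr_running++;
    pthread_mutex_unlock(&global_rq.lock);
}

/* Multi-queue: only CPUs mapped to the same sub-queue contend. */
void enqueue_mq(int cpu)
{
    struct runqueue *rq = &per_cpu_rq[cpu % NR_QUEUES];
    pthread_mutex_lock(&rq->lock);
    rq->nr_running++;
    pthread_mutex_unlock(&rq->lock);
}
```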


Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability)

email: [email protected]
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003



Ingo Molnar <[email protected]> on 04/04/2001 02:28:31 AM

Please respond to <[email protected]>

Sent by: <[email protected]>


To: Mike Kravetz <[email protected]>
cc: Hubertus Franke/Watson/IBM@IBMUS, Fabio Riccardi
<[email protected]>, Linux Kernel List
<[email protected]>
Subject: Re: a quest for a better scheduler




On Tue, 3 Apr 2001, Mike Kravetz wrote:

> Our 'priority queue' implementation uses almost the same goodness
> function as the current scheduler. The main difference between our
> 'priority queue' scheduler and the current scheduler is the structure
> of the runqueue. We break up the single runqueue into a set of
> priority based runqueues. The 'goodness' of a task determines what
> sub-queue a task is placed in. Tasks with higher goodness values are
> placed in higher priority queues than tasks with lower goodness
> values. [...]

we are talking about the same thing, re-read my mail. this design assumes
that 'goodness' is constant in the sense that it does not depend on the
previous process.

and no, you are not using the same goodness() function: you omitted the
prev->mm == next->mm component of goodness(), due to this design
limitation.

Ingo






2001-04-04 14:26:48

by Ingo Molnar

Subject: Re: a quest for a better scheduler


On Wed, 4 Apr 2001, Hubertus Franke wrote:

> I understand the dilemma that the Linux scheduler is in, namely
> satisfy the low end at all cost. [...]

nope. The goal is to satisfy runnable processes in the range of NR_CPUS.
You are playing word games by suggesting that the current behavior prefers
'low end'. 'thousands of runnable processes' is not 'high end' at all,
it's 'broken end'. Thousands of runnable processes are the sign of a
broken application design, and 'fixing' the scheduler to perform better in
that case is just fixing the symptom. [changing the scheduler to perform
better in such situations is possible too, but all solutions proposed so
far had strings attached.]

Ingo

2001-04-04 15:14:29

by Andrea Arcangeli

Subject: Re: a quest for a better scheduler

On Wed, Apr 04, 2001 at 10:03:10AM -0400, Hubertus Franke wrote:
> I understand the dilemma that the Linux scheduler is in, namely satisfy
> the low end at all cost. [..]

We can satisfy the low end by making the numa scheduler a compile-time option
(that's what I did in my patch, at least).

Andrea

2001-04-04 15:50:51

by Khalid Aziz

Subject: Re: a quest for a better scheduler

Andrea Arcangeli wrote:
>
> On Wed, Apr 04, 2001 at 10:03:10AM -0400, Hubertus Franke wrote:
> > I understand the dilemma that the Linux scheduler is in, namely satisfy
> > the low end at all cost. [..]
>
> We can satisfy the low end by making the numa scheduler at compile time (that's
> what I did in my patch at least).
>
> Andrea

I fully agree with this approach. It would be very hard to design a
scheduler that performs equally well on a UP machine running a couple of
processes and on a NUMA machine. These two cases represent the two ends of
the spectrum. The two schedulers should be separate, IMO, and one of them
should be selected at compile time.

--
Khalid

====================================================================
Khalid Aziz Linux Development Laboratory
(970)898-9214 Hewlett-Packard
[email protected] Fort Collins, CO

2001-04-04 22:18:17

by Tim Wright

Subject: Re: a quest for a better scheduler

On Wed, Apr 04, 2001 at 03:23:34PM +0200, Ingo Molnar wrote:
>
> On Wed, 4 Apr 2001, Hubertus Franke wrote:
>
> > I understand the dilemma that the Linux scheduler is in, namely
> > satisfy the low end at all cost. [...]
>
> nope. The goal is to satisfy runnable processes in the range of NR_CPUS.
> You are playing word games by suggesting that the current behavior prefers
> 'low end'. 'thousands of runnable processes' is not 'high end' at all,
> it's 'broken end'. Thousands of runnable processes are the sign of a
> broken application design, and 'fixing' the scheduler to perform better in
> that case is just fixing the symptom. [changing the scheduler to perform
> better in such situations is possible too, but all solutions proposed so
> far had strings attached.]
>


Ingo, you continue to assert this without giving much evidence to back it up.
All the world is not a web server. If I'm running a large OLTP database with
thousands of clients, it's not at all unreasonable to expect periods where
several hundred (forget the thousands) want to be serviced by the database
engine. That sounds like hundreds of schedulable entities, be they processes
or threads or whatever. This sort of load is regularly run on machines with
16-64 CPUs.

Now I will admit that it is conceivable that you can design an application that
finds out how many CPUs are available, creates threads to match that number,
and tries to divvy up the work between them using some combination of polling
and asynchronous I/O, etc. There are, however, a number of problems with this
approach:
1) It assumes that this is the only workload on the machine. If not, it quickly
becomes sub-optimal due to interactions between the workloads. This is a
problem that the kernel scheduler does not suffer from.
2) It requires *every* application designer to design an effective scheduler
into their application. I would submit that an effective scheduler is better
situated in the operating system.
3) It is not a familiar programming paradigm to many Unix/Linux/POSIX
programmers, so you have a sort of impedance mismatch going on.
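For what it's worth, that "threads == CPUs" design looks roughly like this in POSIX C. This is a sketch only (the function name is hypothetical, and the polling/async-I/O work distribution is elided), not a recommendation:

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

/* Sketch of the pattern discussed above: discover the online CPU count
 * at startup and create exactly that many workers, so the number of
 * runnable threads never exceeds the number of CPUs. */

static void *worker(void *arg)
{
    /* ... pull work items from a shared queue here ... */
    return arg;
}

int start_cpu_bound_pool(void)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncpus < 1)
        ncpus = 1;   /* fall back if the count is unavailable */

    pthread_t tids[ncpus];
    for (long i = 0; i < ncpus; i++)
        pthread_create(&tids[i], NULL, worker, (void *)i);
    for (long i = 0; i < ncpus; i++)
        pthread_join(tids[i], NULL);
    return (int)ncpus;
}
```

As the numbered points note, this pushes the scheduling problem into every application; the kernel-side alternative keeps it in one place.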

Since the proposed scheduler changes being talked about here have been shown
not to hurt the "low end" and to dramatically improve the "high end", I fail
to understand the hostility to the changes. I can understand that you do not
feel that this is the correct way to architect an application, but if the
changes don't hurt you, why sabotage changes that also allow a different
method to work? There isn't one true way to do anything in computing.

Tim

--
Tim Wright - [email protected] or [email protected] or [email protected]
IBM Linux Technology Center, Beaverton, Oregon
Interested in Linux scalability ? Look at http://lse.sourceforge.net/
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI

2001-04-04 22:58:25

by Christopher Smith

Subject: Re: a quest for a better scheduler

--On Wednesday, April 04, 2001 15:16:32 -0700 Tim Wright <[email protected]>
wrote:
> On Wed, Apr 04, 2001 at 03:23:34PM +0200, Ingo Molnar wrote:
>> nope. The goal is to satisfy runnable processes in the range of NR_CPUS.
>> You are playing word games by suggesting that the current behavior
>> prefers 'low end'. 'thousands of runnable processes' is not 'high end'
>> at all, it's 'broken end'. Thousands of runnable processes are the sign
>> of a broken application design, and 'fixing' the scheduler to perform
>> better in that case is just fixing the symptom. [changing the scheduler
>> to perform better in such situations is possible too, but all solutions
>> proposed so far had strings attached.]
>
> Ingo, you continue to assert this without giving much evidence to back it
> up. All the world is not a web server. If I'm running a large OLTP
> database with thousands of clients, it's not at all unreasonable to
> expect periods where several hundred (forget the thousands) want to be
> serviced by the database engine. That sounds like hundreds of schedulable
> entities be they processes or threads or whatever. This sort of load is
> regularly run on machine with 16-64 CPUs.

Actually, it's not just OLTP; anytime you are doing time-sharing among
hundreds of users (something POSIX systems are supposed to be good at), this
will happen.

> Now I will admit that it is conceivable that you can design an
> application that finds out how many CPUs are available, creates threads
> to match that number and tries to divvy up the work between them using
> some combination of polling and asynchronous I/O etc. There are, however
> a number of problems with this approach:

Actually, one way to semi-support this approach is to implement
many-to-many threads as per the Solaris approach. This also requires
significant hacking of both the kernel and the runtime, and certainly is
significantly more error prone than trying to write a flexible scheduler.

One problem you didn't highlight, which even the above approach does not
happily address, is that for security reasons you may very well need each
user's requests to take place in a different process. If you don't, then you
have to implement a very well tested and secure user-level security mechanism
to ensure things like privacy (above and beyond the time-sharing).

The world is filled with a wide variety of types of applications, and
unless you know two programming approaches are functionally equivalent (and
event-driven/polling I/O vs. tons of running processes are NOT), you
shouldn't say one approach is "broken". You could say it's a "broken"
approach to building web servers. Unfortunately, things like kernels and
standard libraries should work well in the general case.

--Chris

2001-04-05 22:40:15

by Timothy D. Witham

Subject: Re: a quest for a better scheduler


I have been following this thread, thinking that everybody has some truth in
what they are saying, but in the absence of a repeatable test environment there
really isn't a way of arriving at a data-driven decision.

Given the following conditions:

1) The diversity of the problem sets that developers are working on results in a
real concern that another developer's solution to their performance issue
might result in a worsening of my performance situation.

2) Most developers have a given set of tests that they use when they
make changes, to determine if their change did what they wanted.

3) The Open Source Development Lab has the facilities to provide a large-scale
testing environment on several different configurations that remain the same
over time.

I propose that we work on setting up a straightforward test harness that allows
developers to quickly test a kernel patch against various performance yardsticks.
If we use the same set of hardware for the generation of the performance data
from patch to patch, there will be a correlation between the runs, allowing for
a real comparison of the different changes.

The following should be taken only as a starting point.

As for the base hardware configurations I propose:
2 way with 1 GB main memory and 2 IDE drives
4 way with 4 GB main memory and 16 disk drives
8 way with 8 GB main memory and 24 disk drives

The types of applications that I have heard people express concern about are:

Web infrastructure:
Apache
TUX
Jabber

Corporate infrastructure:
NFS
raw TCP/IP performance
Samba

Database performance:
Raw storage I/O performance
OLTP workload

General usage:
compile speed (usually measured by kernel compile)

The OSDL also has the ability to make some of the "for-fee" benchmarks
available for use in testing that is done onsite (network access to OSDL
machines counts as onsite) so that people don't have to purchase individual
copies. Things like SPECMAIL2001, SPECSFS2.0 and SPECWEB99 come to mind.

Comments, suggestions, volunteers?

--
Timothy D. Witham - Lab Director - [email protected]
Open Source Development Lab Inc - A non-profit corporation
15275 SW Koll Parkway - Suite H - Beaverton OR, 97006
(503)-626-2455 x11 (office) (503)-702-2871 (cell)
(503)-626-2455 (fax)

On Wed, Apr 04, 2001 at 03:54:54PM -0700, Christopher Smith wrote:
> --On Wednesday, April 04, 2001 15:16:32 -0700 Tim Wright <[email protected]>
> wrote:
> > On Wed, Apr 04, 2001 at 03:23:34PM +0200, Ingo Molnar wrote:
> >> nope. The goal is to satisfy runnable processes in the range of NR_CPUS.
> >> You are playing word games by suggesting that the current behavior
> >> prefers 'low end'. 'thousands of runnable processes' is not 'high end'
> >> at all, it's 'broken end'. Thousands of runnable processes are the sign
> >> of a broken application design, and 'fixing' the scheduler to perform
> >> better in that case is just fixing the symptom. [changing the scheduler
> >> to perform better in such situations is possible too, but all solutions
> >> proposed so far had strings attached.]
> >
> > Ingo, you continue to assert this without giving much evidence to back it
> > up. All the world is not a web server. If I'm running a large OLTP
> > database with thousands of clients, it's not at all unreasonable to
> > expect periods where several hundred (forget the thousands) want to be
> > serviced by the database engine. That sounds like hundreds of schedulable
> > entities be they processes or threads or whatever. This sort of load is
> > regularly run on machine with 16-64 CPUs.
>
> Actually, it's not just OLTP, anytime you are doing time sharing between
> hundreds of users (something POSIX systems are supposed to be good at) this
> will happen.
>
> > Now I will admit that it is conceivable that you can design an
> > application that finds out how many CPUs are available, creates threads
> > to match that number and tries to divvy up the work between them using
> > some combination of polling and asynchronous I/O etc. There are, however
> > a number of problems with this approach:
>
> Actually, one way to semi-support this approach is to implement
> many-to-many threads as per the Solaris approach. This also requires
> significant hacking of both the kernel and the runtime, and certainly is
> significantly more error prone than trying to write a flexible scheduler.
>
> One problem you didn't highlight that even the above case does not happily
> identify is that for security reasons you may very well need each user's
> requests to take place in a different process. If you don't, then you have
> to implement a very well tested and secure user-level security mechanism to
> ensure things like privacy (above and beyond the time-sharing).
>
> The world is filled with a wide variety of types of applications, and
> unless you know two programming approaches are functionally equivalent (and
> event driven/polling I/O vs. tons of running processes are NOT), you
> shouldn't say one approach is "broken". You could say it's a "broken"
> approach to building web servers. Unfortunately, things like kernels and
> standard libraries should work well in the general case.
>
> --Chris


2001-04-06 03:31:06

by Christopher Smith

Subject: Re: a quest for a better scheduler

--On Thursday, April 05, 2001 15:38:41 -0700 "Timothy D. Witham"
<[email protected]> wrote:
> Database performance:
> Raw storage I/O performance
> OLTP workload
You probably want to add an OLAP scenario as well.

--Chris

2001-04-06 18:07:45

by Timothy D. Witham

Subject: Re: a quest for a better scheduler

Timothy D. Witham wrote:
[...]
> I propose that we work on setting up a straight forward test harness
> that allows developers to quickly test a kernel patch against
> various performance yardsticks.

[...
(proposed large server testbeds)
...]


OK, so I have received some feedback on my proposal to provide a
reference set of machines so that any kernel modifications could be
checked across a range of machines and a range of tests. It was
pointed out that there are lots of smaller servers out there and
they should be part of any test plan. There was also some concern
that a 4-way server didn't add any value in a test lineup, but I
have to think that with the number of 4-ways out there, they should
be included.

One additional piece of feedback was that any comprehensive
characterization plan should include desktops, tablet devices, and
older machines, along with the performance tests that address those
configurations' usage models; I agree that this is something that
needs to be done. But as for providing hardware for that effort, the
OSDL is not the group to do it. Hopefully somebody with an
interest in these configurations will step forward to do that
portion of the job.

So the server hardware configurations have evolved to look like
the following.

1 way, 512 MB, 2 IDE
2 way, 1 GB, 10 SCSI (1 SCSI channel)
4 way, 4 GB, 20 SCSI (2 channels)
8 way, 8 GB, 40 SCSI (4 channels) maybe Fibre Channel (FC)
16 way, 16 GB, 80 FC (8 channels)

The types of server applications that I have heard people express concern about are:

Web infrastructure:
Apache (SPECWEB99???)
TUX (SPECWEB99???)
Jabber (have their own)

Corporate infrastructure:
NFS (SPECSFS2.0???)
raw TCP/IP performance
Samba (Have their own)
email (SPECMAIL2001???)

Database performance:
Raw storage I/O performance (various)
OLTP workload (something like TPC-C???)
OLAP workload

General usage:
compile speed (usually measured by kernel compile)


Further comments? I will start contacting folks who have expressed
interest.
--
Timothy D. Witham - Lab Director - [email protected]
Open Source Development Lab Inc - A non-profit corporation
15275 SW Koll Parkway - Suite H - Beaverton OR, 97006
(503)-626-2455 x11 (office) (503)-702-2871 (cell)
(503)-626-2455 (fax)

2001-04-06 21:14:11

by Michael Peddemors

Subject: Re: a quest for a better scheduler

You're missing an important one: our VPNs routinely run on 16 MB of RAM, no HD
or swap, loaded from an initrd on a floppy.

Don't we need to test on minimalistic machines as well? :)

> So the server hardware configurations have evolved to look like
> the following.
>
> 1 way, 512 MB, 2 IDE
> 2 way, 1 GB, 10 SCSI (1 SCSI channel)
> 4 way, 4 GB, 20 SCSI (2 channels)
> 8 way, 8 GB, 40 SCSI (4 channels) maybe Fibre Channel (FC)
> 16 way, 16 GB, 80 FC (8 channels)
>

--
"Catch the Magic of Linux..."
--------------------------------------------------------
Michael Peddemors - Senior Consultant
LinuxAdministration - Internet Services
NetworkServices - Programming - Security
WizardInternet Services http://www.wizard.ca
Linux Support Specialist - http://www.linuxmagic.com
--------------------------------------------------------
(604)589-0037 Beautiful British Columbia, Canada

2001-04-06 22:34:12

by Nathan Straz

Subject: Re: a quest for a better scheduler

On Fri, Apr 06, 2001 at 11:06:03AM -0700, Timothy D. Witham wrote:
> Timothy D. Witham wrote :
> > I propose that we work on setting up a straight forward test harness
> > that allows developers to quickly test a kernel patch against
> > various performance yardsticks.

The Linux Test Project would like to help out here. At the least, we
would like to add the scripts, wrappers, and configuration files used to
create the test systems to LTP. Making test systems available is
definitely a great step forward. Showing people how to build similar
test systems is another step forward.

Let us know what parts you need and we'll see what we can come up with.

> Further comments? I will start contacting folks who have expressed
> interest.

If anyone has loose programs that they use to test the scheduler, please
submit them to the Linux Test Project. Chances are other people will
also find them useful and add functionality. It doesn't have to be a
formal test program or use the test libraries that we use. We can take
care of that when we add it to the CVS tree.

While we probably aren't going to package up full applications for
testing purposes, we could definitely keep track of useful configuration
files and scripts that people find useful for testing.

--
Nate Straz [email protected]
sgi, inc http://www.sgi.com/
Linux Test Project http://oss.sgi.com/projects/ltp/