2002-06-20 16:37:41

by Sandy Harris

Subject: McVoy's Clusters (was Re: latest linus-2.5 BK broken)

[ I removed half a dozen cc's on this, and am just sending to the
list. Do people actually want the cc's?]

Larry McVoy wrote:

> > Checkpointing buys three things. The ability to preempt jobs, the
> > ability to migrate processes,

For large multi-processor systems, it isn't clear that those matter
much. On single user systems I've tried, ps -ax | wc -l usually
gives some number 50 < n < 100. For a multi-user general purpose
system, my guess would be something under 50 system processes plus
50 per user. So for a dozen to 20 users on a departmental server,
under 1000. A server for a big application, like a database or web
server, would have fewer users and more threads, but still only a
few hundred or, at most, say 2000.

So at something like 8 CPUs in a personal workstation and 128 or
256 for a server, things average out to 8 processes per CPU, and
it is not clear that process migration or any form of pre-emption
beyond the usual kernel scheduling is needed.

What combination of resources and loads do you think preemption
and migration are needed for?

> > and the ability to recover from failed nodes, (assuming the
> > failed hardware didn't corrupt your job's checkpoint).

That matters, but it isn't entirely clear that it needs to be done
in the kernel. Things like databases and journalling filesystems
already have their own mechanisms and it is not remarkably onerous
to put them into applications where required.
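
To make that concrete, here is a minimal sketch of the sort of
application-level checkpoint I mean; the file name, state layout and
checkpoint interval are all invented for illustration:

/* Application-level checkpoint sketch: the program's whole state is
 * assumed to be one array plus a loop counter. */
#include <stdio.h>
#include <stdlib.h>

#define N 1000000
static double state[N];

static void save_checkpoint(long step)
{
    FILE *f = fopen("job.ckpt.tmp", "wb");

    if (!f)
        return;
    fwrite(&step, sizeof(step), 1, f);
    fwrite(state, sizeof(double), N, f);
    fclose(f);
    rename("job.ckpt.tmp", "job.ckpt");    /* atomic replace */
}

static long load_checkpoint(void)
{
    long step = 0;
    FILE *f = fopen("job.ckpt", "rb");

    if (f) {
        if (fread(&step, sizeof(step), 1, f) != 1)
            step = 0;
        if (fread(state, sizeof(double), N, f) != N)
            step = 0;
        fclose(f);
    }
    return step;
}

int main(void)
{
    long step = load_checkpoint();    /* resume if a checkpoint exists */

    for (; step < 100000; step++) {
        /* ... one unit of real computation on state[] ... */
        if (step % 1000 == 0)
            save_checkpoint(step);
    }
    save_checkpoint(step);
    return 0;
}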

[big snip]

> Larry McVoy's SMP Clusters
>
> Discussion on November 8, 2001
>
> Larry McVoy, Ted T'so, and Paul McKenney
>
> What is SMP Clusters?
>
> > SMP Clusters is a method of partitioning an SMP (symmetric
> multiprocessing) machine's CPUs, memory, and I/O devices
> so that multiple "OSlets" run on this machine. Each OSlet
> owns and controls its partition. A given partition is
> expected to contain from 4-8 CPUs, its share of memory,
> and its share of I/O devices. A machine large enough to
> have SMP Clusters profitably applied is expected to have
> enough of the standard I/O adapters (e.g., ethernet,
> SCSI, FC, etc.) so that each OSlet would have at least
> one of each.

I'm not sure whose definition this is:
supercomputer: a device for converting compute-bound problems
into I/O-bound problems
but I suspect it is at least partially correct, and Beowulfs are
sometimes just devices to convert them to network-bound problems.

For a network-bound task like web serving, I can see a large
payoff in having each OSlet doing its own I/O.

However, in general I fail to see why each OSlet should have
independent resources rather than something like using one to
run a shared file system and another to handle the networking
for everybody.
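
To make the question concrete, here is a rough sketch of what an OSlet's
partition description might amount to; this is only my reading of the
summary above, with every name and number invented, not anything from
Larry's actual design:

/* One entry per OSlet: the CPUs, memory and devices it owns outright. */
struct oslet {
    int           first_cpu;    /* 4-8 CPUs per OSlet */
    int           ncpus;
    unsigned long mem_start;    /* its share of physical memory */
    unsigned long mem_size;
    const char   *net_dev;      /* its own NIC ... */
    const char   *scsi_host;    /* ... and its own disk controller */
};

static struct oslet partitions[] = {
    { 0, 4, 0x00000000UL, 0x10000000UL, "eth0", "scsi0" },
    { 4, 4, 0x10000000UL, 0x10000000UL, "eth1", "scsi1" },
    /* ... */
};

The question is whether every entry really needs its own net_dev and
scsi_host, or whether most of them could defer to one OSlet running a
shared filesystem and another doing the networking for everybody.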


2002-06-20 17:10:36

by William Lee Irwin III

Subject: Re: McVoy's Clusters (was Re: latest linus-2.5 BK broken)

On Thu, Jun 20, 2002 at 11:41:45AM -0400, Sandy Harris wrote:
> For large multi-processor systems, it isn't clear that those matter
> much. On single user systems I've tried, ps -ax | wc -l usually
> gives some number 50 < n < 100. For a multi-user general purpose
> system, my guess would be something under 50 system processes plus
> 50 per user. So for a dozen to 20 users on a departmental server,
> under 1000. A server for a big application, like a database or web
> server, would have fewer users and more threads, but still only a
> few hundred or, at most, say 2000.

Certain unnameable databases like to have 2K processes at minimum and
see task counts soar even higher under significant loads.

Also, the scholastic departmental servers I've seen in action generally
host 300+ users, with something less than 50 tasks per logged-in user and
something more than 50 for the baseline. For the school-wide one I used,
hosting 10K+ (40K+?) accounts, generally only between 500 and 2500 users
(where the non-rare maximum was around 1500) are logged in simultaneously,
and the task/user count was more like 5-10, with a number of them (most?)
riding at 2 or 3 (shell + MUA, or shell + 2 tasks for rlogin to elsewhere).
The uncertainty with respect to the number of accounts is due to no
userlists being visible.

I can try to contact some of the users or administrators if better
numbers are needed, though it may not work as I've long since graduated.

Cheers,
Bill

2002-06-20 17:23:41

by Jesse Pollard

Subject: Re: McVoy's Clusters (was Re: latest linus-2.5 BK broken)

Sandy Harris <[email protected]>:
>
> [ I removed half a dozen cc's on this, and am just sending to the
> list. Do people actually want the cc's?]
>
> Larry McVoy wrote:
>
> > > Checkpointing buys three things. The ability to preempt jobs, the
> > > ability to migrate processes,
>
> For large multi-processor systems, it isn't clear that those matter
> much. On single user systems I've tried, ps -ax | wc -l usually
> gives some number 50 < n < 100. For a multi-user general purpose
> system, my guess would be something under 50 system processes plus
> 50 per user. So for a dozen to 20 users on a departmental server,
> under 1000. A server for a big application, like a database or web
> server, would have fewer users and more threads, but still only a
> few hundred or, at most, say 2000.

You don't use compute servers much? The problems we are currently running
require the cluster (an IBM SP) to have 100% uptime for a single job. That
job may run for several days. If a detected problem is reported (not yet
catastrophic), it is desired/demanded to checkpoint the user's process.

Currently, we can't - but should be able to by this fall.

Having the user's job checkpoint midway through its computations will allow us
to remove a node from active service, substitute a different node, and
resume the user's process without losing many hours of computation (we have
a maximum of 300 nodes for computation, and another 30 for I/O and front end).

Just because a network interface fails is no reason to lose the job.

> So at something like 8 CPUs in a personal workstation and 128 or
> 256 for a server, things average out to 8 processes per CPU, and
> it is not clear that process migration or any form of pre-emption
> beyond the usual kernel scheduling is needed.
>
> What combination of resources and loads do you think preemption
> and migration are needed for?

It depends on the job. A web server farm shouldn't need them. A distributed
compute cluster needs them to:

a. be able to suspend large (256-300 nodes), long running (4-8 hours),
low priority jobs, to favor high priority production jobs (which may
also be relatively long running: say 2-4 hours on 256 nodes);
b. be able to replace/substitute nodes (switch processing away from a failing
node to allow for on-line replacement of the failing node or to wait for
spare parts).

> > > and the ability to recover from failed nodes, (assuming the
> > > failed hardware didn't corrupt your job's checkpoint).
>
> That matters, but it isn't entirely clear that it needs to be done
> in the kernel. Things like databases and journalling filesystems
> already have their own mechanisms and it is not remarkably onerous
> to put them into applications where required.

Which is why I realized you don't use compute clusters very often.

1. User jobs, written in Fortran/C/other, do not usually come with the ability
to take snapshots of the computation.
2. There is the problem of redirecting network connections (MPI/PVM) from one
place to another.
3. (related to 2) Synchronized process suspension is difficult-to-impossible
to do outside the kernel.
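
To illustrate points 1 and 3, here is a sketch of what coordinated
checkpointing looks like when it has to be done inside the application;
everything except the MPI_* calls is invented for illustration:

/* Each rank quiesces at a barrier before writing its own state, so no
 * messages are in flight when the snapshot is taken.  On restart the
 * ranks may come up on different nodes, as long as the checkpoint
 * files live on a shared filesystem. */
#include <mpi.h>
#include <stdio.h>

static double local_state[1000];

static void write_rank_checkpoint(int rank, long step)
{
    char name[64];
    FILE *f;

    snprintf(name, sizeof(name), "ckpt.step%ld.rank%d", step, rank);
    f = fopen(name, "wb");
    if (f) {
        fwrite(local_state, sizeof(local_state), 1, f);
        fclose(f);
    }
}

int main(int argc, char **argv)
{
    int rank;
    long step;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (step = 0; step < 10000; step++) {
        /* ... exchange halos, compute on local_state ... */

        if (step % 500 == 0) {
            /* The application itself has to arrange the "synchronized
             * suspension": every rank must reach this point with its
             * communication drained before anyone writes state. */
            MPI_Barrier(MPI_COMM_WORLD);
            write_rank_checkpoint(rank, step);
            MPI_Barrier(MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}

None of this helps with point 2: if the job is restarted elsewhere, the
underlying MPI/PVM connections still have to be rebuilt somehow.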

> [big snip]
>
> > Larry McVoy's SMP Clusters
> >
> > Discussion on November 8, 2001
> >
> > Larry McVoy, Ted T'so, and Paul McKenney
> >
> > What is SMP Clusters?
> >
> > SMP Clusters is a method of partitioning an SMP (symmetric
> > multiprocessing) machine's CPUs, memory, and I/O devices
> > so that multiple "OSlets" run on this machine. Each OSlet
> > owns and controls its partition. A given partition is
> > expected to contain from 4-8 CPUs, its share of memory,
> > and its share of I/O devices. A machine large enough to
> > have SMP Clusters profitably applied is expected to have
> > enough of the standard I/O adapters (e.g., ethernet,
> > SCSI, FC, etc.) so that each OSlet would have at least
> > one of each.
>
> I'm not sure whose definition this is:
> supercomputer: a device for converting compute-bound problems
> into I/O-bound problems
> but I suspect it is at least partially correct, and Beowulfs are
> sometimes just devices to convert them to network-bound problems.
>
> For a network-bound task like web serving, I can see a large
> payoff in having each OSlet doing its own I/O.
>
> However, in general I fail to see why each OSlet should have
> independent resources rather than something like using one to
> run a shared file system and another to handle the networking
> for everybody.

How about reliability, or security isolation (an accounting server isolated
from a web server or audit server... or both)?

See Sun's use of "domains" in Solaris, which does this in a single host.

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]

Any opinions expressed are solely my own.

2002-06-20 17:43:27

by Nick LeRoy

Subject: Re: McVoy's Clusters (was Re: latest linus-2.5 BK broken)

On Thursday 20 June 2002 12:23 pm, Jesse Pollard wrote:
<snip>
> You don't use compute servers much? The problems we are currently running
> require the cluster (an IBM SP) to have 100% uptime for a single job. That
> job may run for several days. If a detected problem is reported (not yet
> catastrophic), it is desired/demanded to checkpoint the user's process.
>
> Currently, we can't - but should be able to by this fall.
>
> Having the user's job checkpoint midway through its computations will allow us
> to remove a node from active service, substitute a different node, and
> resume the user's process without losing many hours of computation (we have
> a maximum of 300 nodes for computation, and another 30 for I/O and front end).

Have you tried Condor? Condor is a "high throughput computing" package,
specifically targeted at such applications, with the ability to checkpoint &
migrate jobs, etc. Condor is free as in beer, but currently not as in speech
(sorry), and is developed by the University of Wisconsin.
http://www.condorproject.org is the URL to learn more. Version 6.4.0 is in
the process of being released and should be available within the next couple
of days.

Condor runs on Linux (x86 & Alpha), Solaris, IRIX, HPUX, Digital Unix, and
NT, although the NT version usually lags the Unix releases.

-Nick
Academic Staff at UW on the Condor Team

2002-06-20 18:32:25

by Jesse Pollard

Subject: Re: McVoy's Clusters (was Re: latest linus-2.5 BK broken)

Nick LeRoy <[email protected]>:
>
> On Thursday 20 June 2002 12:23 pm, Jesse Pollard wrote:
> <snip>
> > You don't use compute servers much? The problems we are currently running
> > require the cluster (an IBM SP) to have 100% uptime for a single job. That
> > job may run for several days. If a detected problem is reported (not yet
> > catastrophic), it is desired/demanded to checkpoint the user's process.
> >
> > Currently, we can't - but should be able to by this fall.
> >
> > Having the user's job checkpoint midway through its computations will allow us
> > to remove a node from active service, substitute a different node, and
> > resume the user's process without losing many hours of computation (we have
> > a maximum of 300 nodes for computation, and another 30 for I/O and front end).
>
> Have you tried Condor? Condor is a "high throughput computing" package,
> specifically targeted at such applications, with the ability to checkpoint &
> migrate jobs, etc. Condor is free as in beer, but currently not as in speech
> (sorry), and is developed by the University of Wisconsin.
> http://www.condorproject.org is the URL to learn more. Version 6.4.0 is in
> the process of being released and should be available within the next couple
> of days.
>
> Condor runs on Linux (x86 & Alpha), Solaris, IRIX, HPUX, Digital Unix, and
> NT, although the NT version usually lags the Unix releases.

Condor is designed for a relatively low performance network (10-100Mbit),
not for things like an IBM SP switch, which can carry Gbit data. It would
also need to be available on SP-3/4 and Cray SV systems (not that we have
problems with checkpointing there). Also note:

Cannot use IPC (pipes, shared memory), which also leaves out PVM/MPI
Jobs cannot use threads
Jobs cannot use fork()

In many of our cases, the jobs are split across many nodes, then spread
across multiple processors in a single node (SP 3 has 4 cpus per node,
SP 4 will have 8-32). The current scientific library uses PVM/MPI to determine
whether it is using shared memory or node/node RPC.

Tightly integrated models wouldn't work well with Condor (disclaimer:
based on a quick look by me, and I don't work on the current jobs).

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]

Any opinions expressed are solely my own.

2002-06-20 20:44:19

by Timothy D. Witham

Subject: Re: McVoy's Clusters (was Re: latest linus-2.5 BK broken)

Another point is that I've seen large multi-user machines that roll
a 32 bit pid in less than half an hour. So not only is it a large
number of processes but also a very dynamic process environment.

Tim

On Thu, 2002-06-20 at 10:10, William Lee Irwin III wrote:
> On Thu, Jun 20, 2002 at 11:41:45AM -0400, Sandy Harris wrote:
> > For large multi-processor systems, it isn't clear that those matter
> > much. On single user systems I've tried, ps -ax | wc -l usually
> > gives some number 50 < n < 100. For a multi-user general purpose
> > system, my guess would be something under 50 system processes plus
> > 50 per user. So for a dozen to 20 users on a departmental server,
> > under 1000. A server for a big application, like a database or web
> > server, would have fewer users and more threads, but still only a
> > few hundred or, at most, say 2000.
>
> Certain unnameable databases like to have 2K processes at minimum and
> see task counts soar even higher under significant loads.
>
> Also, the scholastic departmental servers I've seen in action generally
> host 300+ users, with something less than 50 tasks per logged-in user and
> something more than 50 for the baseline. For the school-wide one I used,
> hosting 10K+ (40K+?) accounts, generally only between 500 and 2500 users
> (where the non-rare maximum was around 1500) are logged in simultaneously,
> and the task/user count was more like 5-10, with a number of them (most?)
> riding at 2 or 3 (shell + MUA, or shell + 2 tasks for rlogin to elsewhere).
> The uncertainty with respect to the number of accounts is due to no
> userlists being visible.
>
> I can try to contact some of the users or administrators if better
> numbers are needed, though it may not work as I've long since graduated.
>
> Cheers,
> Bill
--
Timothy D. Witham - Lab Director - [email protected]
Open Source Development Lab Inc - A non-profit corporation
15275 SW Koll Parkway - Suite H - Beaverton OR, 97006
(503)-626-2455 x11 (office) (503)-702-2871 (cell)
(503)-626-2436 (fax)

2002-06-21 05:26:43

by Eric W. Biederman

Subject: Re: McVoy's Clusters (was Re: latest linus-2.5 BK broken)

Sandy Harris <[email protected]> writes:

> [ I removed half a dozen cc's on this, and am just sending to the
> list. Do people actually want the cc's?]
>
> Larry McVoy wrote:
>
> > > Checkpointing buys three things. The ability to preempt jobs, the
> > > ability to migrate processes,


> For large multi-processor systems, it isn't clear that those matter
> much.

On the systems that are built because there is no single machine that
can run your compute-intensive application fast enough, they matter
quite a bit.

> What combination of resources and loads do you think preemption
> and migration are needed for?

Good answers have already been given.
The problem domain I am looking at is compute clusters. The
solutions are useful elsewhere, but in compute clusters they are
extremely valuable.

> > > and the ability to recover from failed nodes, (assuming the
> > > failed hardware didn't corrupt your job's checkpoint).
>
> That matters, but it isn't entirely clear that it needs to be done
> in the kernel.

I agree; glibc would be fine, but it must be below the level of
the application. Generally it is a pretty onerous task to checkpoint
a random program. For a proof, attempt to checkpoint your X desktop;
the infrastructure is there to do it.

Every application must be capable of checkpointing itself for the cluster
batch scheduler to take advantage of it.

Example case.
[Preemption]
You start job1, a compute-intensive application that runs for 4 days
on 100 cpus. Your job is low priority.

In comes job2, a high priority job that runs for 4 hours and needs 256
cpus.

job1 is preempted. With checkpoint support it can be saved and
restarted later. Without checkpointing support it is simply killed.

[Migration]
Migration is needed for failing hardware or to get low priority jobs
out of the way onto less capable nodes that are going unused.

Or to restart a job that failed on other hardware.
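
To make the scheduler side of that concrete, here is a rough sketch of
the preemption decision, with all the job bookkeeping and the
checkpoint_and_suspend() hook invented for illustration (this is not
any real batch system's code):

#include <stdio.h>

struct job {
    const char *name;
    int cpus;
    int priority;    /* higher number = higher priority */
    int running;
};

static int free_cpus = 156;              /* assumed cluster state */
static struct job running_jobs[] = {
    { "job1", 100, 1, 1 },               /* low priority, 4-day run */
};
static const int nrunning = sizeof(running_jobs) / sizeof(running_jobs[0]);

/* With checkpoint support this saves the job's state; without it the
 * only option at this point would be to kill the job outright. */
static void checkpoint_and_suspend(struct job *j)
{
    printf("checkpointing %s, releasing %d cpus\n", j->name, j->cpus);
    j->running = 0;
    free_cpus += j->cpus;
}

static int try_to_start(struct job *incoming)
{
    int i;

    /* Preempt lower-priority work until the new job fits. */
    for (i = 0; i < nrunning && free_cpus < incoming->cpus; i++)
        if (running_jobs[i].running &&
            running_jobs[i].priority < incoming->priority)
            checkpoint_and_suspend(&running_jobs[i]);

    if (free_cpus < incoming->cpus)
        return -1;                       /* still doesn't fit */
    free_cpus -= incoming->cpus;
    printf("starting %s on %d cpus\n", incoming->name, incoming->cpus);
    return 0;
}

int main(void)
{
    struct job job2 = { "job2", 256, 10, 0 };   /* high priority, 4 hours */

    try_to_start(&job2);    /* job1 is checkpointed, not killed */
    return 0;
}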

Eric

2002-06-23 00:48:15

by kaih

Subject: Re: McVoy's Clusters (was Re: latest linus-2.5 BK broken)

[email protected] (Sandy Harris) wrote on 20.06.02 in <[email protected]>:

> For large multi-processor systems, it isn't clear that those matter
> much. On single user systems I've tried, ps -ax | wc -l usually
> gives some number 50 < n < 100. For a multi-user general purpose

156 here right now, and I'd call that a light load. On a

processor : 0
vendor_id : AuthenticAMD
cpu family : 5
model : 8
model name : AMD-K6(tm) 3D processor
stepping : 12
cpu MHz : 350.818

with 768 MB - not the fastest machine around.


MfG Kai